Crytpograph-Hash-Algorithm-Secured Mirroring

Fedora/CentOS/Ubuntu Mirrors Lost

An unfortunate drive failure has cost us our test instances and all archive history for Fedora, CentOS, and Ubuntu. I am partially to blame in this (bad luck is to blame for the other part). When ordering and setting up the drives, I ordered only three drives, and put them in JBOD. I should have scrounged around for a fourth drive to make it a RAID system (then we'd at least still have our data).

In March, I purchased several 1TB hard drives for use in the CHASM project (using my research stipend from the Rensselaer Center for Open Source). Since then, my backup workstation (currently a headless server) has been syncing Fedora and Ubuntu every six hours and making a snapshot of the results. Our goal is to have an accurate profile of the nature of the changes in each mirror archive so that we may decrease latency better.

I'm looking into RMA'ing the drive as quickly as possible. In all my years of purchasing hard drives, this is only the second Western Digital drive I've had to RMA (the first was damaged in shipping). I'm hoping the folks at either Newegg or Western Digital can help with a fast turn around so that we can be back in business.

I'd like to just add that Western Digital makes damn fine products, and it's only because of their products that this work has been possible so far. I personally own seven WD1001FALS (three are used for CHASM), two WD20EADS, two WD6400AAKS, and two WD800 drives. It would not be a stretch to say that I'm a WD fan.


Rsync is Wasteful

When two nodes synchronize via rsync, there is a significant amount of overhead.

Re-computation of Deltas

Consider when comparing files based upon time stamps. On each synchronization, directory entry's timestamps are compared to determine which files to exchange. When doing a synchronization between two hosts without any knowledge other than the root paths on each end, this cost is unavoidable.

In a mirroring network (like those used in Fedora, Debian, etc.), a small number of files are changed or added on each update, but updates occur frequently. If the master server were to pre-calculate the changes, then every downstream client would be able to know which files to request before connecting to an upstream server.

If all clients are querying a tracker that can advertise new changes, then this the reduced latency in each link will lead to overall reduced latency as well. Consider a network with three tiers of clients. The tracker can advertise changes to tier 1. When all tier 1 clients have the updates, it can begin to advertise the updates to tier 2 clients. At each update rollout, downstream nodes can efficiently transfer data from upstream nodes without ever having to compare the contents of the archive on each end of the connection (a simple bitfield of files will suffice).

Lack of a Checksum Cache

Fedora's sample rsync configuration explicitly turns off the checksum feature of rsync. The rationale behind this (please note that I agree with this rationale) is that it prevents remote users from triggering a checksum operation over the entire tree. This operation requires every file on the tree to be read from disk and run through the checksum algorithm used.

Thus, in the interest of saving I/O subsystems mirrors disable checksum support. Without checksum support, rsync falls back to using timestamps. For security reasons (I'll explain in a later post), timestamps are less than adequate for detecting changes.

Reliance Upon Picking a Fast Upstream

With rsync, the choice of upstream mirror is critical to maintaining an keeping up-to-date with updates. A mirror will never see updates before its upstream does. In the worst case, the leaf nodes in the network will always be d (the depth of the tree) synchronizations behind (assuming new content is pulled at each sync). If there is a critical update that must be pushed immediately, a O(6d) latency (assuming four synchronizations per day) is not acceptable.

What CHASM Will Change

In this section I will provide a brief overview of how CHASM addresses each of the above issues.

CHASM's master node generates a manifest as part of the update process. Each manifest describes the absolute state of the archive in terms of directories, symbolic links, and files. Each directory or file object has an associated traditional unix file mode (ugo-rwx). Files are specified by a cryptographic hash (where the CH in CHASM comes from).

A client simply has to read the manifest and compare it to local data on the system. Combine this with a local pool of files identified by their hash value (like Git's .git/objects/ directory), and CHASM is able to quickly identify which files in a manifest must be transferred. Even a mirror that is months out of date could easily avoid re-transferring any data that is still valid.

The solution to the re-computation of deltas adequately provides a checksum cache, and solves the issue of hard-links. As each file is identified by its checksum, every duplicate of the file will be replaced with a hard-link as a matter of course.

Mirrors can query for updates quite frequently without significant penalty. A full manifest of the Fedora project takes up less than 100MB without compression. Compression reduces this amount (more on this in a later post). Creating a diff between manifests reduces the size further.

The Fedora infrastructure already includes support for detecting closest-mirrors using BGP. We'd like to leverage this to allow clients to dynamically pull updates from close neighbors so as to minimize bandwidth costs and hopefully maximize throughput.

Looking to the Future

CHASM is still a work in progress. Everything I wrote was in the above document was written in present tense for clarity. Many of the features exist as ideas, and are not reflected in the code.

If you're looking to help, please consider submitting to our CHASM Nightly Dashboard (nightly build instructions).

Update: We're also looking for students and support in the Fedora Summer of Code (CHASM)



Nightly Build Instructions

CHASM has a dashboard that tracks the current status of the repository. There are three kinds of builds:

  • Experimental --- Builds done by developers on working copies of the repository. This is the target build type for experimental branches (usually dev/*).
  • Continuous --- Builds done for each commit to the repository. We currently do not have one set up.
  • Nightly --- Tests done each night. These are usually done by cron jobs on build machines. This post is about how to set one up.

To set up a nightly build server, the following requirements are needed on the machine (I'm using Fedora's package names, your distro may differ):

  • nss-devel (we use 3.12)
  • boost-devel (we use 1.39)
  • cmake (we use 2.6.4)
  • git (we use 1.6.6)

If your distribution ships other versions of the above packages, please do not change to these versions. Our Dashboard is the best method we have of judging portability of CHASM.

In our repository there's the file chasm-nightly. Download this file and change the variables to fit your system better. They are as follows:

  • CHASM_NIGHTLY_SOURCE_DIR --- The directory where you would like the checkout of CHASM to reside.
  • CHASM_NIGHTLY_BINARY_DIR --- The directory where you would like the build of CHASM to reside.
  • CHASM_SITE --- The name of this build system as the dashboard sees it.
  • CHASM_BUILD_NAME --- A description of the build system. Something like OSTYPE-VERSION-ARCH-COMPILER-COMPILER_VERSION would be most helpful. For example, a Fedora 13 x86_64 machine would be something like Fedora-13-x86_64-g++-4.4.3.
  • CHASM_BUILD_TYPE --- Debug is best since it generates debugging symbols for later use.
  • CHASM_BUILD_OPTIONS --- Build options needed for your machine. This probably doesn't need changing. There is a line in there for the convenience of those running a Fedora x86_64 install where LIB_SUFFIX must be set.
  • CHASM_TEST_MEMCHECK --- Run valgrind over the unit tests. This helps us catch memory leaks in CHASM.
  • CHASM_TEST_COVERAGE --- Run the tests through gcov to get an idea for how well our tests are covering our code.

With this file, set up a cron job to run this script at 00:15 UTC. It will automatically grab the nightly tag (done at 23:55 UTC every night), build it, and submit the results to the dashboard.

If there are any issues with this setup, please report bugs and I will attend to them.

We are looking for machines of a wide variety to test CHASM. Currently it is just my desktop running Fedora 12 x86_64 running tests every night. For a up-to-date listing of servers that we have seen submissions from, please refer to our wiki page for it.


The CHASM Blog

So, after focusing on other things (like finishing Firmant 0.1.0, and the UPE programming competition), I've finally gotten around to putting up this development blog.

CHASM has been an idea I've had since August, 2009 that started with unpleasant experiences mirroring with rsync. Unpleasant meaning that I often would have to remove temporary files left by other mirrors, and was always unsure whether the state of my mirror was consistent with upstream.

CHASM stands for the Cryptographic-Hash-Algorithm-Secured Mirroring solution, and provides the following improvements to the current alternatives:

  • Uses SHA2 to uniquely identify files: Each file is stored by SHA2 and hard-linked into place on the filesystem.
  • Cache-aware transfer protocol: When an upstream and a downstream node synchronize, they take care not to invalidate the filesystem cache of the upstream in order to minimize disk writes.
  • Reduced latency: A trivial update will allow a node to compute what is out of date without any additional network traffic.

This project is generously funded by the Rensselaer Center for Open Source and its donor, Mr. Sean O'Sullivan.

CHASM has been accepted into the Google Summer of Code program. We're still looking for additional students who may wish to work with us as GSoC Participants. More information on this can be found on the Wiki.

Now that we have a blogging solution that doesn't offend our sensibilities (Firmant), we'll be updating this blog often.


Copyright © 2010 Robert Escriva ¦ Powered by Firmant