RSYNC algorithm

Juli Mallett juli at clockworksquid.com
Sun Dec 27 12:51:23 PST 2009


Hi Ravi,

On Sun, Dec 27, 2009 at 12:32, Ravi Chunduru <ravi.is.chunduru at gmail.com> wrote:
> I am trying to understand the code.  Correct me if I am wrong.
>
> I did not find any MD5 or MD4 hash calculation.
> One of the ways the RSYNC algorithm ensures that data really is
> duplicated is by rechecking with a stronger checksum, MD4 or MD5,
> after matching on the rolling checksum (weak checksum).  I did not
> find this additional check anywhere in the code.  I understand that
> the encoder takes care of this issue by comparing the new and stored
> block data.  What I am not sure of is how the decoder ensures that it
> is not picking up the wrong data (matching the weak checksum)?  From
> my limited understanding of the code, it appears that the encoder
> sends the weak checksum to the decoder whenever there is a match on
> the encoder.

Great question!  There are a couple of things involved with what
you're asking, so I will answer the easy (i.e. working) case first and
move on to the more complex case and why it will be solved a different
way -- although I haven't done the work yet.

Checking a cryptographic or message digest hash is a fairly expensive
operation (it would have to be computed even though it is not needed
here) and still does not guarantee against collisions.
Consider this code:

%%%
                        /*
                         * Now attempt to encode this hash as a reference if it
                         * has been defined before.
                         */
                        BufferSegment *oseg = cache_->lookup(hash);
                        if (oseg != NULL) {
                                /*
                                 * This segment already exists.  If it's
                                 * identical to this chunk of data, then that's
                                 * positively fantastic.
                                 */
                                if (encode_reference(output, &outq,
                                    start, hash, oseg)) {
                                        oseg->unref();

                                        o = 0;

                                        /*
                                         * We have output any data before
                                         * this hash in escaped form, so
                                         * any candidate hash before it is
                                         * invalid now.
                                         */
                                        have_candidate = false;
                                        continue;
                                }

                                /*
                                 * This hash isn't usable because it collides
                                 * with another, so keep looking for something
                                 * viable.
                                 */
                                oseg->unref();
                                DEBUG(log_) << "Collision in first pass.";
                                continue;
                        }
%%%

If encode_reference() returns false, then we infer that there was a
hash collision and move on to trying the next hash.  If you look at
the beginning of encode_reference():

%%%
        uint8_t data[XCODEC_SEGMENT_LENGTH];
        input->copyout(data, offset, sizeof data);

        if (!oseg->match(data, sizeof data))
                return (false);
%%%

The BufferSegment::match() method is the one to note here.  There is a
little indirection in buffer.h, but it basically reduces to:

%%%
                return (memcmp(data(), buf, len) == 0);
%%%

So you see that instead of calculating and comparing a hash, we do a
full compare of the bytes to see if there is a collision before using
a hash.
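To make that concrete, here is a minimal, self-contained sketch of the
same pattern.  The names (weak_hash, can_encode_reference, a plain map
as the cache) are hypothetical stand-ins for the real cache_ and
BufferSegment machinery, and the toy byte-sum hash stands in for the
real rolling checksum; the point is only the structure: look up by the
weak hash, then confirm with a full byte comparison before emitting a
reference.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real cache_ and BufferSegment types.
using Segment = std::vector<uint8_t>;
static std::unordered_map<uint64_t, Segment> cache;

// A toy weak hash (sum of bytes) standing in for the rolling checksum.
// Note that it collides for any permutation of the same bytes.
static uint64_t weak_hash(const uint8_t *p, size_t len)
{
        uint64_t h = 0;
        for (size_t i = 0; i < len; i++)
                h += p[i];
        return (h);
}

// Returns true only if a segment exists under `hash' AND its bytes are
// identical to `data'; the memcmp is what guards against collisions,
// just like BufferSegment::match() in encode_reference().
static bool can_encode_reference(uint64_t hash, const uint8_t *data,
    size_t len)
{
        auto it = cache.find(hash);
        if (it == cache.end())
                return (false);
        const Segment &seg = it->second;
        return (seg.size() == len && memcmp(seg.data(), data, len) == 0);
}
```

With the toy hash, "abcd" and "dcba" produce the same weak hash, so a
lookup for "dcba" finds the "abcd" segment, but the byte comparison
rejects it and no bogus reference is emitted.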

There is, of course, the problem that a reference to a hash may be
sent across the network to a system which has different data stored by
that hash.  A strong checksum isn't a good solution here, though,
because it is so slow and still has a problem with collisions --
however infrequent.  If we think about it, there is only one way this
can happen:

The encoder that is encoding the data must not have been the one that
caused that data to be stored under that hash.  The protocol works
around this a little by detecting when the decoder is told to learn a
hash: the decoder checks for a collision byte-by-byte.  But that is
imperfect, and right now the decoder can't do anything but reset the
connection.  I'd like for it to be able to record the remote encoder's
wrong idea about what that hash refers to, but that's unnecessary if we
solve the root problem.  Note that the problem is the same whether the
other encoder was on the same system, which has since restarted, or is
another active system on the network.
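The decoder-side check described above can be sketched like this.  The
names (decoder_learn, a plain map as the decoder's cache) are
hypothetical; the shape is what matters: when told to learn a hash, the
decoder refuses the definition if it already holds different bytes
under that hash, which is where the connection would be reset today.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the decoder's segment cache.
using Segment = std::vector<uint8_t>;
static std::unordered_map<uint64_t, Segment> decoder_cache;

// Returns false on a collision: the decoder already knows different
// data under this hash and cannot safely accept the new definition.
// In the current protocol that failure would reset the connection.
static bool decoder_learn(uint64_t hash, const Segment &seg)
{
        auto it = decoder_cache.find(hash);
        if (it != decoder_cache.end())
                return (it->second == seg);     /* byte-by-byte check */
        decoder_cache[hash] = seg;
        return (true);
}
```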

The right solution, I think, is to generate a UUID at startup and to
concatenate that with the hash when we store it in a dictionary, and
to exchange UUIDs when connections are created.  This way we never
send a reference to a hash that the remote side might have a different
meaning for than the one we told it.  Does that make sense?
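A rough sketch of that idea, with hypothetical names throughout (learn,
resolve, a string standing in for the UUID type, a plain map as the
dictionary): each hash is stored under the UUID of the encoder that
defined it, so a reference from one peer can never resolve to data that
a different encoder happened to store under the same hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: a string stands in for a real UUID, and the
// dictionary key concatenates the defining encoder's UUID with the hash.
using UUID = std::string;
using Key = std::pair<UUID, uint64_t>;
using Segment = std::vector<uint8_t>;

static std::map<Key, Segment> dictionary;

// Store data under (encoder UUID, hash); UUIDs would be exchanged when
// the connection is created.
static void learn(const UUID &who, uint64_t hash, const Segment &seg)
{
        dictionary[Key(who, hash)] = seg;
}

// Resolve a reference only within the namespace of the encoder that
// sent it; a hash defined by a different encoder is simply not found.
static const Segment *resolve(const UUID &who, uint64_t hash)
{
        auto it = dictionary.find(Key(who, hash));
        if (it == dictionary.end())
                return (NULL);
        return (&it->second);
}
```

With this scheme two encoders can store different data under the same
hash value without ever confusing a decoder, because the UUID keeps
their dictionaries disjoint.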

Thanks,
Juli.



More information about the wanproxy mailing list