Wan proxy - deduplication

Juli Mallett jmallett at freebsd.org
Thu Oct 16 20:07:03 PDT 2014


On Thu, Oct 16, 2014 at 12:34 PM, Charles Roy <charroyit at gmail.com> wrote:

> Hi,
>
> Given 10 bytes is being used to identify 2048 bytes wouldn't the 10 byte
> chuck identifiers have to overlap at some point?
>
> Please help me understand how this works or what happens
>

Charles,

If a full 10 bytes were being spent on chunks, that'd be 2^80 possible
values, at which point I'd say that worrying about collisions/reuse was
exceedingly-silly :)

As it happens, 64 bits are used for identifiers, i.e. 8 bytes or 2^64
possible values.  Here again, an ideal scheme would likely never overlap.
However, we're not using an ideal scheme, for reasons of speed and to
reduce the amount of data we need to send on the wire (the identifier can
be inferred entirely from the data -- it is derived from a hash of the data
-- so we can just say "learn the next 2048 bytes" and not "learn the next
2048 bytes and name them Zod" or whatever.)

Collisions do happen, although not with a very high frequency.  When a
collision happens, we simply slide the window ahead by a single byte until
we find an offset at which we can start a block without a collision.

Actually, we scan through all possible blocks, looking at byte 0 to byte
2047, byte 1 to byte 2048, etc., on up to byte 2047 to byte 4094.  If any
of them are known, we use the first hit we find.  Otherwise, we scan
through the end, and use the first chunk which did not collide (most often
the first chunk.)  In the worst case, if every possible chunk would
collide, we send it without deduplicating and move on to the next chunk.

Since we allow mixing compression with deduplication, even this worst case
doesn't end up being very bad.  One could construct really bad cases, but
they don't happen in practice as far as I've seen.

It's a great question, so I elected to CC the mailing list, which I think
would benefit from seeing this and likely others in the future will wonder
also.

Thanks,
Juli.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.wanproxy.org/pipermail/wanproxy-wanproxy.org/attachments/20141016/2a8af794/attachment.htm>


More information about the wanproxy mailing list