<div dir="ltr"><div>On Thu, Oct 16, 2014 at 12:34 PM, Charles Roy <span dir="ltr"><<a href="mailto:charroyit@gmail.com" target="_blank">charroyit@gmail.com</a>></span> wrote:<br></div><div class="gmail_extra"><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">Hi,<div><br></div><div>Given 10 bytes is being used to identify 2048 bytes wouldn't the 10 byte chuck identifiers have to overlap at some point? </div><div><br></div><div>Please help me understand how this works or what happens</div></div></blockquote><div><br></div>Charles,<div><br></div><div>If a full 10 bytes were being spent on chunks, that'd be 2^80 possible values, at which point I'd say that worrying about collisions/reuse was exceedingly-silly :)</div><div><br></div><div>As it happens, 64 bits are used for identifiers, i.e. 8 bytes or 2^64 possible values.  Here again, an ideal scheme would likely never overlap.  However, we're not using an ideal scheme, for reasons of speed and to reduce the amount of data we need to send on the wire (the identifier can be inferred entirely from the data -- it is derived from a hash of the data -- so we can just say "learn the next 2048 bytes" and not "learn the next 2048 bytes and name them Zod" or whatever.)</div><div><br></div><div>Collisions do happen, although not with a very high frequency.  When a collision happens, we simply slide the window ahead by a single byte until we find an offset at which we can start a block without a collision.</div><div><br></div><div>Actually, we scan through all possible blocks, looking at byte 0 to byte 2047, byte 1 to byte 2048, etc., on up to byte 2047 to byte 4094.  If any of them are known, we use the first hit we find.  Otherwise, we scan through the end, and use the first chunk which did not collide (most often the first chunk.)  In the worst case, if every possible chunk would collide, we send it without deduplicating and move on to the next chunk.</div><div><br></div><div>Since we allow mixing compression with deduplication, even this worst case doesn't end up being very bad.  One could construct really bad cases, but they don't happen in practice as far as I've seen.</div><div><br></div><div>It's a great question, so I elected to CC the mailing list, which I think would benefit from seeing this and likely others in the future will wonder also.</div><div><br></div><div>Thanks,</div><div>Juli. </div></div></div></div>