would it help to drop the chunk size from 128 bytes to 64 bytes

Juli Mallett juli at clockworksquid.com
Tue Mar 22 12:04:48 PDT 2011


Hey JJ,

On Tue, Mar 22, 2011 at 07:13, Junaid Jeewa <Junaid.Jeewa at craneww.com> wrote:
> Idea behind this thought is to reduce the chunk size and thus increase the
> probability of more matches given the smaller sample size. Please understand
> that I am not a C programmer so take no offense please if this  is a
> ridiculous idea.

It's not a bad idea, honestly, and I've thought a lot about ways that
we can match partial chunks so that we get the benefit you're talking
about.  And actually, the chunk size has recently increased to 2048
bytes from 128!

There are a few downsides that mean the bigger chunk sizes are
better for performance.  I'm happy to try to explain them!

First off, with 64 bytes the compression ratio would go down a lot.
If it takes 9 bytes to refer to a previously-sent 64-byte block, then
that's a 1:7.11 ratio.  For a 2048-byte block, it's 1:227.  So for a
2MB file, you have to send 288KB with 64-byte blocks.  With 2048-byte
blocks you only have to send 9KB to reconstitute that 2MB file.
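
To spell out that arithmetic (using the same 9-byte reference and a
2MB = 2,097,152-byte file as above):

  2,097,152 / 64   = 32,768 chunks x 9 bytes = 294,912 bytes (~288KB)
  2,097,152 / 2048 =  1,024 chunks x 9 bytes =   9,216 bytes   (~9KB)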

Second, it really slows down the encoding process.  To really get a
performance improvement, we need to be able to do our work quickly, or
we actually cause a performance hit and WANProxy is useless (except
for reducing bandwidth usage, so maybe it's worth running it on an
expensive BGAN link even if it increases latency).  With a 2048-byte
block, that means that every 2KB we have to look in a database to see
if the previous 2KB has ever occurred before.  Now
think about doing that for a 64-byte block.  Of course, we have to
make that case as fast as possible anyway, but for data that is
identical or almost identical to something we've sent before, we can
be really, really fast if all we have to do is check every 2KB.
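
To make that concrete, here's a rough standalone sketch (nothing like
WANProxy's actual encoder; the hash and the container are just
stand-ins) of why the chunk size sets how often the encoder has to hit
its lookup table for data it has seen before:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

/* FNV-1a, standing in for whatever hash the real codec uses. */
static uint64_t
chunk_hash(const uint8_t *data, size_t len)
{
        uint64_t h = 1469598103934665603ULL;
        for (size_t i = 0; i < len; i++) {
                h ^= data[i];
                h *= 1099511628211ULL;
        }
        return h;
}

/* For data we've sent before, the hot path is on the order of one
 * database lookup per chunk, so halving the chunk size doubles the
 * number of lookups for the same stream. */
static size_t
count_lookups(const std::vector<uint8_t>& data, size_t chunk_size,
    const std::unordered_set<uint64_t>& db)
{
        size_t lookups = 0;
        for (size_t off = 0; off + chunk_size <= data.size();
             off += chunk_size) {
                (void)db.count(chunk_hash(&data[off], chunk_size));
                lookups++;
        }
        return lookups;
}

int
main(void)
{
        std::vector<uint8_t> data(2 * 1024 * 1024, 0xaa); /* pretend 2MB stream */
        std::unordered_set<uint64_t> db;                  /* pretend chunk database */

        printf("2048-byte chunks: %zu lookups\n", count_lookups(data, 2048, db));
        printf("  64-byte chunks: %zu lookups\n", count_lookups(data, 64, db));
        return 0;
}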

Third, a smaller chunk size means that for the same amount of data,
you'll be storing more chunks.  For on-disk storage, if I ever get
around to implementing it, that won't matter much since we'll
hopefully encode the data pretty compactly, but in memory that adds a
lot of overhead.  We have lookup tables that are hash tables and
B-Trees.  More chunks for the same amount of data means longer chains
in the hash table and more nodes in the B-Tree.  It may only add a few
operations, but it adds up.  And it means more RAM overhead, because
we have pointers to each of those chunks, and that's at least 8 bytes for
each chunk that's alive in the system.  More chunks also means that
the collision rate — that is, the rate at which we have to waste time
because we got a previous chunk that has the same hash but was
different — goes up.
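
For instance, with a (made-up) 1GB of data in the chunk store:

  1GB / 2048-byte chunks =    524,288 chunks ->   ~4MB in 8-byte pointers
  1GB /   64-byte chunks = 16,777,216 chunks -> ~128MB in 8-byte pointers

and that's before counting the hash table chains and B-Tree nodes
themselves.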

There are a few other things like that, but as with the last item, it
tends to get pretty deep into implementation details only a programmer
could care about.  But hopefully that explains why I've moved in the
direction I have on chunk size and why I think we're unlikely to
switch to a smaller chunk size.

And actually, it helps to think about what the purpose of those chunks
is.  There are three ways to look at it: that we're trying to optimize
all traffic; that we're trying to optimize traffic that is _like_
traffic we've sent before; and that we're trying to optimize traffic
that is identical to traffic we've sent before.  I think it's easy to
imagine that the first one of those is the goal, or that it would be
nice (and it would be!), but it's not so likely.  There are cases
where you can do it (I think you're likely to find a 64-byte block of
zeroes in just about any data stream from time to time, but much less
likely to find a 2KB block of zeroes in any given data stream), but
those also tend to be cases
where using a normal ephemeral compression algorithm like gzip or
deflate is helpful.  It does things like run-length encoding, and is
supported in WANProxy.  So you can have our algorithm (XCodec) go
through and remove duplicate blocks, if it finds them, and then gzip
will not just compress the chunks we left behind, it will also
compress the WANProxy overhead, which is deliberately very easy to
compress!  Even with 64-byte chunks, there are 2^512 possible values
for any given 64-byte sequence.  Some are much more likely than
others, but I think it's still better to focus on the last two cases:
traffic that is either identical or similar to previously-sent
traffic.
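
To illustrate the run-length-encoding point, here's a tiny standalone
demo (not WANProxy code; it just assumes you have zlib around) that
feeds 2MB of zeroes to deflate and prints how little comes out:

/* build with: c++ zeroes.cc -lz */
#include <cstdio>
#include <vector>
#include <zlib.h>

int
main(void)
{
        std::vector<unsigned char> zeroes(2 * 1024 * 1024, 0); /* 2MB of zeroes */

        /* compress() is zlib's one-shot deflate wrapper. */
        uLongf outlen = compressBound(zeroes.size());
        std::vector<unsigned char> out(outlen);
        compress(out.data(), &outlen, zeroes.data(), zeroes.size());

        printf("%zu bytes in, %lu bytes out\n",
            zeroes.size(), (unsigned long)outlen);
        return 0;
}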

In those cases, 2048-byte chunks aren't so bad.  If you send a
document that is exactly like a document you sent before and you
modify a byte in the middle, then with 64-byte chunks you only have to
send 64 bytes to the remote side for it to learn the new chunk.  With
2048-byte chunks, you have to send 2KB.  But again, assume it's a 2MB
file.  So you saved 1984 bytes by using the smaller chunk size to
encode that change, but now look at the size of the rest of the file.
Those 1984 bytes are hardly a significant gain when, to get them, you
had to send over a quarter of a megabyte for the whole file.  Even
with the modified chunk in the middle, the 2048-byte chunk method only
takes about 11KB.  And
if a byte is added instead of changed, both perform as well as they
normally would, picking up where they left off after sending that one
byte unencoded.
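
Spelling that case out (again with 9-byte references and a 2MB file,
one chunk of which has to be re-sent raw):

  2048-byte chunks:  1,023 refs x 9 + 2,048 raw =  11,255 bytes  (~11KB)
    64-byte chunks: 32,767 refs x 9 +    64 raw = 294,967 bytes (~288KB)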

That's the same kind of behavior you want for traffic that's only like
traffic you've sent before and for traffic that's identical to traffic
you've sent before.  I hope that helps answer your question :)

Of course, you're welcome to try it and see!

Just change BUFFER_SEGMENT_SIZE in common/buffer.h and change
XCODEC_SEGMENT_LENGTH in xcodec/xcodec.h.  Then just do "make clean"
and then "make" again in programs/wanproxy.  If you (or anyone else)
do want to test with a bunch of different chunk sizes, you're
encouraged to share your results with the list!  Remember that both
halves of a proxy need to use the same chunk size.
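
For reference, the edits would look something like this.  This is just
a sketch: the macro names are the ones above, but the exact lines and
the current values in your checkout may differ, so check the headers
themselves.

/* common/buffer.h */
#define BUFFER_SEGMENT_SIZE     64      /* e.g. down from 2048 */

/* xcodec/xcodec.h */
#define XCODEC_SEGMENT_LENGTH   64      /* e.g. down from 2048 */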

Thanks!
Juli.


