Freezed download via Wanproxy

Mon Mar 19 16:33:13 PDT 2018

Or, possibly, it is better to generate error instead of EOS ?

2018-03-20 1:02 GMT+02:00 Ivan Pizhenko <ivan.pizhenko at gmail.com>:
> Hi Juli,
>
> I have debugged Wanproxy on Linux using simples configuration without
> caching and xcodec, just to find out how it does it correctly.
> As I said in previous letter, I was thinking that one of following events
> (1) poll_handler->callback(Event::EOS);
> (2) poll_handler->callback(Event::Error);
> should happen and finally result closing client connection, but I was wrong.
> Neither of these have happened, instead it was zero-length read which
> was translated somewhere in the IOSystem::Handle::read_do() into
> Event::EOS.
>
> After this I was thinking how it could be possible that EOS generated
> bu then actually lost, and have performed more experiments.
> I have noticed that among other options, I have turned compression on.
> So I have did the same experiment without compression and with
> compression, and surprisingly with compression it doesn't work but
> works without it. After this, to make additional proof, I have created
> another configuration without Xcodec and cache, but just with
> compression, and with that I could reproduce the issue again and
> again. Then I have added some logging at the beginning of the inflate
> and deflate pipe classes method consume() and received following log
> output on the "client" Wanproxy:
>
> 1521499142.151618 [/zlib/inflate_pipe] DEBUG: virtual void
> InflatePipe::consume(Buffer*): InflatePipe: Consuming 65536
> 1521499143.399245 [/event/poll] DEBUG: virtual void EventPoll::main():
> EPOLLIN: 9
> 1521499143.399660 [/zlib/inflate_pipe] DEBUG: virtual void
> InflatePipe::consume(Buffer*): InflatePipe: Consuming 0
>
> So last invocation of InflatePipe::consume() brings that zero-sized
> buffer to it. But quick code review shows that inflate/deflate pipe
> consume() methods don't do anything special this case, and that's
> really bad, because it causes the stuck download issue. I assume that
> correct behaviour should be to flush what was compressed so far and
> generate EOS, or just immediately generate EOS, losing incomplete
> compressed data chunk.
>
> What do you think?
>
> This is especially important to understand for me, since I am planning
> to add more compression methods in a future (for example, I'd like to
> have LZ4).
>
> Ivan.
>
> 2018-03-17 9:53 GMT+02:00 Ivan Pizhenko <ivan.pizhenko at gmail.com>:
>> Hi Juli,
>>
>> First of all, thank you for finding time to reply to my messages.
>>
>> So far I have only briefly looked at XCodec intervals, and I have also
>> notices that it has these XCODEC_PIPE_OP_EOS and
>> XCODEC_PIPE_OP_EOS_ACK. So I am planning to dig into it and see how
>> they are handled.
>>
>> Regarding breaking functionality: I assume Xcodex doesn't work with
>> socket directly and should be invariant to what actually happens with
>> socket as long as it gets correct inout sequence of control codes,
>> right?
>>
>> I have seen there are two possibilies:
>> poll_handler->callback(Event::EOS);
>> and
>> poll_handler->callback(Event::Error);
>>
>> So need to understand which one works in this case and make sure
>> correct action taken. I think, in the both cases, connection on the
>> proxy's "interface" should be closed.
>> Am I correct?
>>
>> Ivan.
>>
>> 2018-03-17 1:47 GMT+02:00 Juli Mallett <juli at clockworksquid.com>:
>>> Hi Ivan,
>>>
>>> I've worked on that code pretty extensively in the past, and I'm pretty sure
>>> I tested a wide set of circumstances, but it's certainly possible there's
>>> some missing edge case.  I'd suggest looking at the code related to
>>> XCODEC_PIPE_OP_EOS and XCODEC_PIPE_OP_EOS_ACK.  If I were testing this, and
>>> trying to reproduce it, my first step would be to put INFO or DEBUG
>>> statements throughout the code, and to watch what happens in as simplified
>>> of a test case as possible, to determine the correct behaviour.
>>>
>>> Note that you have to be very slow and methodical in changing these things,
>>> as you can easily make a change which will close the connection in your
>>> case, but which is wrong in the case where shutdown(2) is being used on a
>>> connection which may, in fact, be long-lived.  That's the sort of mistake
>>> that people tend to make in working on middleboxes and proxies: overfitting.
>>> In this case, it sounds like you're reproducing a case where the EOS
>>> machinery isn't running properly, but without digging into it, it's hard to
>>> be sure what's being too conservative, and how to fix it without breaking
>>> other things.  If I were reproducing and testing the issue, my expectation
>>> would be that it would come out to be a fairly simple fix in most cases, but
>>> I've been wrong about that before.
>>>
>>> I'd be shocked if the polling code for kqueue was wrong, and mildly
>>> surprised if it were wrong for epoll, given how extremely widely deployed
>>> and tested that code is.  Your assessment that it's probably in the XCodec
>>> protocol stuff is probably right, and I hope any of this is helpful to you.
>>> It sounds like you're an accomplished programmer working on WANProxy, so I'm
>>> sure you'll be able to figure it out.  If you run in verbose (-v) mode, with
>>> debugging compiled in, you should see that there's already some debugging
>>> statements around these cases.  Where there might be some loss of fidelity
>>> would be in how errors, rather than simply ordinary end-of-stream, propagate
>>> into the pipe system.  There's a lot of testing and work I've done on
>>> related things, mostly using libuinet, that aren't part of the open source
>>> version of WANProxy, so if I had to guess about a location for an issue
>>> outside of XCodec, that's where I'd think about looking.  Like, the case
>>> where Splice::complete is called with an error: the underlying connections
>>> should be torn down, but it's possible that's not happening for some reason.
>>>
>>> Again, just be careful: when changing this kind of thing, overfitting is
>>> extremely easy to do.  Good luck, and I look forward to hearing what you
>>> find!  I wish I had time to take a look and provide either a patch or a more
>>> helpful set of suggestions myself.
>>>
>>> Thanks,
>>> Juli.
>>>
>>> On Fri, Mar 16, 2018 at 4:32 PM, Ivan Pizhenko <ivan.pizhenko at gmail.com>
>>> wrote:
>>>>
>>>> Hi Juli,
>>>>
>>>> I've started exploring Wanproxy code and found that socket event
>>>> polling with epoll(), which I use in Linux, is likely done correctly.
>>>> To check this, I've performed another experiement -  I have set
>>>> "codec" to None on the both server and client and tiried again.
>>>> And it started to work correctly, exactly as I expect - when I kill
>>>> "server" Wanproxy, "client" Wanproxy has disconnected its client -
>>>> but... without any traffic optimization, which I want Wanproxy to do.
>>>> So the issue must be inside XCodec. Can you please help me to identify
>>>> it and recommend how to fix?
>>>>
>>>> Ivan.
>>>>
>>>>
>>>> 2018-03-15 6:43 GMT+02:00 Ivan Pizhenko <ivan.pizhenko at gmail.com>:
>>>> > Hi Juli,
>>>> >
>>>> > I have managed to install couple FreeBSD 11 RELEASE VMs (that was
>>>> > really tricky, but setting up second one was finally easier than
>>>> > first), built the Wanproxy on them and executed the same experiment.
>>>> > I have tried few various combinations: all locally, on the same
>>>> > Linux/FreeBSD machine, and client on the one Linux/FreeBSD machine
>>>> > with server on the different Linux/FreeBSD machine.
>>>> > And the result was the same in all cases - when "server" Wanproxy goes
>>>> > down, "client" Wanproxy does not disconnect its client. So I think
>>>> > there must be major issue the Wanproxy logic.
>>>> > I still did not review source code deeply yet, but can you please
>>>> > confirm, do you really think that current implementation should
>>>> > propagate connection state correctly inside "client" Wanproxy?
>>>> >
>>>> > Also I have got Wanproxy crash on FreeBSD, when attempted to specify
>>>> > server VM name in the client wanproxy config.
>>>> > I have put following into my client.conf:
>>>> >
>>>> > create peer peer0
>>>> > set peer0.family IP
>>>> > set peer0.host "wptest1"
>>>> > set peer0.port "3301"
>>>> > activate peer0
>>>> >
>>>> > This have given me following error (and crash right after it):
>>>> > 1521079851.327281 [/socket/address] ERR: bool
>>>> > socket_address::operator()(int, int, int, const string&): Could not
>>>> > look up [wptest1]:3301: hostname nor servname provided, or not known
>>>> > 1521079851.327354 [/socket/handle] ERR: static SocketHandle*
>>>> > SocketHandle::create(SocketAddressFamily, SocketType, const string&,
>>>> > const string&): Invalid hint: [wptest1]:3301
>>>> > ./client.sh: line 1: 13501 Segmentation fault (core dumped) ./wanproxy
>>>> > -c client.conf
>>>> >
>>>> > Note that on Linux that worked pretty good.
>>>> > I have had name resolution configured through WINS (Samba), i.e. have
>>>> > running Samba with valid config, and have wins added to
>>>> > /etc/nsswitch.conf:
>>>> >
>>>> > hosts: files wins dns
>>>> >
>>>> > Note that ping has reached that host successfully:
>>>> >
>>>> > $ ping wptest1
>>>> > PING wptest1 (192.168.150.11): 56 data bytes
>>>> > 64 bytes from 192.168.150.11: icmp_seq=0 ttl=64 time=0.266 ms
>>>> > 64 bytes from 192.168.150.11: icmp_seq=1 ttl=64 time=0.234 ms
>>>> > 64 bytes from 192.168.150.11: icmp_seq=2 ttl=64 time=0.381 ms
>>>> > 64 bytes from 192.168.150.11: icmp_seq=3 ttl=64 time=0.382 ms
>>>> > 64 bytes from 192.168.150.11: icmp_seq=4 ttl=64 time=0.269 ms
>>>> > ^C
>>>> > --- wptest1 ping statistics ---
>>>> > 5 packets transmitted, 5 packets received, 0.0% packet loss
>>>> > round-trip min/avg/max/stddev = 0.234/0.306/0.382/0.063 ms
>>>> >
>>>> > But wanproxy crashed.
>>>> > I had to specify IP address (192.168.150.11) instead of name(wptest1)
>>>> > to mitigate this.
>>>> > But it works on Linux with no matter there is IP address or host name.
>>>> >
>>>> > WBW, Ivan.
>>>> >
>>>> >
>>>> > 2018-03-07 5:01 GMT+02:00 Juli Mallett <juli at clockworksquid.com>:
>>>> >> Hi Ivan,
>>>> >>
>>>> >> I don't know the Linux TCP/IP stack, unfortunately, so I can't be any
>>>> >> help
>>>> >> there.  In your case, I think you might want to consider adding, or
>>>> >> having
>>>> >> someone add, a simple heartbeat mechanism to the xcodec protocol in
>>>> >> WANProxy.
>>>> >>
>>>> >> Thanks,
>>>> >> Juli.
>>>> >>
>>>> >> On Tue, Mar 6, 2018 at 6:15 PM, Ivan Pizhenko <ivan.pizhenko at gmail.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> Hi Juli,
>>>> >>>
>>>> >>> Thanks for replying to my email.
>>>> >>>
>>>> >>> I am using Linux. I have set up VirtualBox VM with Xubuntu 16.04 LTS
>>>> >>> with latest HWE kernel 4.13 and all latest updates. I have not tuned
>>>> >>> any OS options related to networking and TCP/IP protocol. I am not
>>>> >>> using libuinet. I am not targeting FreeBSD, I need to have it working
>>>> >>> on Linux, primarily on Ubuntu Server.
>>>> >>>
>>>> >>> So I also was expecting that connection should be reset after some
>>>> >>> reasonable timeout, but that didn't happen (or I have waited for too
>>>> >>> short time??? I remember it was like at least 10 minutes). So present
>>>> >>> mechanism seems to don't work. Thanks, heartbeat is interesting idea,
>>>> >>> but probably there is something we can do via TCP connection settings
>>>> >>> that we did not do yet? I am not big specialist in TCP protocol
>>>> >>> settings, but I suppose you must be more aware in this area, so I am
>>>> >>> asking about this, probably you can recommend something else. If
>>>> >>> nothing more can be done, then sure, I will need to implement
>>>> >>> heartbeat.
>>>> >>>
>>>> >>> Ivan.
>>>> >>>
>>>> >>>
>>>> >>> 2018-03-06 3:48 GMT+02:00 Juli Mallett <juli at clockworksquid.com>:
>>>> >>> > Hi Ivan,
>>>> >>> >
>>>> >>> > WANProxy should pass along state when a stream is closed from end to
>>>> >>> > end,
>>>> >>> > not perfectly, but your connection should be properly reset at some
>>>> >>> > point
>>>> >>> > from the server going away.  There isn't anything that can be done
>>>> >>> > in a
>>>> >>> > protocol-neutral way that exceeds that, but that should be good
>>>> >>> > enough
>>>> >>> > for
>>>> >>> > most uses.  Of course there are things that can disrupt the TCP
>>>> >>> > state
>>>> >>> > machine, or settings on a system can mean that connections aren't
>>>> >>> > timed
>>>> >>> > out
>>>> >>> > when they should be.
>>>> >>> >
>>>> >>> > Are you using libuinet, FreeBSD, Linux, or something else for the
>>>> >>> > TCP/IP
>>>> >>> > stack?
>>>> >>> >
>>>> >>> > An easy change would be to add a heartbeat on all active sessions
>>>> >>> > with
>>>> >>> > WANProxy to actively probe for disconnected peers, but I'm not sure
>>>> >>> > I'd
>>>> >>> > encourage that.  If you think that would be helpful to you, let me
>>>> >>> > know.
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> > Juli.
>>>> >>> >
>>>> >>> > On Sat, Feb 24, 2018 at 1:09 AM, Ivan Pizhenko
>>>> >>> > <ivan.pizhenko at gmail.com>
>>>> >>> > wrote:
>>>> >>> >>
>>>> >>> >> Hi,
>>>> >>> >>
>>>> >>> >> I am making some tests with Wanproxy to understand how much it is
>>>> >>> >> stable and reliable. I am using latest Wanproxy code from Github
>>>> >>> >> and
>>>> >>> >> work on Ubuntu 16.04 LTS with kernel 4.13 and all latest updates.
>>>> >>> >>
>>>> >>> >> I have conducted following simple test:
>>>> >>> >>
>>>> >>> >> I have installed locally Apache 2 HTTP Server and put some large
>>>> >>> >> file
>>>> >>> >> into the document root. Then I have configured, also locally,
>>>> >>> >> "client"
>>>> >>> >> and "server" Wanproxy similar to how it is described in examples
>>>> >>> >> section on wanproxy.org, but without ssh tunnel between them, to
>>>> >>> >> proxy
>>>> >>> >> Apaches's HTTP port. Then I have used wget to download that large
>>>> >>> >> file
>>>> >>> >> through "client" Wanproxy. It worked fine but slower than direct
>>>> >>> >> download from Apache. Then I have tried to do the same thing but  I
>>>> >>> >> have shut down "server" Wanproxy somewhere in the middle of
>>>> >>> >> download.
>>>> >>> >> The download has freezed, the were no further progress. When I have
>>>> >>> >> restarted "server" Wanproxy, the download did not resume. When I
>>>> >>> >> shut
>>>> >>> >> down client Wanproxy, wget showed error like "connection refused"
>>>> >>> >> and
>>>> >>> >> exited.
>>>> >>> >>
>>>> >>> >> I would expect that when "server" Wanproxy went down, "client" one
>>>> >>> >> would disconnect clients connected to it to indicate that upstream
>>>> >>> >> link is broken, if not immediately, then after some reasonable
>>>> >>> >> timeout. Is there a way to achieve something like this with
>>>> >>> >> Wanproxy?
>>>> >>> >> If not, what changes to Wanproxy are needed to enable such
>>>> >>> >> functionality?
>>>> >>> >>
>>>> >>> >> Ivan.
>>>> >>> >> _______________________________________________
>>>> >>> >> wanproxy mailing list
>>>> >>> >> wanproxy at lists.wanproxy.org
>>>> >>> >> http://lists.wanproxy.org/listinfo.cgi/wanproxy-wanproxy.org
>>>> >>> >
>>>> >>> >
>>>> >>
>>>> >>
>>>
>>>