I’ve been doing a lot of S3 work lately, mainly uploading very large files, between 100MB and 5GB. On a box responsible for uploading backups to S3, I would consistently get “Connection reset by peer” errors, both from a Ruby library and from a Python library, or the upload would just plain stop (silently fail) when using a bash-only library, though even there I could see in tcpdump that the connection was being reset.
This was driving me nuts!
So after looking at enough tcpdump output to make my eyes bleed, and contacting Amazon (they were no help), I finally determined the cause of the problem to be a combination of:
- TCP Window Scaling
- Linux kernels 2.6.17 or newer
Basically, as I understand it, Linux kernels 2.6.17+ increased the maximum size of the TCP window/buffer, and this started to cause other gear to wig out if it couldn’t handle sufficiently large TCP windows. That gear would reset the connection, and we see this as a “Connection reset by peer” message.
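To put rough numbers on that: TCP’s window field is only 16 bits, so without scaling a window tops out at 64KB, and the window scale option left-shifts the advertised value to get past that ceiling. A quick sketch of the arithmetic (plain Python, nothing S3-specific, shift values just illustrative):

```python
# TCP's window field is 16 bits, so an unscaled window maxes out at
# 65535 bytes. The window scale option left-shifts the advertised value;
# kernels 2.6.17+ default to buffers big enough to need a nonzero shift,
# and gear that mishandles the option misreads the window and resets
# the connection.
MAX_UNSCALED = 2**16 - 1  # 65535 bytes, the pre-scaling ceiling

for shift in range(4):
    print(f"scale shift {shift}: window up to {MAX_UNSCALED << shift} bytes")
```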
The fix for this is easy peasy, but it’ll lower your maximum throughput (which is exactly why the Linux kernel guys upped the default in the first place: bigger windows mean faster downloads, so why not). Put something like the following in /etc/sysctl.conf (each line is a group of three values: minimum, default, and maximum buffer size in bytes; these particular numbers are just a sane starting point):

    net.ipv4.tcp_wmem = 4096 16384 65536
    net.ipv4.tcp_rmem = 4096 65536 65536

The last number in that group of three is the important one. If you’re not getting any resets, increase it and your uploads/downloads will be faster. If you are getting resets, decrease it and the resets will go away, but your throughput will be slower.
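To see what your kernel is currently using (the same min/default/max triple the sysctl setting controls), you can read it straight out of /proc. A quick check, assuming Linux:

```python
# Read the kernel's current TCP send-buffer triple (min, default, max,
# in bytes). This is the same tunable that net.ipv4.tcp_wmem in
# /etc/sysctl.conf sets; Linux-only.
with open("/proc/sys/net/ipv4/tcp_wmem") as f:
    wmem_min, wmem_default, wmem_max = (int(v) for v in f.read().split())

print(f"tcp_wmem: min={wmem_min} default={wmem_default} max={wmem_max}")
```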
Your location of /etc/sysctl.conf might be different (I’m on Ubuntu). But you get the idea.
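If you can’t (or don’t want to) change the system-wide defaults, the same cap can be applied to a single connection by shrinking the socket buffers before connecting. A sketch in Python (the 65536 figure here is just an illustrative cap, not something from the original fix):

```python
import socket

# Per-socket version of the same idea: setting SO_SNDBUF/SO_RCVBUF
# *before* connecting caps the buffers (and therefore the window) this
# one connection will use, without touching system-wide sysctls.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 65536)

# The kernel may round or double the requested size; check what it chose.
print("SO_SNDBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
print("SO_RCVBUF:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```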
I should note that this problem doesn’t just happen with S3, but with a LOT of other sites as well. It’s actually been well documented outside of the realm of S3, if you want more info.
Hopefully this post will save someone from doing too much of what I did.