November 2013

How do I upload many GB of files to Amazon S3?

Recently I migrated a project to Amazon to enable the auto scaling features for its servers. The project has around 5.2 lakh files (525,034 by the count below) in 2234 directories, so it is hard to find the broken files once an S3 upload has been interrupted.

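How do I find the number of files and folders in a directory?

Use this command (${t:0:1} takes the first letter of each word, so find is called with -type f, -type l and -type d in turn):

for t in files links directories; do echo `find . -type ${t:0:1} | wc -l` $t; done 2> /dev/null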

[root@ip-10-114-123-227 ~]# for t in files links directories; do echo `find /home/liju/public_html/uploads/ -type ${t:0:1} | wc -l` $t; done 2> /dev/null
525034 files
0 links
2234 directories
[root@ip-10-114-123-227 ~]#
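
The loop above walks the directory tree three times. On a tree this size, a single pass may be noticeably quicker. The following is only a sketch, assuming GNU find (for -printf) and awk; count_entries is a made-up helper name, not part of any tool.

```shell
#!/usr/bin/env bash
# Count regular files, symlinks and directories under a path in one find pass.
# Assumes GNU find: -printf '%y\n' prints one type letter (f, l, d, ...) per entry.
# count_entries is a hypothetical helper name used for illustration.
count_entries() {
    local dir="${1:-.}"
    find "$dir" -printf '%y\n' 2>/dev/null | awk '
        $1 == "f" { files++ }
        $1 == "l" { links++ }
        $1 == "d" { dirs++ }
        END { printf "%d files\n%d links\n%d directories\n", files, links, dirs }'
}

# Usage: count_entries /home/liju/public_html/uploads/
```

Like the original loop, this counts the starting directory itself as one of the directories.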

Googling led me to a combination of rsync and s3fs to facilitate this, because the s3cmd tool kept hanging on my m1.small instance. I installed s3fs, mounted the bucket as a drive on my server, and then started the file synchronization. I tried rsync twice and got the same result both times:

rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (4585911 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]

Finally I found that the “/tmp” partition had filled up while huge files were being cached during the rsync run, so the operation terminated abnormally :-(. I added another 10 GB volume, mounted it as “/tmp”, and restarted the same job. It worked !! 🙂
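
A failure like this can be caught before a long transfer is started. Here is a minimal sketch, assuming a POSIX df and awk; free_kb and has_room are hypothetical helper names, and the 10 GB threshold in the usage line is only an example value.

```shell
#!/usr/bin/env bash
# Print the free space, in KiB, of the filesystem that holds the given path.
# df -P gives one stable-format line per filesystem; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if PATH has at least MIN_KB KiB free.
# Usage: has_room PATH MIN_KB
has_room() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example (10 GB expressed in KiB):
#   has_room /tmp $((10 * 1024 * 1024)) || echo "not enough space in /tmp" >&2
```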

This mode of S3 upload (rsync + s3fs) is really slow: I was getting a transfer rate of about 69 files per minute. By that arithmetic the job would take 4-5 days, so I looked for other options.
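
The estimate can be checked with a quick calculation. This sketch assumes the rate is 69 files per minute and uses the 525,034 files counted earlier; eta_days is a made-up helper name.

```shell
#!/usr/bin/env bash
# Estimate how many days a transfer takes at a fixed files-per-minute rate.
# Usage: eta_days TOTAL_FILES FILES_PER_MINUTE
# eta_days is a hypothetical helper; 1440 is the number of minutes in a day.
eta_days() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.1f\n", n / r / 1440 }'
}

eta_days 525034 69   # prints 5.3
```

That is a little over five days, in the same range as the rough figure above.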

My attempts were:

1. I took a snapshot of the volume holding these files, created a new volume from the snapshot with extra IOPS, and attached it to a high-end compute instance (c1.xlarge).

2. I installed the s3cmd tools and uploaded the files with the sync command. It took about 30 minutes to index all 5.2 lakh files before the upload started. The upload speed was awesome: the whole set went up to S3 in just 12 hours, like a charm 🙂 🙂

This shows that s3cmd has rocket speed: I saw 12 Mbps while uploading files from inside the Amazon network. I used a cron job to start the upload, and it took 12 hours to push the 27 GB of uploads to S3 storage.

s3cmd sync --acl-public --recursive --add-header="Expires:$(date -u --date '+5 years' '+%a, %d %b %Y %H:%M:%S GMT')" --add-header='Cache-Control:max-age=31536000, public' /home/user1/public_html/uploads/ s3://mybucketname/ 2>&1 | tee /home/installation/s3-uploads.txt

This command does the following:

a. Sets an Expires header on each uploaded object
b. Sets a Cache-Control header on each object uploaded to S3
c. Makes the S3 objects public, so that browsers can download them.
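
The three steps above can also be wrapped in a small script, which keeps the quoting manageable. This is a sketch rather than the exact command I ran: it assumes GNU date and an installed, configured s3cmd, and expires_header and sync_public are hypothetical helper names. Nothing is uploaded until sync_public is called.

```shell
#!/usr/bin/env bash
# Build an "Expires:" header dated N years from now, in HTTP date format.
# Assumes GNU date (--date); LC_ALL=C keeps day and month names in English.
expires_header() {
    local years="${1:-5}"
    printf 'Expires:%s' \
        "$(LC_ALL=C date -u --date "+${years} years" '+%a, %d %b %Y %H:%M:%S GMT')"
}

# Sync a local directory to a bucket with a public ACL and caching headers.
# Usage: sync_public LOCAL_DIR S3_URL   (both arguments are placeholders)
sync_public() {
    local src="$1" dest="$2"
    s3cmd sync --acl-public --recursive \
        --add-header="$(expires_header 5)" \
        --add-header='Cache-Control:max-age=31536000, public' \
        "$src" "$dest"
}

# Example:
#   sync_public /home/user1/public_html/uploads/ s3://mybucketname/ 2>&1 \
#       | tee /home/installation/s3-uploads.txt
```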

Setting the Expires and Cache-Control headers improves the user experience and also reduces the S3 bandwidth cost. Happy S3ing !! 🙂

My cron job entry is shown below. It runs at 12:12 on the 23rd of the month, and every % is escaped because cron treats a bare % as a newline:

12 12 23 * * s3cmd sync --acl-public --recursive --add-header="Expires:$(date -u --date '+5 years' '+\%a, \%d \%b \%Y \%H:\%M:\%S GMT')" --add-header='Cache-Control:max-age=31536000, public' /home/user1/public_html/uploads/ s3://mybucketname/ 2>&1 | tee /home/installation/s3-uploads.txt
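
One caveat for crontab lines that call date with a format string: cron treats an unescaped % in the command field as a newline and feeds everything after it to the command as standard input, so each % needs a backslash in front of it. A small sketch of a filter that does the escaping; cron_escape is a made-up helper name.

```shell
#!/usr/bin/env bash
# Escape every % in a command line so it can be pasted into a crontab entry.
# Reads the command on stdin and writes the escaped form to stdout.
# cron_escape is a hypothetical helper name used for illustration.
cron_escape() {
    sed 's/%/\\%/g'
}

# Example:
#   echo 'date -u +%Y' | cron_escape
```

An alternative is to put the whole command in a script file and call that script from cron, in which case no escaping is needed.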

