2

I am using s3-dist-cp to copy 3,116,886 files (300 GB) from S3 to HDFS, and it took 4 days to copy just 1,048,576 files. I killed the job and need to understand how I can reduce this time, or what I am doing wrong.

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/

This is running on an AWS EMR cluster.

Priyanka O
  • Well, I used a bigger EMR instance, m4.4xlarge. The S3 bucket and the EMR cluster were in the same region. – Priyanka O Feb 27 '17 at 10:53
  • I had the same observation as this post: http://stackoverflow.com/questions/38462480/s3-dist-cp-and-hadoop-distcp-job-infinitely-loopin-in-emr – Priyanka O Feb 27 '17 at 11:18

2 Answers

0

The issue is HDFS's poor performance when dealing with a large number of small files. Consider combining the files before putting them into HDFS; the --groupBy option of the s3-dist-cp command provides one way of doing that.
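For example (a sketch only; the regular expression and target size below are placeholders, not values from the original question), all .xml objects could be concatenated into roughly 128 MB chunks like this:

s3-dist-cp --src s3://xml-prod/ --dest hdfs:///Output/XML/ --groupBy='.*(\.xml)' --targetSize=128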

Denis
  • Thanks for the reply, Denis. I am not sure if combining those files is a good idea, as I have to consume those individual files via a Spark application, which will pick the needed columns from each individual XML and save them in Parquet format. If you have any other ideas, that would also be good. Each individual file is like a row/record. Thanks – Priyanka O Feb 28 '17 at 08:06
  • It looks like the way the data is kept in S3 might need rethinking. E.g., since a file can be treated as a record, why not group all 3 million files into a significantly smaller number of files? JSON can work quite well here, see e.g. http://stackoverflow.com/questions/16906010/storing-xml-inside-json-object – Denis Mar 01 '17 at 11:14
  • Hi Denis, these are really big XML files and I only need a subset of the data. The approach you suggested is interesting, but I still need to download the files locally to an EC2 or EMR instance to work on them further. The AWS CLI commands are not reliable, as sometimes some files don't get downloaded and you need to run a separate bash script to fetch the missing ones. I am now looking into mounting the S3 bucket, to see if that's a simpler and faster approach. – Priyanka O Mar 03 '17 at 13:12
0

Why not do the entire process as part of a single application pipeline? That way you don't have to store a lot of small intermediate files in HDFS.

S3 File Reader --> XML Parser --> Pick Required Fields --> Parquet Writer (single file with rotation policy)
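A minimal PySpark sketch of such a pipeline, assuming each file is one record; the output path, tag names, and column names are hypothetical placeholders, not values from the original post:

# Read each small XML file from S3 as one record, extract a few fields,
# and write a consolidated Parquet dataset instead of millions of tiny files.
import xml.etree.ElementTree as ET
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("xml-to-parquet").getOrCreate()

def parse_record(path_and_xml):
    # Each input file is treated as a single record; pull only the needed fields.
    # The tag names "id", "name", "value" are placeholders.
    _, xml_string = path_and_xml
    root = ET.fromstring(xml_string)
    return (root.findtext("id"), root.findtext("name"), root.findtext("value"))

# wholeTextFiles yields (path, content) pairs, one per file.
records = spark.sparkContext.wholeTextFiles("s3://xml-prod/*.xml").map(parse_record)

# Explicit schema avoids inference problems when some fields are missing.
schema = StructType([StructField(c, StringType(), True) for c in ["id", "name", "value"]])
df = spark.createDataFrame(records, schema)

# Coalesce to a small number of output files; the destination path is a placeholder.
df.coalesce(16).write.mode("overwrite").parquet("s3://xml-prod-output/parquet/")

spark.stop()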

ashwin111