Is it possible to run hadoop fs -getmerge in S3?

I have an Elastic MapReduce job that writes some files to S3, and I want to concatenate all of them into a single text file.
Currently I manually copy the folder with all the files to our HDFS (hadoop fs -copyFromLocal), then run hadoop fs -getmerge and hadoop fs -copyToLocal to obtain the file.
Is there any way to use hadoop fs directly on S3?


Solution 1:

Actually, the other response about getmerge is incorrect. getmerge expects a local destination and will not write to S3. If you try, it throws an IOException and responds with -getmerge: Wrong FS:.


hadoop fs [generic options] -getmerge [-nl] <src> <localdst>

Solution 2:

An easy way (if you are generating a small file that fits on the master machine) is to do the following:

  1. Merge the file parts into a single file on the local machine:

    hadoop fs -getmerge hdfs://[FILE] [LOCAL FILE]
  2. Copy the result file to S3 and delete the local file (-moveFromLocal does both):

    hadoop fs -moveFromLocal [LOCAL FILE] s3n://bucket/key/of/file
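
The two steps above can be wrapped in a small shell function. This is only a sketch, not a tested EMR recipe: merge_to_s3 is a made-up helper name, the paths in the usage comment are placeholders, and it assumes the merged file fits on the local disk of the node where it runs. A temporary directory is used so that the merged file and any side files written next to it are cleaned up together.

```shell
#!/usr/bin/env bash
# Sketch: merge part files into one local file, then move that file to S3.
# merge_to_s3 is a hypothetical helper name, not part of Hadoop.
merge_to_s3() {
  local src="$1"    # directory holding the part files, e.g. an HDFS path
  local dest="$2"   # destination object URI, e.g. s3n://bucket/key/of/file
  local tmp
  tmp="$(mktemp -d)" || return 1

  # Step 1: getmerge writes the concatenated parts to the local filesystem.
  hadoop fs -getmerge "$src" "$tmp/merged" || { rm -rf "$tmp"; return 1; }

  # Step 2: moveFromLocal uploads the file and removes the local copy.
  hadoop fs -moveFromLocal "$tmp/merged" "$dest"
  local status=$?

  # Remove the temporary directory whether or not the upload succeeded.
  rm -rf "$tmp"
  return "$status"
}

# Usage (placeholder paths):
#   merge_to_s3 hdfs:///user/me/job-output s3n://bucket/key/of/file
```

The function returns the exit status of the upload step, so a calling script can stop if either command fails.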

Solution 3:

I haven't tried the getmerge command myself, but hadoop fs commands on EMR cluster nodes support S3 paths just like HDFS paths. For example, you can SSH into the master node of your cluster and run:

hadoop fs -ls s3://<my_bucket>/<my_dir>/

The above command will list all the S3 objects under the specified directory path.

I would expect hadoop fs -getmerge to work the same way, so just use full S3 paths (starting with s3://) instead of HDFS paths.
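
A minimal sketch of what that might look like, untested on a real cluster like the suggestion itself. It follows the usage line hadoop fs -getmerge [-nl] <src> <localdst>, so the S3 path is used only as the source and the destination stays local. merge_from_s3 is a made-up helper name, the bucket and directory in the usage comment are placeholders, and the up-front check on the destination is an added assumption meant to catch the "Wrong FS" case early.

```shell
#!/usr/bin/env bash
# Sketch: merge objects under an S3 prefix into one file on the local disk.
# merge_from_s3 is a hypothetical helper name, not part of Hadoop.
merge_from_s3() {
  local s3_dir="$1"      # source, e.g. s3://<my_bucket>/<my_dir>/
  local local_file="$2"  # destination, must be on the local filesystem

  # getmerge needs a local destination, so refuse remote URIs up front.
  case "$local_file" in
    s3://*|s3n://*|s3a://*|hdfs://*)
      echo "destination must be a local path: $local_file" >&2
      return 2
      ;;
  esac

  hadoop fs -getmerge "$s3_dir" "$local_file"
}

# Usage (placeholder names):
#   merge_from_s3 "s3://<my_bucket>/<my_dir>/" /tmp/merged.txt
```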