
Basically the whole question is in the title: is it possible to append to a file located on HDFS from multiple computers simultaneously? Something like storing a stream of events constantly produced by multiple processes. Order is not important.

I recall hearing in one of the Google tech presentations that GFS supports such append functionality, but my limited testing with HDFS (either with a regular file append() or with SequenceFile) doesn't seem to work.

Thanks,

maximdim
  • Here is some background on why append is not possible yet: [File Appends in HDFS](http://www.cloudera.com/blog/2009/07/file-appends-in-hdfs) – Dag Sep 01 '11 at 09:31

2 Answers


I don't think that this is possible with HDFS. Even though you don't care about the order of the records, you do care about the order of the bytes in the file. You don't want writer A to write a partial record that then gets corrupted by writer B. This is a hard problem for HDFS to solve on its own, so it doesn't.

Create a file per writer. Pass all the files to any MapReduce worker that needs to read this data. This is much simpler and fits the design of HDFS and Hadoop. If non-MapReduce code needs to read this data as one stream then either stream each file sequentially or write a very quick MapReduce job to consolidate the files.
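A minimal sketch of the file-per-writer pattern, using the standard Hadoop `FileSystem` API (the `/events` directory and the hostname-plus-UUID naming scheme are assumptions, not anything HDFS mandates):

```java
import java.net.InetAddress;
import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PerWriterEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Give each writer its own file under a shared directory,
        // e.g. /events/<hostname>-<uuid>. No two writers ever touch
        // the same file, so HDFS's single-writer rule is never violated.
        String name = InetAddress.getLocalHost().getHostName()
                + "-" + UUID.randomUUID();
        Path file = new Path("/events/" + name);

        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("event-1\n");
            out.writeBytes("event-2\n");
        }
        // A MapReduce job can then take the whole /events/ directory
        // as its input path and read every writer's file.
    }
}
```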

Spike Gronim
  • Thanks. I guess I didn't realize that it doesn't have to be one file per MapReduce job. Writing one file per computer should be very simple to implement, perhaps using an in-memory queue as suggested in another answer to avoid blocking. – maximdim Jun 20 '11 at 12:09
  • 4
    @Spike Just to clarify that GFS does support concurrent append. From their GFS paper: "Record append is heavily used by our distributed applications in which many clients on different machines append to the same file concurrently." – John David Jul 21 '12 at 05:44
  • You should get an [exception stating the file already exists](https://issues.apache.org/jira/browse/HDFS-8177). That JIRA says `"HDFS supports single writer at a time for a given file."` You can consolidate the files as suggested in this answer using [`getmerge`](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#getmerge); see the sketch after these comments. – EthanP Jan 21 '16 at 07:31
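For the consolidation step, a minimal sketch using `FileUtil.copyMerge`, roughly the programmatic equivalent of `hadoop fs -getmerge` (the paths are assumptions; note that `copyMerge` exists in Hadoop 2.x but was removed in 3.x):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Concatenate every per-writer file under /events into a
        // single output file for non-MapReduce consumers.
        FileUtil.copyMerge(fs, new Path("/events"),
                fs, new Path("/merged/events.txt"),
                false /* keep the source files */, conf, null);
    }
}
```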

Just FYI, it will probably be fully supported in Hadoop 2.6.x, according to the JIRA item on the official site: https://issues.apache.org/jira/browse/HDFS-7203
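For reference, a minimal sketch of appending with the standard `FileSystem.append()` call (the path is an assumption). Even where append is supported, HDFS still allows only one writer per file at a time, so this does not enable concurrent appends from multiple machines:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendEvents {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // append() opens an existing file for appending; it fails
        // if another client currently holds the lease on the file.
        try (FSDataOutputStream out = fs.append(new Path("/events/log.txt"))) {
            out.writeBytes("another event\n");
        }
    }
}
```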

Dan