
What is the advantage of a Hadoop sequence file over an HDFS flat (text) file? In what way is a sequence file more efficient?

Small files can be combined and written into a sequence file, but the same can be done with an HDFS text file. I need to know the difference between the two approaches. I have been googling this for a while, and it would be helpful to get some clarity.

Ravindra babu
hrkrshn
  • Just some questions for you: Does your text file have checksums? Can your text file be split easily if the records are not on a single line? That's actually the advantage of a sequence file. Besides that, your text file holds only strings, whereas you can serialize arbitrary data types in a sequence file. – Thomas Jungblut Aug 02 '12 at 13:42
  • Doesn't every block in HDFS have a checksum? – Razvan Aug 02 '12 at 13:52
  • Yep you're right, that is a feature of the `ChecksumFileSystem`. – Thomas Jungblut Aug 03 '12 at 11:43

3 Answers

  1. Sequence files are appropriate when you want to store keys and their corresponding values. You can do that with text files too, but you have to parse each line.
  2. They can be compressed and still be splittable, which means better parallelism. You can't split a compressed text file unless you use a splittable compression format.
  3. They are binary files, so they are more storage efficient. In a text file a double is stored as a sequence of characters, which is a large storage overhead.
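
To make point 3 concrete, here is a minimal sketch in plain Python comparing the size of one double stored as text with the same value stored in binary. It does not use Hadoop at all; the sample value and the big-endian layout are assumptions chosen for illustration.

```python
import struct

value = 3.141592653589793

# Text representation: one byte per character when encoded as UTF-8.
as_text = repr(value).encode("utf-8")

# Binary representation: an IEEE 754 double always takes 8 bytes.
as_binary = struct.pack(">d", value)

print(len(as_text), len(as_binary))

# The binary form is read back directly, with no digit parsing.
restored = struct.unpack(">d", as_binary)[0]
```

The gap grows with the number of significant digits, and a text file also needs a delimiter per field on top of that.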
Razvan

Advantages of Hadoop sequence files (as per Siva's article on the hadooptutorial.info website):

  1. They are more compact than text files.
  2. They support compression at different levels: record or block.
  3. They can be split and processed in parallel.
  4. They can solve the small files problem in Hadoop, whose main strength is processing large files with MapReduce jobs: a sequence file can be used as a container for a large number of small files.
  5. Temporary output of a mapper can be stored in sequence files.
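
The difference between record-level and block-level compression in point 2 can be illustrated with a rough sketch. This is plain Python with `zlib` standing in for whatever codec is configured, not the Hadoop API, and the sample records are made up for the example.

```python
import zlib

# Hypothetical small records, similar to each other as log lines often are.
records = [
    f"user={i:04d};status=OK;region=eu-west".encode("utf-8")
    for i in range(200)
]

raw = sum(len(r) for r in records)

# Record-level: every value is compressed on its own.
record_level = sum(len(zlib.compress(r)) for r in records)

# Block-level: many records are buffered and compressed together,
# so redundancy shared across records can be exploited.
block_level = len(zlib.compress(b"".join(records)))

print(raw, record_level, block_level)
```

With short, similar records the block-level total comes out much smaller, because a compressor gets little to work with inside one short record.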

Disadvantages:

  1. Sequence files are append-only.
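
The "container for small files" idea can be sketched as a stream of length-prefixed key/value records. This is only an illustration in plain Python: the real SequenceFile format also has a header and sync markers, and the helper names `append_record` and `read_records` are hypothetical.

```python
import io
import struct


def append_record(stream, key: bytes, value: bytes) -> None:
    """Append one (key, value) record: two 4-byte lengths, then the bytes."""
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        yield stream.read(key_len), stream.read(value_len)


# Hypothetical small files: name -> contents (arbitrary bytes are allowed).
small_files = {
    "a.txt": b"first file\nwith two lines",
    "b.bin": bytes([0, 255, 10, 13]),
}

container = io.BytesIO()
for name, data in small_files.items():
    append_record(container, name.encode("utf-8"), data)

container.seek(0)
recovered = {k.decode("utf-8"): v for k, v in read_records(container)}
```

Because each record carries its own lengths, values may contain newlines or raw bytes without breaking record boundaries, which a line-oriented text file cannot guarantee. Records are only ever added at the end, which matches the append-only limitation above.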
Ravindra babu

Sequence files are commonly used for intermediate data in MapReduce processing. They are compressible and fast to process: a mapper can write its output to one and a reducer can read from it. Both Hadoop and Spark provide APIs to read and write sequence files.

Shailesh