
I have tried different ways to create a large Hadoop SequenceFile with just one short (<100 bytes) key but one large (>1GB) value (BytesWritable).

The following sample works out of the box:

https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/BigMapOutput.java

which writes multiple random-length keys and values with a total size >3GB.

However, that is not what I am trying to do, so I modified it using the Hadoop 2.2.0 API to something like:

    Path file = new Path("/input");
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.compression(CompressionType.NONE),
        SequenceFile.Writer.keyClass(BytesWritable.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    int numBytesToWrite = fileSizeInMB * 1024 * 1024;
    BytesWritable randomKey = new BytesWritable();
    BytesWritable randomValue = new BytesWritable();
    randomKey.setSize(1);
    randomValue.setSize(numBytesToWrite);
    randomizeBytes(randomValue.getBytes(), 0, randomValue.getLength());
    writer.append(randomKey, randomValue);
    writer.close();

When fileSizeInMB > 700 (i.e. a value larger than roughly 700MB), I get errors like:

java.lang.NegativeArraySizeException
        at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:144)
        at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:123)
        ...

I have seen this error discussed, but I have not found any resolution. Note that a signed int (up to 2^31 - 1) can address roughly 2GB, so it should not fail at 700MB.

If you have another way to create such a large-value SequenceFile, please advise. I tried other approaches, such as IOUtils.read from an InputStream into a byte[], but I got heap-size/OutOfMemoryError problems.
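
Roughly what that attempt looked like (a sketch, not my exact code; the input path is a placeholder and writer/conf are the same as in the snippet above):

    // Read the whole value into a byte[] first, then wrap it in a BytesWritable.
    // The entire value must fit in one array on the heap, so with a >700MB value
    // this needs a large -Xmx and otherwise dies with OutOfMemoryError.
    // Uses org.apache.hadoop.fs.FileSystem, FSDataInputStream and org.apache.hadoop.io.IOUtils.
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path("/tmp/large-value.bin");              // placeholder input file
    int valueSize = (int) fs.getFileStatus(src).getLen();     // assumes the length fits in an int
    byte[] buffer = new byte[valueSize];
    try (FSDataInputStream in = fs.open(src)) {
        IOUtils.readFully(in, buffer, 0, valueSize);
    }
    writer.append(new BytesWritable("key".getBytes()), new BytesWritable(buffer));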

user815613
  • Hello, I am facing the same problem right now. Did you resolve this error? Please share. – Shash May 26 '15 at 05:59

2 Answers


Just use ArrayPrimitiveWritable instead.
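
For example (a sketch only, untested; conf and fileSizeInMB as in the question, and ArrayPrimitiveWritable is org.apache.hadoop.io.ArrayPrimitiveWritable):

    // Wrap the raw byte[] in an ArrayPrimitiveWritable instead of a BytesWritable:
    // it serializes the array as-is, so there is no size*3/2 capacity growth to overflow.
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/input")),
        SequenceFile.Writer.compression(CompressionType.NONE),
        SequenceFile.Writer.keyClass(BytesWritable.class),
        SequenceFile.Writer.valueClass(ArrayPrimitiveWritable.class));

    byte[] value = new byte[fileSizeInMB * 1024 * 1024];   // fill with your data
    writer.append(new BytesWritable(new byte[] {1}), new ArrayPrimitiveWritable(value));
    writer.close();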

There is an int overflow when setting the new capacity in BytesWritable here:

public void setSize(int size) {
    if (size > getCapacity()) {
        setCapacity(size * 3 / 2);
    }
    this.size = size;
}

700 MB * 3 > 2 GB: the multiplication size * 3 overflows int before the division by 2.

As a result, you cannot deserialize (but can write and serialize) more than ~700 MB into a BytesWritable.
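
You can reproduce the arithmetic without Hadoop (700 MB = 700 * 1024 * 1024 bytes):

    int size = 700 * 1024 * 1024;       // 734,003,200
    System.out.println(size * 3 / 2);   // -1046478848: size * 3 wraps around the int range
    // The old setCapacity then does new byte[-1046478848],
    // which is the NegativeArraySizeException from the question.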

dimo414
  • This has since been addressed; in the [current implementation](https://hadoop.apache.org/docs/current/api/src-html/org/apache/hadoop/io/BytesWritable.html#line.121) they use longs to avoid the unnecessary overflow. – dimo414 Nov 24 '16 at 08:08

In case you would like to use BytesWritable, an option is to set the capacity high enough beforehand, so that you can utilize 2GB, not only 700MB:

randomValue.setCapacity(numBytesToWrite);
randomValue.setSize(numBytesToWrite); // will not resize now

This bug has been fixed in Hadoop, so in newer versions it should work even without that:

public void setSize(int size) {
  if (size > getCapacity()) {
    // Avoid overflowing the int too early by casting to a long.
    long newSize = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    setCapacity((int) newSize);
  }
  this.size = size;
}
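
With the long-based formula, the same 700MB value stays in range:

    int size = 700 * 1024 * 1024;                                   // 734,003,200
    long newSize = Math.min(Integer.MAX_VALUE, (3L * size) / 2L);
    System.out.println(newSize);                                    // 1101004800, a valid positive capacity
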
mcserep