
I am using Java to read text files that contain numbers. There are terabytes of data and hundreds of billions of numbers.

The goal is to fetch the data as fast as possible, and minimize GC activity. I want to parse text directly into primitives (double, float, int).

By directly I mean:

  • without instantiating any transient helper object
  • without boxing data in java.lang.Double, java.lang.Float...
  • without creating transient java.lang.String instances (a mandatory step if you are to call JDK Double.parseDouble(...))

So far I have been using the Javolution library:

double javolution.text.TypeFormat.parseDouble(CharSequence sequence);

I looked at the javolution code and it truly does not allocate any transient object. And because it accepts a CharSequence, you can present the characters decoded from the data files without instantiating transient Strings.
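To illustrate the idea of presenting decoded characters through a CharSequence without creating Strings, here is a minimal sketch. The names `CharSlice` and `PrimitiveParsers` are hypothetical and are not part of Javolution or the JDK; the sketch only handles plain decimal ints and assumes the field boundaries have already been found in a reused `char[]` buffer.

```java
/** Hypothetical reusable view over a region of a char[]; one instance is reused for every field. */
final class CharSlice implements CharSequence {
    private char[] buf;
    private int start;
    private int end;

    /** Re-points this view at buf[start, end) without allocating anything. */
    CharSlice reset(char[] buf, int start, int end) {
        this.buf = buf;
        this.start = start;
        this.end = end;
        return this;
    }

    @Override public int length() { return end - start; }

    @Override public char charAt(int index) { return buf[start + index]; }

    // The two methods below allocate; they are meant for debugging, not for the hot path.
    @Override public CharSequence subSequence(int from, int to) {
        return new CharSlice().reset(buf, start + from, start + to);
    }

    @Override public String toString() { return new String(buf, start, end - start); }
}

/** Hypothetical helper: parses an optionally signed decimal int straight from a CharSequence. */
final class PrimitiveParsers {
    private PrimitiveParsers() {}

    static int parseInt(CharSequence s) {
        int len = s.length();
        if (len == 0) throw new NumberFormatException("empty input");
        int i = 0;
        boolean negative = false;
        char first = s.charAt(0);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i = 1;
            if (len == 1) throw new NumberFormatException("sign without digits");
        }
        // Accumulate as a negative value so that Integer.MIN_VALUE stays representable.
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int multMin = limit / 10;
        int result = 0;
        for (; i < len; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException("bad digit at index " + i);
            if (result < multMin) throw new NumberFormatException("overflow");
            result *= 10;
            if (result < limit + digit) throw new NumberFormatException("overflow");
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        char[] data = "12,-345,+6789".toCharArray(); // stands in for a decoded file buffer
        CharSlice slice = new CharSlice();           // allocated once, reused for every field
        int fieldStart = 0;
        for (int pos = 0; pos <= data.length; pos++) {
            if (pos == data.length || data[pos] == ',') {
                System.out.println(parseInt(slice.reset(data, fieldStart, pos)));
                fieldStart = pos + 1;
            }
        }
    }
}
```

On valid input the parse loop allocates nothing; only the error path builds an exception. The same reusable view could be handed to `TypeFormat.parseDouble(CharSequence)` for floating-point fields.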

Are there alternatives or better ways?

Antoine CHAMBILLE

2 Answers


The method Double.parseDouble(String) does instantiate an object under the hood, but it uses caching, returning a double read from the string.
This answer offers more details.

As for the rest: the Javolution package seems to be written for real-time performance, so it looks like a suitable choice.

MC Emperor
  • I think each call does create an instance of sun.misc.FloatingDecimal: public static double parseDouble(String s) throws NumberFormatException { return FloatingDecimal.readJavaFormatString(s).doubleValue(); } – Antoine CHAMBILLE Dec 06 '12 at 11:26
  • I think it instantiates an intermediary object under the hood. http://www.docjar.com/html/api/sun/misc/FloatingDecimal.java.html – Zutty Dec 06 '12 at 11:26
  • Another hidden issue is that Double.parseDouble() only works on a String. So while you are parsing the characters of a file, you also need to create billions of transient String instances just for the sake of parsing them. – Antoine CHAMBILLE Dec 06 '12 at 11:30
  • @AntoineCHAMBILLE: Yes, but I assumed the text *is* already stored in a String or something like it, because otherwise I don't see the need to *parse* text. – MC Emperor Dec 06 '12 at 11:35
  • That means the use case was not clear enough. I have updated my question. – Antoine CHAMBILLE Dec 06 '12 at 12:38

StreamTokenizer, examined here, may be worth profiling. It parses decimal numbers as double but does not handle scientific notation.
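A minimal sketch of what profiling StreamTokenizer could start from: it sums every numeric token from a Reader. The class name `TokenizerSum` and the method `sum` are made up for illustration, and the sketch relies on the tokenizer's default syntax table, in which number parsing is already enabled.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

/** Hypothetical demo class: sums the numeric tokens that StreamTokenizer finds in a Reader. */
final class TokenizerSum {
    private TokenizerSum() {}

    static double sum(Reader reader) throws IOException {
        StreamTokenizer st = new StreamTokenizer(reader);
        double total = 0.0;
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            // Numbers arrive in the primitive field nval; no String or Double is involved.
            if (st.ttype == StreamTokenizer.TT_NUMBER) {
                total += st.nval;
            }
            // Word tokens (TT_WORD) do populate st.sval with a String, so purely
            // numeric input is the case where this approach avoids allocation.
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sum(new StringReader("1.5 2 -3.25")));
        // Scientific notation is not understood: "1e3" is read as the number 1
        // followed by the word "e3".
        System.out.println(sum(new StringReader("1e3")));
    }
}
```

In a real run the StringReader would be replaced by a buffered reader over the data file, with the tokenizer created once per file.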

trashgod
  • Indeed, it looks like a StreamTokenizer can parse numbers without allocating a single object. But this antique class from Java 1.0 outputs all numbers as 'double'; you cannot distinguish integers, single-precision and double-precision numbers. – Antoine CHAMBILLE Dec 06 '12 at 12:53