
I have a text file containing 200 million edges in the form of:

12 34
12 920

Indicating an edge from node 12 to node 34. The edges need to be stored in memory in a way that gives easy access to each vertex's adjacency list, so that every edge connected to a given vertex can be looked up quickly.

I'm using a HashMap to store the nodes, and each Node just contains a list of links:

import java.util.ArrayList;
import java.util.List;

public class Node {
    // adjacency list; initialized here so addLink() never sees a null list
    List<Node> links = new ArrayList<>();

    public synchronized void addLink(Node node)
    {
        // linear scan to avoid duplicate edges before appending
        if (links.indexOf(node) == -1)
            links.add(node);
    }
}

I'm also using BufferedReader.readLine() to read each line from the text file. The problem is, this approach takes around 85 seconds to read all 200 million edges.
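
A simplified sketch of the kind of loop I mean (not my exact code; the file name and the split-on-space parsing are placeholders):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class Loader {
    public static void main(String[] args) throws IOException {
        Map<Integer, Node> nodes = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("edges.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(" ");
                int from = Integer.parseInt(parts[0]);
                int to = Integer.parseInt(parts[1]);

                // look up (or create) both endpoints, then record the edge
                Node a = nodes.get(from);
                if (a == null) { a = new Node(); nodes.put(from, a); }
                Node b = nodes.get(to);
                if (b == null) { b = new Node(); nodes.put(to, b); }
                a.addLink(b);
            }
        }
    }
}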

After 30 hours, I'm currently leaning towards believing that those speeds are impossible in Java. Is there a faster implementation that I'm just not seeing?

K. Stevens
  • Start by taking out the hashmap - just read the data and do nothing with it. How fast is that? If it's still too slow then BufferedReader can't do it. If it is fast enough then your data structures/storage are the problem, not the file read. – John3136 Apr 13 '16 at 00:22
  • Take a look at this, maybe it'll help: http://stackoverflow.com/questions/1605332/java-nio-filechannel-versus-fileoutputstream-performance-usefulness – Maljam Apr 13 '16 at 00:23
  • You can always skip using a reader and read the bytes directly. As you won't need Strings you can save some created objects by saving the data to int directly. If you have enough memory, you could also read the file all at once by using `java.nio.file.Files.readAllBytes`, but that needs a lot of memory and may fail otherwise. Alternatively, install an SSD. (Did your teacher even take hardware into account?) – Zhedar Apr 13 '16 at 00:25
  • @HovercraftFullOfEels No, but that would be my first step in trying to narrow down the problem, As a learning experience for basic debug (a skill which seems non existent these days) I think it is worthwhile. I will not buy this record. It is scratched ;-) – John3136 Apr 13 '16 at 00:27
  • @John3136 The file i/o by itself takes around 23 seconds. I might be able to multithread it in order to reduce that down. The data storage is taking the most time, 90% of which is spent adding links to the nodes. I can't for the life of me figure out why. – K. Stevens Apr 13 '16 at 00:37
  • You can read millions of lines a second with `BufferedReader`. That won't be the problem. It's a data structure problem. Most of the time is being taken up by the calls to `node.addLink(Node node)` *and* `links.indexOf(node)`. At least try to find a way to combine these operations. Both operations are *O(N)* and you are executing them *N* times, which makes the entire operation *O(N^2)*. Can you use a `Set` instead of a `List`? (i.e. do you need the ordering?) Removing buffering won't help. NB you don't need ` while (line != null)`. That outer loop only executes once. @HovercraftFullOfEels – user207421 Apr 13 '16 at 00:41
  • This is an interesting data ingestion problem from the `ego networks` analysis field. To narrow it down, we first need a comparison at a small scale. Please run your Java code against 10, 100, 1000, 10,000 and 100,000 edges and tell us how many msec it takes for each of those dataset sizes. – loretoparisi Apr 13 '16 at 00:48
  • @John3136: I stand corrected. – Hovercraft Full Of Eels Apr 13 '16 at 00:52
  • @Zhedar 'As you won't need Strings you can save some created objects by saving the data to int directly' how? `Scanner` creates strings, and `nextInt()` then parses them to integers; every other way in the JDK is similar. Or are you assuming the file is binary? – user207421 Apr 13 '16 at 01:04
  • @EJP Using HashSet runs around 10% more slowly. Since each node tends to have under 300 edges I don't think that using a List here is terrible. – K. Stevens Apr 13 '16 at 01:07
  • @K.Stevens Hard to believe. You do realize that `Set.add()` returns a boolean? You don't need to test for existence. It should be many times as fast, as `HashSet` insertion is *O(1)*. Over a large dataset this will really add up (see the sketch after these comments). – user207421 Apr 13 '16 at 01:19
  • I'm just using links.add(node). I don't know why it's so slow. – K. Stevens Apr 13 '16 at 01:23
  • Does `Node` have a reasonable implementation of `hashCode()` and `equals()`? – user207421 Apr 13 '16 at 01:50
  • @K.Stevens how did you implement this in the end? Ran into a similar requirement. Did you use the hashmap or switched to list of lists? – AnOldSoul Feb 06 '17 at 00:37
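
To make the suggestion in the comments concrete, here is a sketch of `Node` backed by a `HashSet`, relying on the boolean return of `Set.add()` and on `hashCode()`/`equals()` derived from an integer id (the id field is an assumption about how nodes are identified):

import java.util.HashSet;
import java.util.Set;

public class Node {
    private final int id;                        // assumed numeric vertex id
    private final Set<Node> links = new HashSet<>();

    public Node(int id) {
        this.id = id;
    }

    // Set.add() already returns false for duplicates, so there is no separate
    // indexOf/contains scan; insertion is O(1) on average.
    public synchronized boolean addLink(Node node) {
        return links.add(node);
    }

    @Override
    public int hashCode() {
        return id;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Node && ((Node) o).id == this.id;
    }
}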

1 Answer


This question is interesting. It would be better if you could provide more information.

One important point you haven't mentioned: what kind of machine are you trying to reach that goal on? How much memory does it have? How fast is the CPU? How many cores? How fast is the I/O?

But anyway, here is some analysis that may help. If you can provide more information, we can analyze further.

1. Memory

(Revised; I made a mistake in my first answer: I didn't notice that you used an ArrayList.)

So you are using a HashMap of ArrayLists. That alone doesn't pin down the memory overhead; it depends on how the edges are distributed.

Suppose an Integer and an int are 4 bytes each, and a reference is 8 bytes (I may well be wrong here; just think of a reference as a pointer).

In the best case, say there's only one vertex linking to all other vertices, and this vertex is the first number on every line of your file. Then all the edges land in a single ArrayList, and memory would be about 200M * 8 bytes = 1.6 GB.

But in the worst case, still with one vertex linking to all the others, this vertex is now the second number on every line. Then every line introduces a distinct key with its own HashMap entry, and memory would be about 200M * 20 bytes = 4 GB.

The reason for the worst case is that, after skimming the source code of Java's HashMap, each node/entry of a HashMap contains these fields:

final int hash;
final K key;
V value;
Node<K,V> next;

2. Data Structure

As other people have already said, you need to pay attention to the data structure. A HashMap may or may not be suitable here.

Are all vertices known beforehand? For example, are all vertex ids in the range 0 to 20K? In that case, given this large a dataset, I wouldn't use a HashMap. Instead I would use a list of lists, which shrinks the per-edge overhead from 20 bytes to only 4 bytes, so I would only need about 800 MB of memory!
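
A sketch of that idea (the vertex-count bound is an assumption; note that boxed Integers cost more than 4 bytes per edge, so primitive int buffers would get closer to that figure):

import java.util.ArrayList;
import java.util.List;

public class DenseAdjacency {
    // one list of neighbours per vertex, indexed directly by vertex id
    private final List<List<Integer>> adjacency;

    public DenseAdjacency(int numVertices) {
        adjacency = new ArrayList<>(numVertices);
        for (int i = 0; i < numVertices; i++) {
            adjacency.add(new ArrayList<>());
        }
    }

    // no hashing and no per-entry HashMap node: adding an edge is a direct index
    public void addEdge(int from, int to) {
        adjacency.get(from).add(to);
    }

    public List<Integer> neighboursOf(int vertex) {
        return adjacency.get(vertex);
    }
}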

However, if the vertices are spread across the whole integer space, then this approach is not feasible. But even then, you may not be using the data structures correctly. Did you initialize the HashMap with enough capacity? A HashMap has to rehash when it becomes relatively full, which is extremely costly. Similarly, did you initialize each ArrayList with enough capacity? An ArrayList has to resize when it is full, which is also costly.
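
For example (the vertex count here is a guess; the per-vertex capacity of 300 comes from the comments above):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Presized {
    public static void main(String[] args) {
        int expectedVertices = 5_000_000;   // assumed number of distinct vertices
        int expectedDegree = 300;           // "under 300 edges" per node, per the comments

        // capacity >= expected / loadFactor means the map never has to rehash
        Map<Integer, List<Integer>> graph =
                new HashMap<>((int) (expectedVertices / 0.75f) + 1, 0.75f);

        // pre-sizing each adjacency list avoids repeated ArrayList resizing
        graph.put(12, new ArrayList<>(expectedDegree));
    }
}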

Lastly, I noticed you used a SynchronizedMap, which is a really bad idea in your case. A SynchronizedMap is nothing but a mutex wrapped around a HashMap: it locks the entire map whenever any thread modifies it, which means there is NO parallelism in your code. Instead, you should use ConcurrentHashMap, which locks at a much finer granularity than a SynchronizedMap. Intuitively, it only locks the bucket (linked list) it is modifying, so several threads modifying different buckets can potentially proceed in parallel.
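
A minimal sketch of that approach, using computeIfAbsent to create each adjacency set atomically (the concurrent key-set for the neighbours is my own choice):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class ConcurrentGraph {
    // locking is per bin, so threads adding edges for different vertices
    // do not serialize on a single lock the way a SynchronizedMap does
    private final ConcurrentHashMap<Integer, Set<Integer>> graph = new ConcurrentHashMap<>();

    public void addEdge(int from, int to) {
        graph.computeIfAbsent(from, k -> ConcurrentHashMap.newKeySet()).add(to);
    }

    public Set<Integer> neighboursOf(int vertex) {
        return graph.getOrDefault(vertex, Collections.emptySet());
    }
}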

3. Reading method

For reading such a large file, you may want to look at methods other than readLine. Other people have already pointed out FileChannel in the nio package; also check out MappedByteBuffer.
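
A sketch of memory-mapping the file and parsing the integers straight out of the bytes, so no String objects are created (the file name, and the assumption that the data is plain ASCII digits separated by spaces and newlines, are mine):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedEdgeReader {
    public static void main(String[] args) throws IOException {
        Path path = Paths.get("edges.txt");       // assumed file name
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // note: a single map() call is limited to 2 GB, so a real reader
            // would map a file this big in several chunks
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            int value = 0;
            boolean inNumber = false;
            while (buffer.hasRemaining()) {
                byte b = buffer.get();
                if (b >= '0' && b <= '9') {
                    value = value * 10 + (b - '0');   // accumulate digits of the current number
                    inNumber = true;
                } else if (inNumber) {
                    handle(value);                    // a whole number ends at a space or newline
                    value = 0;
                    inNumber = false;
                }
            }
            if (inNumber) {
                handle(value);                        // last number if the file lacks a final newline
            }
        }
    }

    private static void handle(int vertex) {
        // placeholder: pair up consecutive values into (from, to) edges and store them
    }
}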

Conclusion

In conclusion, it's hard to give concrete advice without knowing your environment and the shape of your data. Optimizations are usually tailored to a specific scenario, not applied in general.

Pufan Jiang