This is an interesting question; it would be easier to answer with more information.
One important point you are missing: what kind of machine are you going to run this on? How much memory does it have? How fast is the CPU? How many cores? How fast is the I/O?
Anyway, here is some analysis that may help. If you can provide more details, we can analyze further.
1. Memory
(Revised: I made a mistake in my first answer; I didn't notice you were using an ArrayList.)
So you are using a HashMap of ArrayLists. That alone, however, does not determine the memory overhead.
Suppose an Integer and an int are 4 bytes each, and a reference is 8 bytes (I may well be off here; just think of a reference as a pointer).
In the best case, say there is only one vertex linked to all the other vertices, and that vertex is always the first number on each line of your file. Then everything lands in a single ArrayList, roughly one reference per edge, and the memory would be about 200M * 8 bytes = 1.6 GB.
But in the worst case, with one vertex still linked to all the others but now appearing as the second number on each line, every edge creates its own HashMap entry, and the memory would be about 200M * 20 bytes = 4 GB.
The reason for the worst case is the per-entry overhead: after a skim of Java's HashMap source code, each node/entry of a HashMap contains these fields.
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
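These are only back-of-the-envelope numbers. If you want to see what your JVM actually spends, one rough sketch is to build a smaller synthetic sample of the graph and measure the heap before and after (the 1M edge count and the 0 to 20K id range below are placeholder assumptions, not your real data):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    public class MemoryEstimate {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            System.gc();
            long before = rt.totalMemory() - rt.freeMemory();

            // Build a synthetic adjacency map with 1M edges and scale the result
            // up to your real edge count (200M) afterwards.
            Map<Integer, ArrayList<Integer>> graph = new HashMap<>();
            Random rnd = new Random(42);
            for (int i = 0; i < 1_000_000; i++) {
                int from = rnd.nextInt(20_000);   // assumption: vertex ids in [0, 20K)
                int to = rnd.nextInt(20_000);
                graph.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
            }

            System.gc();
            long after = rt.totalMemory() - rt.freeMemory();
            System.out.println("approx. bytes used: " + (after - before)
                    + " for " + graph.size() + " vertices / 1M edges");
        }
    }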
2. Data Structure
As other people have already said, you need to pay attention to the data structure; a HashMap may or may not be suitable here.
Are all the vertices known beforehand, for example all in the range 0 to 20K? In that case I would not use a HashMap for a dataset this large. Instead I would use a list of lists, which shrinks the per-edge cost from about 20 bytes to only 4 bytes, so I would need only about 800 MB of memory!
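A minimal sketch of that idea follows. Note that to actually reach about 4 bytes per edge it stores neighbors in primitive int arrays rather than boxed Integers; the class and field names are just illustrative:

    import java.util.Arrays;

    // Adjacency lists for a graph whose vertex ids are dense ints in [0, numVertices).
    // Primitive int[] buffers keep each stored edge at roughly 4 bytes (plus a small
    // per-array overhead), instead of a HashMap entry plus a boxed Integer per edge.
    public class IntAdjacency {
        private final int[][] neighbors;  // neighbors[v] holds the targets of v's edges
        private final int[] degree;       // how many slots of neighbors[v] are in use

        public IntAdjacency(int numVertices) {
            neighbors = new int[numVertices][];
            degree = new int[numVertices];
        }

        public void addEdge(int from, int to) {
            int[] list = neighbors[from];
            if (list == null) {
                list = new int[4];                           // small initial buffer
                neighbors[from] = list;
            } else if (degree[from] == list.length) {
                list = Arrays.copyOf(list, list.length * 2); // grow like ArrayList does
                neighbors[from] = list;
            }
            list[degree[from]++] = to;
        }

        public int[] neighborsOf(int v) {
            return neighbors[v] == null ? new int[0] : Arrays.copyOf(neighbors[v], degree[v]);
        }
    }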
However, if the vertex ids are spread over the whole integer space, then this method is not feasible. Even so, you may not be using the data structures correctly. Did you initialize the HashMap with enough capacity? A HashMap has to rehash when it gets relatively full, which is extremely costly. Similarly, did you initialize the ArrayLists with enough capacity? An ArrayList has to resize when it is full, which is also costly.
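If you know the rough counts up front, pre-sizing looks something like this (the vertex count and average degree below are placeholders for your real numbers):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.Map;

    public class PreSized {
        public static void main(String[] args) {
            int expectedVertices = 20_000;   // assumption: replace with your real vertex count
            int expectedDegree = 10_000;     // assumption: rough average neighbors per vertex

            // HashMap rehashes once size exceeds capacity * loadFactor (0.75 by default),
            // so request capacity = expected / 0.75 to avoid rehashing entirely.
            Map<Integer, ArrayList<Integer>> graph =
                    new HashMap<>((int) (expectedVertices / 0.75f) + 1);

            // Likewise, an ArrayList created with its final size never has to copy its backing array.
            graph.put(0, new ArrayList<>(expectedDegree));
        }
    }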
Lastly, I noticed you used a SynchronizedMap, which is a really bad idea in your case. A SynchronizedMap is nothing but a mutex wrapped around a HashMap: it locks the entire map whenever any thread modifies it, which means there is effectively NO parallelism in your code. Instead, you should use ConcurrentHashMap, which locks at a much finer granularity than a SynchronizedMap. Intuitively, it only locks the bin (the small linked list of entries) it is modifying, so several threads modifying different bins can potentially proceed in parallel.
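A minimal sketch of what that could look like, assuming the values stay as lists of neighbor ids (keep in mind that ConcurrentHashMap only protects the map itself; if several threads can append to the same vertex's list, that list still needs its own synchronization):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentGraph {
        // ConcurrentHashMap locks only the bin being modified, so threads inserting
        // edges for different vertices can proceed in parallel.
        private final Map<Integer, List<Integer>> graph = new ConcurrentHashMap<>();

        public void addEdge(int from, int to) {
            // computeIfAbsent atomically creates the neighbor list on first use.
            // The map guards its own structure, not the lists inside it, so the
            // list is wrapped in a synchronized view here.
            graph.computeIfAbsent(from, k -> Collections.synchronizedList(new ArrayList<>()))
                 .add(to);
        }
    }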
3. Reading method
For reading such a large file, you may want to look at methods other than readLine. Other people have already pointed out FileChannel in the nio package. Also check out MappedByteBuffer.
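Here is a minimal sketch of scanning the file through MappedByteBuffer (the file name is a placeholder, and real parsing code would also have to handle a number that straddles two mapped windows):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            try (FileChannel ch = FileChannel.open(Paths.get("edges.txt"),   // placeholder file name
                                                   StandardOpenOption.READ)) {
                long size = ch.size();
                long pos = 0;
                long newlines = 0;
                // A single MappedByteBuffer is capped at ~2 GB, so map the file in windows.
                while (pos < size) {
                    long chunk = Math.min(size - pos, 1L << 30);  // 1 GB per mapping
                    MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, chunk);
                    while (buf.hasRemaining()) {
                        if (buf.get() == '\n') {
                            newlines++;   // real code would parse the two ints of each edge here
                        }
                    }
                    pos += chunk;
                }
                System.out.println("lines: " + newlines);
            }
        }
    }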
Conclusion
In conclusion, it is hard to give concrete advice unless you share your environment and the shape of your data. Optimizations are usually tied to a specific scenario, not general rules.