
I'm reading a huge file (almost 5 million lines). Each line contains a date and a request, and I must parse the requests between two concrete **Date**s. I use a BufferedReader to read the file until the start **Date** and then start parsing lines. Can I use threads for parsing the lines, since it takes a lot of time?

Patrick
  • have you tried reading the file sequentially, without parsing, to measure if it is too slow for your use-case? – mxb Jul 07 '14 at 15:19
  • it takes only a few seconds to read the file without parsing, but it takes almost 150 seconds with parsing – user3229532 Jul 07 '14 at 15:21
  • and 150 seconds is too much? what is your use-case: do you have to do this inline in a web app, or is it a batch program? – mxb Jul 07 '14 at 15:22
  • use a profiler to determine performance bottlenecks. if disk I/O is the bottleneck, use the [memory-mapped file](http://ashkrit.blogspot.ru/2012/11/power-of-java-memorymapped-file.html) approach to process your file. –  Jul 07 '14 at 15:24
  • what is the date format? MM/DD/YYYY? you can try with a custom date parser – mxb Jul 07 '14 at 15:24
  • it's a jar that runs from cmd – user3229532 Jul 07 '14 at 15:24
  • date parsing is not slow, I want to use threads for parsing lines – user3229532 Jul 07 '14 at 15:25
  • If the dates are sequential, you can use a binary search approach to reduce your complexity from O(n) to O(log n). – njzk2 Jul 07 '14 at 15:29
  • You could put each line into a Queue and have threads read from the queue to process them (BlockingQueue, threads reading from it). But unless the Request takes any meaningful time to parse (which I'd guess is unlikely if it's a single line in the file), you're probably not going to see a significant performance gain (the overhead of thread blocking will outweigh any parallel processing gains). It's also likely that just reading the file will be the slowest part anyway, so threading the parsing won't really help. But the best approach is just to test and see. – user1676075 Jul 07 '14 at 15:31
  • If date parsing is not slow, and file reading is not slow... then what is slow here? – MxLDevs Jul 07 '14 at 15:35
  • parsing requests is slow – user3229532 Jul 07 '14 at 15:36
  • duplicate? ---> http://stackoverflow.com/questions/5868369/how-to-read-a-large-text-file-line-by-line-using-java – kiltek Jul 07 '14 at 15:40
  • NO, my problem is not reading the file, my problem is how to use threads for parsing lines – user3229532 Jul 07 '14 at 15:45
  • @user3229532 Then your question subject and the tags you picked are very misleading. You should edit them to say what you actually want. – Philipp Jul 07 '14 at 16:18
  • Are you reparsing the entire 5 million-line file more than once? – rob Jul 07 '14 at 16:53

3 Answers


A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run. The executor should be initialized with the number of CPU cores your system has available.

When you call executor.execute(future), the future is queued for background processing. To avoid creating and destroying too many threads, the executor will only create as many threads as you specified and execute the futures one after another.

To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started), this method blocks until it has. Other futures are executed in the background while you wait.

Remember to call executor.shutdown() when you don't need the executor anymore, to make sure it terminates its background threads; otherwise it keeps them around until the keep-alive time has expired or the executor is garbage-collected.

tl;dr pseudocode:

 create executor
 for each line in file
     create new FutureTask which parses that line
     pass future task to executor
     add future task to a list
 for each entry in task list
     call entry.get() to retrieve result
 executor.shutdown()
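
For illustration, here is a minimal runnable sketch of this pattern; the file name and the parseRequest method are hypothetical stand-ins for the asker's actual parsing logic:

    import java.io.BufferedReader;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.FutureTask;

    public class ParallelParser {

        // Hypothetical stand-in for the actual request-parsing logic.
        static String parseRequest(String line) {
            return line.substring(line.indexOf(' ') + 1); // e.g. strip the date prefix
        }

        public static void main(String[] args) throws Exception {
            // one worker thread per available CPU core
            ExecutorService executor =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            List<Future<String>> results = new ArrayList<>();

            try (BufferedReader reader = Files.newBufferedReader(Paths.get("requests.log"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    final String current = line; // lambdas need an effectively final variable
                    FutureTask<String> task = new FutureTask<>(() -> parseRequest(current));
                    executor.execute(task);      // queued; a pool thread will run it
                    results.add(task);
                }
            }

            for (Future<String> result : results) {
                System.out.println(result.get()); // blocks until that task completes
            }
            executor.shutdown();                  // release the pool's background threads
        }
    }

With Java 8 you could also pass a Callable straight to executor.submit(), which wraps it in a Future for you; the explicit FutureTask above matches the description in this answer. For 5 million lines you may want to submit batches of lines rather than one task per line, but the pattern is the same.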
awksp
Philipp

It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.

If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.

If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
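
As a rough illustration of that incremental step, something like the following could remember the byte offset of the last record parsed and resume from there (the offset bookkeeping and file name are assumptions, not code from the question):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class IncrementalParser {
        private long lastOffset = 0; // persist this between checks, e.g. alongside the database

        // Parses only the records appended to the file since the previous call.
        void parseNewRecords(String path) throws IOException {
            try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
                file.seek(lastOffset); // skip everything already parsed
                String line;
                // note: RandomAccessFile.readLine() assumes a single-byte encoding
                while ((line = file.readLine()) != null) {
                    // ... parse the record and update the database or in-memory structure ...
                }
                lastOffset = file.getFilePointer(); // remember where we stopped
            }
        }
    }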

rob

Firstly, 5 million lines of 1,000 characters each is only about 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.

Secondly, if that is not possible, most likely the right thing to do is to build an ordered map keyed on the date: every date is a key in the map and points to a list of the line numbers which contain the requests for that date. You can then go directly to the relevant line numbers.

Something of the form

TreeMap<Date, List<Integer>>()

would do nicely (a TreeMap keeps its keys ordered). That should have a memory usage on the order of 5,000,000 × 32 / 8 bytes = 20 MB, which should be fine.
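
A sketch of building such an index in one sequential pass; the yyyy-MM-dd line prefix and file layout are assumptions, since the question doesn't give the actual format:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;
    import java.util.TreeMap;

    public class DateIndex {

        // Maps each date to the numbers of the lines whose requests fall on that date.
        static TreeMap<Date, List<Integer>> buildIndex(String path)
                throws IOException, ParseException {
            SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd"); // assumed format
            TreeMap<Date, List<Integer>> index = new TreeMap<>();
            try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
                String line;
                int lineNumber = 0;
                while ((line = reader.readLine()) != null) {
                    lineNumber++;
                    Date date = format.parse(line.substring(0, 10)); // assumed date prefix
                    index.computeIfAbsent(date, d -> new ArrayList<>()).add(lineNumber);
                }
            }
            return index;
        }
    }

Because the TreeMap keeps its keys sorted, the line numbers for a date range can then be fetched with index.subMap(startDate, endDate).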

You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. This also allows memory mapping.

See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html

And http://en.wikipedia.org/wiki/Memory-mapped_file
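
For illustration, a minimal sketch of memory-mapping part of a file with FileChannel; the file name and the mapped region are placeholders:

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MappedRead {
        public static void main(String[] args) throws IOException {
            try (FileChannel channel = FileChannel.open(
                    Paths.get("requests.log"), StandardOpenOption.READ)) {
                // Map the first 4 KB; real code would map the region containing
                // the byte offsets looked up via the date index.
                int size = (int) Math.min(channel.size(), 4096);
                MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
                byte[] bytes = new byte[size];
                buffer.get(bytes);
                System.out.println(new String(bytes, StandardCharsets.UTF_8));
            }
        }
    }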

phil_20686