OK, so we have this problem, and I know I can use an InputStream to read the file as a stream instead of loading the whole file at once, since loading it all would cause memory issues.

Referring to this answer: https://stackoverflow.com/a/14037510/1316967

However, the concern is speed, as I would in this case be reading every line of the entire file. Considering that this file contains millions of names in an unordered fashion and the operation has to complete in a few seconds, how do I go about solving this problem?

sunny_dev
  • Since your title says 'unordered', you have no option other than reading the whole file (worst case). You could add a `break` in the loop when you find the name, but this will only help if it is not the last name. – SilverNak Mar 21 '17 at 09:04
  • Do you want to do the search only once or multiple times? Why is ordering the list not possible? – tak3shi Mar 21 '17 at 09:15
  • Only a single search is required – sunny_dev Mar 21 '17 at 10:13

2 Answers

Because the list is unordered, there is no alternative to reading the entire file.

If you're lucky, the first name is the name you're looking for: O(1).

If you're unlucky, it's the last name: O(n).

Apart from this, it doesn't matter whether you do it the java.io way (Files.newBufferedReader()) or the java.nio way (Files.newByteChannel()); they both perform more or less the same. If the input file is line-based (as in your case), you may use

Files.lines(file).filter(l -> name.equals(l)).findFirst();

which internally uses a BufferedReader.
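
For instance, a minimal, self-contained sketch of that linear scan could look like the following; "yourFile" and "someName" are placeholders, and the try-with-resources block makes sure the underlying reader is closed:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.stream.Stream;

public class LinearSearch {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("yourFile");   // placeholder path
        String name = "someName";            // placeholder name to search for

        // Files.lines() streams the file lazily, so memory use stays flat even for
        // millions of lines; findFirst() stops reading as soon as a match is found.
        try (Stream<String> lines = Files.lines(file)) {
            Optional<String> match = lines.filter(name::equals).findFirst();
            System.out.println(match.isPresent() ? "found" : "not found");
        }
    }
}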

If you really want to speed things up, you have to sort the names in the file (see How do I sort very large files); then you're able to read from an ordered list using an index.

EDIT: reading from an ordered list using an index

Once you have an ordered list, you could fast-scan it (seek to each block boundary and read just one line per block, rather than every line) to build an index in a TreeMap, and then jump right to the correct file position (using a RandomAccessFile or SeekableByteChannel) and read the name.

For example:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Returns the file position of the name, or -1 if it is not present.
// Note: RandomAccessFile.readLine() assumes one byte per character, so this
// works for ASCII/ISO-8859-1 content but not for multi-byte UTF-8 names.
static long findPosition(String name) throws IOException {
    long blockSize = 1048576L;
    Path file = Paths.get("yourFile");

    long fileSize = Files.size(file);
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {

        // create the index: one full line per block, mapped to its file position
        TreeMap<String, Long> index = new TreeMap<>();
        for (long pos = 0; pos < fileSize; pos += blockSize) {
            // jump to the next block
            raf.seek(pos);
            if (pos > 0) {
                raf.readLine(); // discard the partial line the seek landed in
            }
            long lineStart = raf.getFilePointer();
            String line = raf.readLine();
            if (line != null) {
                index.put(line, lineStart);
            }
        }

        // get the beginning and end of the block that can contain the name;
        // higherEntry (not ceilingEntry) keeps a name that is itself an index key inside the range
        long offset = Optional.ofNullable(index.lowerEntry(name)).map(Map.Entry::getValue).orElse(0L);
        long limit = Optional.ofNullable(index.higherEntry(name)).map(Map.Entry::getValue).orElse(fileSize);

        // move the pointer to the offset position and scan up to the limit
        raf.seek(offset);
        long cur;
        while ((cur = raf.getFilePointer()) < limit) {
            if (name.equals(raf.readLine())) {
                return cur;
            }
        }
        return -1; // not found
    }
}

The block size is a trade-off between index size, index-creation time and data-access time. The larger the blocks, the smaller the index and the shorter the index-creation time, but the longer the data-access time.
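
To put rough numbers on that: with the 1 MiB block size used above, a 1 GiB file yields about 1,024 index entries, and after the single indexing pass each lookup only has to scan at most roughly one 1 MiB block instead of the whole file.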

Gerald Mücke
  • Very concise answer. +1 – Ad Infinitum Mar 21 '17 at 09:13
  • Can you explain 'fast-scan' in your approach? Don't you think that would involve reading the entire file anyway, which defeats the purpose of all that sorting and index creation? If we were going to read the entire file anyway, we might as well go with your first approach. Sorry if I misunderstood your 'fast-scan'. – sunny_dev Mar 21 '17 at 10:19

I would suggest moving the data to a database (check out SQLite for a serverless option).
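
As a rough sketch of that idea, assuming the sqlite-jdbc driver is on the classpath and using "names.db", "yourFile" and "someName" as placeholder values, the one-time import plus indexed lookup could look roughly like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteLookup {
    public static void main(String[] args) throws IOException, SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:names.db")) {
            // one-time import: create the table and an index on the name column
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS names(name TEXT)");
                st.execute("CREATE INDEX IF NOT EXISTS idx_name ON names(name)");
            }
            conn.setAutoCommit(false); // batch the inserts into one transaction
            try (BufferedReader reader = Files.newBufferedReader(Paths.get("yourFile"));
                 PreparedStatement insert = conn.prepareStatement("INSERT INTO names(name) VALUES (?)")) {
                String line;
                int count = 0;
                while ((line = reader.readLine()) != null) {
                    insert.setString(1, line);
                    insert.addBatch();
                    if (++count % 10_000 == 0) {
                        insert.executeBatch(); // flush periodically to keep memory flat
                    }
                }
                insert.executeBatch();
            }
            conn.commit();

            // subsequent lookups use the index instead of scanning the whole file
            try (PreparedStatement query = conn.prepareStatement("SELECT 1 FROM names WHERE name = ?")) {
                query.setString(1, "someName");
                try (ResultSet rs = query.executeQuery()) {
                    System.out.println(rs.next() ? "found" : "not found");
                }
            }
        }
    }
}

The import still reads the whole file once, so it only pays off if the lookup is repeated; for a one-off search a sequential scan is hard to beat.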

If that is not possible, you can try having multiple threads read the file, each starting at a different offset and reading only a portion of the file.

You would have to use a RandomAccessFile. This will only be beneficial if you are on a RAID system, as benchmarked here: http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2
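
A minimal sketch of that chunked, multi-threaded scan might look like the following; the class and method names are made up for illustration, "yourFile" and "someName" are placeholders, the thread count is arbitrary, and (as noted above and in the comments) the parallelism only helps if the underlying storage can actually serve the threads in parallel:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScan {

    // Scans one chunk of the file for an exact line match; returns the file
    // position of the match, or -1 if the chunk does not contain the name.
    static long scanChunk(String path, String name, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(start);
            if (start > 0) {
                raf.readLine(); // skip the partial line the seek landed in
            }
            long pos;
            while ((pos = raf.getFilePointer()) <= end) {
                String line = raf.readLine();
                if (line == null) {
                    break; // end of file
                }
                if (name.equals(line)) {
                    return pos;
                }
            }
            return -1;
        }
    }

    public static void main(String[] args) throws Exception {
        String path = "yourFile";   // placeholder path
        String name = "someName";   // placeholder name to search for
        int threads = 4;            // arbitrary; tune to the storage, not the CPU

        long fileSize = new java.io.File(path).length();
        long chunk = fileSize / threads + 1;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            long start = i * chunk;
            long end = Math.min(start + chunk, fileSize);
            results.add(pool.submit(() -> scanChunk(path, name, start, end)));
        }
        for (Future<Long> result : results) {
            long pos = result.get();
            if (pos >= 0) {
                System.out.println("found at position " + pos);
            }
        }
        pool.shutdown();
    }
}

Each task opens its own RandomAccessFile, so the threads do not share a file pointer; whether this beats a single sequential read depends entirely on the disk, as the benchmark linked above shows.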

Edd
  • Using multiple threads won't help you here, as the physical disk is the shared and limiting resource that only allows ONE thread to read it at a time, so multiple threads just wait for each other to get access to the file. Things get worse if you don't have an SSD, as the threads will reposition the head all the time, increasing access latency. So the fastest way is sequential reading, unless you know exactly where to look (aka an ordered list). – Gerald Mücke Mar 21 '17 at 09:13
  • "This will not be beneficial if you have a magnetic disk instead of SSD though, as explained here:" – Edd Mar 21 '17 at 09:20
  • The SSD itself can access blocks in parallel, but the commands to execute are queued and serialized. So SSDs have the same concurrent access limitations as HDs (especially in combination with concurrent writes taking place on the disk), but the impact is far less than with magnetic disks. – Gerald Mücke Mar 21 '17 at 09:32
  • 1
    Yes, only a RAID system will provide benefit from multithreading. http://www.drdobbs.com/parallel/multithreaded-file-io/220300055?pgno=2 – Edd Mar 21 '17 at 09:43