I have a large file with the following format:

Unique String \t Information

In my program I need to read this file to get the Information by its Unique String key. Since performance is important, I can't scan every line looking for the key each time, and I can't load the whole file into memory because it is too large. So I'd like to read the file only once and build an index mapping each String key to its position (in bytes) in the file. This index would be something like a HashMap, with the Unique String as the key and the byte offset where that key appears in the file as the value.

It seems that RandomAccessFile could do this, but I don't know how.

So, how can I build this index and then access a specific line through it?

Marcelo Machado
  • Hint: `HashMap` would be useful. – jack jay Feb 06 '17 at 20:56
  • @BackSlash Reading the file to build the index is not a problem, since it will be done only once. What I need is to build the index (in bytes) so I can jump to a specific byte. The question you marked as a duplicate has no solution for building this index. – Marcelo Machado Feb 06 '17 at 21:05
  • @jackjay I think so, but I don't know how to access the specific byte. – Marcelo Machado Feb 06 '17 at 21:10
  • @MarceloMachado What do you mean by "build an index to access a specific byte"? Is it a specific byte for a specific key? Why can't you use a HashMap where the key is the unique string and the value is a byte array? – Mzf Feb 06 '17 at 21:12
  • @MarceloMachado `I need to read this file to get the Information through the Unique String key` is much more important for a solution than building an index. Once you make the map you don't need to read a specific byte from the file. – jack jay Feb 06 '17 at 21:16
  • @Mzf Maybe my question is not clear enough, but perhaps you could help me. What I want is exactly what you said. How can I create this HashMap with these bytes, and how can I go to a specific byte in a text file? – Marcelo Machado Feb 06 '17 at 21:17
  • @MarceloMachado Again, I'm not sure what you are asking: (1) given a unique string, get a byte at a specific location, or (2) given a byte, find where it is located in the file. Which one did you mean? – Mzf Feb 06 '17 at 21:19
  • @Mzf The problem is that I can't read each line of the file looking for the String because it is too slow. I need to access the line directly, and for that I think I need an index that stores the key and the byte position where the cursor should go to get the information. What I don't know is how to do that! – Marcelo Machado Feb 06 '17 at 21:32
  • @jackjay Do you mean a map with the unique String as the key and the information as the value? If so, I can't do that because it is too much to hold in memory. – Marcelo Machado Feb 06 '17 at 21:34
  • @MarceloMachado So a HashMap from the key to the line of the key is what you are looking for? According to this, it's not possible: http://stackoverflow.com/questions/2312756/how-to-read-a-specific-line-using-the-specific-line-number-from-a-file-in-java – Mzf Feb 06 '17 at 21:35
  • @Mzf This is what I want, but look at szgal's comment. He said "And, without having an index of the positions of a byte in a file, the only way to know where those bytes are is to read it and look for it". What I want is to create this index! – Marcelo Machado Feb 06 '17 at 21:43
  • @MarceloMachado I have added a compile-able example to my answer. – matt Feb 07 '17 at 07:08

2 Answers

The way I am going to suggest is to read the file, and keep track of the position. Store the position along the way in a map so you can look it up later.

The first way to do this is to treat your file as a DataInput and use RandomAccessFile#readLine:

RandomAccessFile raf = new RandomAccessFile("filename.txt", "r");
Map<String, Long> index = new HashMap<>();

Now, how is your data stored? If it is stored line by line, and the encoding conforms to what DataInput expects, then you can record where each line starts before reading it:

long start = raf.getFilePointer();
String line = raf.readLine();
String key = extractKeyFromLine(line);
index.put(key, start);

Now, any time you need to go back and get the data:

long position = index.get(key);
raf.seek(position);
String line = raf.readLine();

Here is a complete example:

package helloworld;

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/**
 * Created by matt on 07/02/2017.
 */
public class IndexedFileAccess {
    static String getKey(String line){
        return line.split(":")[0];
    }
    public static void main(String[] args) throws IOException {
        Map<String, Long> index = new HashMap<>();
        RandomAccessFile file = new RandomAccessFile("junk.txt", "r");
        //populate index and read file.
        String s;
        do{
            long start = file.getFilePointer();
            s = file.readLine();
            if(s!=null){
                String key = getKey(s);
                index.put(key, start);
            }
        }while(s!=null);

        for(String key: index.keySet()){
            System.out.printf("key %s has a pos of %s\n", key, index.get(key));
            file.seek(index.get(key));
            System.out.println(file.readLine());
        }
        file.close();

    }
}

junk.txt contains:

dog:1, 2, 3
cat:4, 5, 6
zebra: p, z, t

Finally the output is:

key zebra has a pos of 24
zebra: p, z, t
key cat has a pos of 12
cat:4, 5, 6
key dog has a pos of 0
dog:1, 2, 3

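There are many caveats to this. RandomAccessFile#readLine() converts each byte directly to a character, so it does not decode multi-byte encodings such as UTF-8 correctly. If you need a more robust encoding, then the first time you read the file you'll want to create a reader that can manage the encoding, and just use your RandomAccessFile as an input stream. Be aware that a buffering reader reads ahead, so the file pointer no longer matches the start of the line you just read and you have to count the bytes yourself. readLine() may also be a poor fit if the lines are very large; in that case you would have to devise your own strategy for extracting the key/data pair.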
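As a rough illustration of the encoding caveat, here is a minimal sketch that counts bytes itself instead of relying on readLine(). It assumes the file is UTF-8, that lines end in "\n" (optionally preceded by "\r"), and that the key is the text before the first tab, as in the question. The class and method names (Utf8LineIndex, buildIndex, readLineAt) are made up for this example and are not part of any library.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class for this sketch; assumes UTF-8 and tab-separated lines.
class Utf8LineIndex {

    /** Reads the file once and maps each key to the byte offset where its line starts. */
    static Map<String, Long> buildIndex(Path path) throws IOException {
        Map<String, Long> index = new HashMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long pos = 0;       // number of bytes consumed so far
            long lineStart = 0; // offset of the first byte of the current line
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') { // 0x0A never occurs inside a multi-byte UTF-8 sequence
                    addEntry(index, line, lineStart);
                    line.reset();
                    lineStart = pos;
                } else {
                    line.write(b);
                }
            }
            if (line.size() > 0) {
                addEntry(index, line, lineStart); // last line without a trailing newline
            }
        }
        return index;
    }

    /** Decodes one raw line and stores its key; lines without a tab are skipped. */
    private static void addEntry(Map<String, Long> index, ByteArrayOutputStream line, long lineStart) {
        String text = new String(line.toByteArray(), StandardCharsets.UTF_8);
        int tab = text.indexOf('\t');
        if (tab >= 0) {
            index.put(text.substring(0, tab), lineStart);
        }
    }

    /** Seeks to the stored offset and decodes that single line as UTF-8. */
    static String readLineAt(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int b;
        while ((b = raf.read()) != -1 && b != '\n') {
            line.write(b);
        }
        String text = new String(line.toByteArray(), StandardCharsets.UTF_8);
        return text.endsWith("\r") ? text.substring(0, text.length() - 1) : text;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: Utf8LineIndex <file> <key>");
            return;
        }
        Path path = Paths.get(args[0]);
        Map<String, Long> index = buildIndex(path);
        Long offset = index.get(args[1]);
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            System.out.println(offset == null ? "no such key" : readLineAt(raf, offset));
        }
    }
}
```

The index pass goes through a BufferedInputStream, so counting bytes one at a time stays cheap. The lookup reads from the RandomAccessFile byte by byte, which is simple but unbuffered; for very long lines, reading a block with read(byte[]) and scanning it for the newline would probably be faster.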

Marcelo Machado
matt
  • That is exactly what I was looking for! – Marcelo Machado Feb 06 '17 at 21:54
  • Matt, I did what you said, but I am having a small problem. When I do this: long position = index.get(key); raf.seek(position); System.out.println(raf.readLine()); the result is always the next line. And the length of the lines is not fixed. – Marcelo Machado Feb 06 '17 at 22:51
  • @MarceloMachado I don't quite follow; the best-case scenario would be to make an example, e.g. what you are trying and what you get. – matt Feb 07 '17 at 06:59

I need to read this file to get the Information through the Unique String key.

For this part of your question, read the file line by line, split each line with split(), and store the key together with the byte offset at which its line starts in a Map, as follows:

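import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Maps each key to the byte offset at which its line starts.
// The offset arithmetic assumes a single-byte encoding and "\n" line endings.
Map<String, Long> map = new HashMap<>();
long offset = 0;

try (BufferedReader bufferedReader = new BufferedReader(new FileReader(fileName))) {
    String line;
    while ((line = bufferedReader.readLine()) != null) {
        String[] arr = line.split("\t"); // make sure your file contains data as you specified
        map.put(arr[0], offset);

        offset += line.length() + 1;     // +1 for the "\n" that readLine() strips
    }
} catch (IOException ex) {
    System.out.println("unable to read file '" + fileName + "'");
}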

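Now you can look up where the line for any key starts. Pass that offset to RandomAccessFile#seek and read the line from there to get the Information:

 map.get("specificString"); // returns the byte offset at which that key's line starts, or null if the key is absent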
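To go from a stored offset back to the Information, one option is to seek with a RandomAccessFile. The following is only a sketch: OffsetLookup and lookup are hypothetical names, the map is assumed to hold the byte offset of the start of each line, and RandomAccessFile#readLine() is only reliable here for ASCII or Latin-1 data.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical helper for this sketch; not part of the original answer.
class OffsetLookup {

    /** Returns the text after the first tab on the key's line, or null if the key is not indexed. */
    static String lookup(RandomAccessFile raf, Map<String, Long> index, String key) throws IOException {
        Long offset = index.get(key);
        if (offset == null) {
            return null;              // key was never indexed
        }
        raf.seek(offset);             // jump straight to the start of the line
        String line = raf.readLine(); // maps each byte to one char: ASCII / Latin-1 only
        if (line == null) {
            return null;              // offset pointed at or past end of file
        }
        int tab = line.indexOf('\t');
        return tab >= 0 ? line.substring(tab + 1) : "";
    }
}
```

Keeping one RandomAccessFile open and reusing it across lookups avoids reopening the file for every key.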
jack jay