Improve the speed of reading and writing big files with BufferedWriter/BufferedReader

Question

I want to read text files and convert each word to a number, then write, for each file, the resulting sequence of numbers to a new file in place of the words. I use a HashMap to assign exactly one number (identifier) to each word. For instance, if the word apple is assigned the number 10, I write 10 to the sequence whenever I see apple in a text file. I need a single HashMap shared across all files so that no word gets more than one identifier. I wrote the following code, but it processes files slowly: converting a 165.7 MB text file to a sequence file took 20 hours, and I need to convert 600 text files of the same size. Is there any way to improve the efficiency of my code? The following function is called for each text file.

public void ConvertTextToSequence(File file) {
    try {
        // Output is opened in append mode (second argument true).
        FileWriter filewriter = new FileWriter(path.keywordDocIdsSequence, true);
        BufferedWriter bufferedWriter = new BufferedWriter(filewriter);

        String sequence = "";
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = bufferedReader.readLine();
        while (line != null) {
            StringTokenizer tokens = new StringTokenizer(line);

            String str;
            while (tokens.hasMoreTokens()) {
                str = tokens.nextToken();
                // Reuse the identifier of a known word, otherwise assign the next free one.
                if (keywordsId.containsKey(str)) {
                    sequence = sequence + " " + keywordsId.get(str);
                } else {
                    keywordsId.put(str, id);
                    sequence = sequence + " " + id;
                    id++;
                }

                // Hand the accumulated sequence to the writer whenever the map size is a multiple of 10000.
                if (keywordsId.size() % 10000 == 0) {
                    bufferedWriter.append(sequence);
                    sequence = "";
                    start = id;
                }
            }
            line = bufferedReader.readLine();
        }

        // Write whatever is left after the last line.
        if (start < id) {
            bufferedWriter.append(sequence);
        }

        bufferedReader.close();
        fileReader.close();

        bufferedWriter.close();
        filewriter.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The constructor of that class is:

public ConvertTextToKeywordIds() {
    path = new LocalPath();
    repository = new RepositorySQL();
    keywordsId = new HashMap<String, Integer>();
    id = 1;
    start = 1;
}

That code would not compile. If you want us to tell you why your real code is slow, post your real code. What I can already tell is that appending to a string and waiting for the map to have 10000 elements before writing the string to the writer is very, very counter-productive. Write to the writer directly, and let it do its job: buffering. – JB Nizet Jan 11 '16 at 15:42
@JBNizet My real code is big and complicated because I clean each token before I insert it into the HashMap. I fixed some errors in my code; I think the function compiles now. – Suri Jan 11 '16 at 16:09

Josh Kergan · Accepted Answer · 2016-01-11T16:12:08.340

I suspect that the speed of your program is tied to the rehashing of the hash map as the number of words grows. Each rehash can incur a significant time penalty as the size of the hash map grows. You could try and estimate the number of unique words you expect and use that to initialize the hash map.
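A minimal sketch of that presizing idea follows. The class name `PresizedVocabulary`, its helper methods, and the estimate of one million unique words are assumptions made for illustration, not part of the question's code. It relies on `HashMap`'s default load factor of 0.75, so the initial capacity is chosen as the expected entry count divided by 0.75. Whether rehashing is really the bottleneck here has not been measured.

```java
import java.util.HashMap;
import java.util.Map;

class PresizedVocabulary {

    // Smallest capacity that holds expectedEntries without a resize,
    // given HashMap's default load factor of 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) Math.ceil(expectedEntries / 0.75);
    }

    // Creates the word -> identifier map with room for the expected vocabulary.
    static Map<String, Integer> newVocabulary(int expectedUniqueWords) {
        return new HashMap<>(capacityFor(expectedUniqueWords));
    }

    // Returns the identifier of a word, assigning the next free one (starting at 1)
    // the first time the word is seen. Uses a single lookup for known words.
    static int idFor(Map<String, Integer> vocabulary, String word) {
        Integer existing = vocabulary.get(word);
        if (existing != null) {
            return existing;
        }
        int id = vocabulary.size() + 1;
        vocabulary.put(word, id);
        return id;
    }

    public static void main(String[] args) {
        // Assumed estimate of the number of unique words across all files.
        Map<String, Integer> vocabulary = newVocabulary(1_000_000);
        System.out.println(idFor(vocabulary, "apple")); // 1
        System.out.println(idFor(vocabulary, "pear"));  // 2
        System.out.println(idFor(vocabulary, "apple")); // 1
    }
}
```

If the estimate is too low, the map still works; it simply resizes as it would have otherwise.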

As mentioned by @JB Nizet, you may want to write directly to the buffered writer rather than waiting to accumulate a number of entries, since the buffered writer already holds output in memory and writes to the file only when its buffer fills up.
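Writing straight to the buffered writer could look roughly like the sketch below. The class name `DirectSequenceWriter` and the `convert` signature are hypothetical. It takes a generic `Reader` and `Writer` so it can be tried without files, and it omits the token cleaning that the real code is said to perform.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

class DirectSequenceWriter {

    // Reads words from 'in' and writes " <id>" for each word to 'out'.
    // No intermediate String is built; BufferedWriter does the batching.
    static void convert(Reader in, Writer out, Map<String, Integer> keywordsId) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                Integer id = keywordsId.get(word);
                if (id == null) {
                    // New word: identifiers start at 1 and grow with the map.
                    id = keywordsId.size() + 1;
                    keywordsId.put(word, id);
                }
                writer.write(' ');
                writer.write(id.toString());
            }
        }
        // Push buffered output through; closing 'in' and 'out' is left to the caller.
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> keywordsId = new HashMap<>();
        StringWriter out = new StringWriter();
        convert(new StringReader("apple pear apple\npear fig"), out, keywordsId);
        System.out.println(out); // " 1 2 1 2 3"
    }
}
```

For real files, the caller would pass a `FileReader` and a `FileWriter` (opened in append mode if all sequences go to one file), ideally inside a try-with-resources block, and reuse the same map for every file.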

score 1 · Answer 2 · answered Jan 11 '16 at 15:54


Your most effective performance boost is probably using StringBuilder instead of String for your sequence.

I would also write and flush the sequence each time it exceeds a certain length rather than whenever you've added 10000 words to your map.
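The two suggestions above can be sketched together as follows. `SequenceBuffer` and its `flushThreshold` parameter are hypothetical names, and the threshold used in `main` is deliberately tiny so the behaviour is visible. If the target is already a `BufferedWriter`, this extra layer may add little, since the writer buffers on its own.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

class SequenceBuffer {
    private final StringBuilder pending = new StringBuilder();
    private final Writer out;
    private final int flushThreshold;

    // flushThreshold is the number of buffered characters that triggers a write.
    SequenceBuffer(Writer out, int flushThreshold) {
        this.out = out;
        this.flushThreshold = flushThreshold;
    }

    // Appends " <id>" in place; StringBuilder avoids copying the whole
    // sequence on every append, unlike String concatenation.
    void append(int id) throws IOException {
        pending.append(' ').append(id);
        if (pending.length() >= flushThreshold) {
            flush();
        }
    }

    // Writes the buffered text, flushes the writer, and resets the builder.
    void flush() throws IOException {
        out.append(pending);
        out.flush();
        pending.setLength(0);
    }

    public static void main(String[] args) throws IOException {
        StringWriter target = new StringWriter();
        SequenceBuffer buffer = new SequenceBuffer(target, 4);
        buffer.append(10); // " 10" is 3 characters, stays buffered
        buffer.append(7);  // " 10 7" is 5 characters, gets written
        buffer.append(3);  // " 3" stays buffered until the final flush
        buffer.flush();
        System.out.println(target); // " 10 7 3"
    }
}
```

A final `flush()` after the last token is needed so the tail of the sequence is not lost.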

This map could get pretty huge - have you considered improving that? If you hit millions of entries you may get better performance using a database.

answered Jan 11 '16 at 15:54 by OldCurmudgeon

Thanks for your suggestions. Why do you suggest StringBuilder instead of String? I need to keep all the information in main memory, so I cannot use a database. – Suri Jan 11 '16 at 16:06
@Suri - Concatenating `String` is very expensive - `StringBuilder` is designed to be much better when concatenating. – OldCurmudgeon Jan 11 '16 at 16:08

@Suri - See [String builder vs string concatenation](http://stackoverflow.com/q/18453458/823393) for discussion. – OldCurmudgeon Jan 11 '16 at 16:11
