I want to read text files and convert each word to a number, then write each file out as a sequence of numbers instead of words. I use a HashMap to assign exactly one number (identifier) to each word. For instance, if the word apple is assigned the number 10, I write 10 to the sequence whenever I see apple in a text file. I need a single HashMap shared across all files so that no word is ever assigned more than one identifier.

I wrote the following code, but it processes files very slowly: converting one 165.7 MB text file to a sequence file took 20 hours, and I need to convert 600 text files of about the same size. Is there any way to improve the efficiency of my code? The following function is called for each text file.
public void ConvertTextToSequence(File file) {
    try
    {
        // Open the output file in append mode
        FileWriter filewriter = new FileWriter(path.keywordDocIdsSequence, true);
        BufferedWriter bufferedWriter = new BufferedWriter(filewriter);
        String sequence = "";
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = bufferedReader.readLine();
        while (line != null)
        {
            StringTokenizer tokens = new StringTokenizer(line);
            String str;
            while (tokens.hasMoreTokens())
            {
                str = tokens.nextToken();
                if (keywordsId.containsKey(str))
                    // Known word: reuse its identifier
                    sequence = sequence + " " + keywordsId.get(str);
                else
                {
                    // New word: assign the next identifier
                    keywordsId.put(str, id);
                    sequence = sequence + " " + id;
                    id++;
                }
                // Write out the accumulated sequence when the map size is a multiple of 10000
                if (keywordsId.size() % 10000 == 0)
                {
                    bufferedWriter.append(sequence);
                    sequence = "";
                    start = id;
                }
            }
            line = bufferedReader.readLine();
        }
        // Write whatever is left in the sequence
        if (start < id)
        {
            bufferedWriter.append(sequence);
        }
        bufferedReader.close();
        fileReader.close();
        bufferedWriter.close();
        filewriter.close();
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
The constructor of that class is:
public ConvertTextToKeywordIds() {
    path = new LocalPath();
    repository = new RepositorySQL();
    keywordsId = new HashMap<String, Integer>();
    id = 1;
    start = 1;
}
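
For comparison, here is a minimal sketch of the same word-to-identifier mapping that writes each identifier straight to the BufferedWriter instead of accumulating it in a String. Each `sequence = sequence + " " + ...` copies the whole string built so far, and the string is only written out when the map size happens to be a multiple of 10000, so it can grow very large; that is a likely contributor to the slowdown, though it has not been measured here. The class name `WordIdEncoder` and the methods `convert`, `convertFile`, and `encode` are hypothetical names introduced for illustration and are not part of the code above.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical class; a sketch, not the original ConvertTextToKeywordIds.
class WordIdEncoder {
    // One map shared by every file, so each word keeps a single identifier.
    private final Map<String, Integer> keywordsId = new HashMap<>();
    private int id = 1;

    // Reads words from `in` and writes " <id>" for each word to `out`.
    public void convert(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        while ((line = reader.readLine()) != null) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken();
                Integer wordId = keywordsId.get(word); // single map lookup
                if (wordId == null) {
                    wordId = id++;
                    keywordsId.put(word, wordId);
                }
                // Write directly; the BufferedWriter batches the actual I/O.
                writer.write(' ');
                writer.write(wordId.toString());
            }
        }
        writer.flush(); // the caller owns and closes the underlying streams
    }

    // Appends the identifier sequence for one input file to `output`.
    public void convertFile(File input, File output) throws IOException {
        try (Reader in = new FileReader(input);
             Writer out = new FileWriter(output, true)) {
            convert(in, out);
        }
    }

    // Convenience wrapper for trying the mapping on an in-memory string.
    public String encode(String text) {
        StringWriter out = new StringWriter();
        try {
            convert(new StringReader(text), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }
}
```

Calling `convertFile` once per text file on the same `WordIdEncoder` instance keeps the identifiers consistent across files, because the map lives in the instance rather than in the method.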