
I have a file containing the document index and its publication date:

0, 2012-05-26T00:00:00Z
1, 2012-05-26T00:00:00Z
5, 2010-06-26T00:00:00Z
10, 2014-05-26T00:00:00Z

and a second text file containing the term, its frequency, and the index of the document it belongs to:

was, 15, 1
kill, 10,1
tunisia, 5, 5
peace, 1, 0

I have this method that matches both files so I can get a third file in this form:

was, 15, 2012-05-26T00:00:00Z
kill, 10,2012-05-26T00:00:00Z
tunisia, 5, 2010-06-26T00:00:00Z
peace, 1, 2012-05-26T00:00:00Z

I tested the method on a small test file and it works fine, but my files are about 1 TB in size, so my program has been running for 4 days and is still going. Would you please help me optimize it, or suggest another method?

 public void matchingDateTerme(String pathToDateFich, String pathTotermeFich) {
    try {
        BufferedReader inTerme = new BufferedReader(new FileReader(pathTotermeFich));
        BufferedReader inDate = new BufferedReader(new FileReader(pathToDateFich));
        String lineTerme, lineDate;
        String idFich, idFichDate, dateterm, key;
        Hashtable<String, String> table = new Hashtable<String, String>();
        String[] tokens, dates;
        Enumeration ID = null;
        File tempFile = new File(pathTotermeFich.replace("fichierTermes", "fichierTermes_final"));
        FileWriter fileWriter = new FileWriter(tempFile);
        BufferedWriter writer = new BufferedWriter(fileWriter);

        // read the date file into the hashtable (index -> date)
        while ((lineDate = inDate.readLine()) != null) {
            dates = lineDate.split(", ");
            idFichDate = dates[0].toLowerCase();
            dateterm = dates[1];
            table.put(idFichDate, dateterm);
        }

        // read the term file and find the date of each document index
        while ((lineTerme = inTerme.readLine()) != null) {
            tokens = lineTerme.split(", ");
            idFich = tokens[2].toLowerCase();
            String terme = tokens[0];
            String freq = tokens[1];
            // scan the hashtable keys
            ID = table.keys();
            while (ID.hasMoreElements()) {
                key = (String) ID.nextElement();
                if (key.equalsIgnoreCase(idFich)) {
                    String line = terme + ", " + freq + ", " + table.get(key);
                    System.out.println("Line: " + line);
                    writer.write(line);
                    writer.newLine();
                }
            }
        }

        writer.close();
        inTerme.close();
        inDate.close();

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
 }
  • You should use a divide and conquer approach (https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms): split your input files with one file per index modulo, merge the files for the same modulo, then build the final file that aggregates the merged files. You can then find the fastest divisor (https://en.wikipedia.org/wiki/Modulo_operation). It will be a lot faster, since you only compare rows with a matching modulo – avianey Oct 12 '16 at 14:28
  • You should be able to run it in less than one day on a standard hard drive – avianey Oct 12 '16 at 14:30
  • Another approach could be to generate fixed-line-length files and sort the lines using a memory-mapped file, then run your algorithm on the sorted files. I think divide and conquer is faster – avianey Oct 12 '16 at 14:32
  • Would you please explain your method briefly? I didn't get it. Thanks – Imen Majdoubi Oct 12 '16 at 15:01
  • See my answer below – avianey Oct 12 '16 at 15:17

3 Answers


You are not using the Hashtable for what it is: an object that maps keys to values.

Iterating over the keys is useless and expensive; just use the get method:

if (table.get(idFich) != null) {
    String line = terme + ", " + freq + ", " + table.get(idFich);
    System.out.println("Line: " + line);
    writer.write(line);
    writer.newLine();
}

As VGR said in the comments, using a HashMap, which is not synchronized, will be faster.
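To make the improvement concrete, here is a minimal sketch of the second loop rewritten with a HashMap and a direct lookup, reusing the variable names from the question; it is just the idea, not a drop-in replacement:

// Sketch only: requires java.util.Map / java.util.HashMap imports,
// and assumes `table` is filled from the date file exactly as before,
// with lowercased document indexes as keys.
Map<String, String> table = new HashMap<>();
// ... fill `table` from the date file ...

String lineTerme;
while ((lineTerme = inTerme.readLine()) != null) {
    String[] tokens = lineTerme.split(", ");
    String terme = tokens[0];
    String freq = tokens[1];
    String idFich = tokens[2].toLowerCase();

    String date = table.get(idFich);  // O(1) lookup instead of scanning every key
    if (date != null) {
        writer.write(terme + ", " + freq + ", " + date);
        writer.newLine();
    }
}

Dropping the System.out.println on every matched line will also help noticeably; printing hundreds of millions of lines to the console is itself very slow.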

ToYonos
  • This is true, but it’s not enough. The code is doing a case-insensitive match. So the Hashtable needs to be built with keys that have been lowercased, and the key passed to `get` also needs to be lowercased. Also, a non-synchronized Map like HashMap will be faster. – VGR Oct 12 '16 at 14:25
  • So if I change this part, will it save much time? – Imen Majdoubi Oct 12 '16 at 14:27
  • @VGR: already done. Insertion: `idFichDate = dates[0].toLowerCase();` Reading: `idFich = tokens[2].toLowerCase();` I agree about the HashMap part. – ToYonos Oct 12 '16 at 14:28
  • @Imen Majdoubi: yes, a lot, but it will still take time. – ToYonos Oct 12 '16 at 14:29
  • I will change the HashMap part :) But can I run two executions at the same time in Eclipse? I don't want to stop the current execution and start it again with the modified code. – Imen Majdoubi Oct 12 '16 at 14:50
  • With the same set of files? No. Kill the other execution. – ToYonos Oct 12 '16 at 14:53

There are a few considerations.

  1. Does this absolutely have to be done in Java? If yes, can you sort the files before you start reading them?
  2. Do you have to run through the files in a single pass? (I highly doubt it.) If not, you should split the sorted files into parts and only run through a subset of entries in each part file.
  3. Are both files in excess of 1 TB? If not, you should start with the smaller file.

Given file1:

0,2012-05-26T00:00:00Z
1,2012-05-26T00:00:00Z
5,2010-06-26T00:00:00Z
10,2014-05-26T00:00:00Z

and file2:

was,15,1
kill,10,1
tunisia,5,5
peace,1,0

Here is an awk-based solution based on the inputs above. The first block (`FNR==NR`) loads file1 into array `a`, mapping index to date; the second block prints each file2 line with the looked-up date (or 0 when there is no match):

awk -F',' 'FNR==NR{a[$1]=$2;next}{if(a[$3]==""){a[$3]=0}; print $1,",",$2,",",a[$3]} ' file1 file2

Output:

was , 15 , 2012-05-26T00:00:00Z
kill , 10 , 2012-05-26T00:00:00Z
tunisia , 5 , 2010-06-26T00:00:00Z
peace , 1 , 2012-05-26T00:00:00Z

This answer was helpful for me to derive the above solution.

Mahesh
  • The file is sorted; I just gave you an example, but my file is sorted alphabetically. – Imen Majdoubi Oct 12 '16 at 14:42
  • Even if I split it, I will have to execute it on all the parts and then merge all of them, so what is the point? And I already start with the smaller file (//read file date). – Imen Majdoubi Oct 12 '16 at 14:45
  • And if you have another solution that will save me much time, even if it is not in Java, I will be thankful. – Imen Majdoubi Oct 12 '16 at 14:52
  • This will not solve the overall complexity of O(n*n)... large files will still be processed for days... – avianey Oct 12 '16 at 17:59
  • @Mahesh my input file is not the same as the one you used. I have a space before the last column (`was, 15, 1`), so your script is not working. Would you please give me a solution? – Imen Majdoubi Oct 12 '16 at 19:26

You should use a divide and conquer approach (https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms) with the following pseudo-algorithm:

If A and B are your two large files:

Open files A(0..n-1) for writing
Open file A for reading
  for each line in file A
    let modulo = key % n
    write the line to file A(modulo)
Open files B(0..n-1) for writing
Open file B for reading
  for each line in file B
    let modulo = key % n
    write the line to file B(modulo)
for i = 0..n-1
  Open file R(i) for writing
  Open files A(i) and B(i)
    merge those files into R(i) using key matching as you do now
Open file R for writing
for i = 0..n-1
  append R(i) to R

Try using n = 1024; if your keys are uniformly distributed, you will end up matching files of about 1 GB each.

You need free space on your disk (three times the size of A+B if you do not clean up the intermediate files).
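As an illustration, here is a rough Java sketch of the partitioning step only, assuming the key is the numeric document index (field 0 in the date file, field 2 in the term file); the input file names in main are hypothetical placeholders. Each A(i)/B(i) pair would then be merged with the matching code you already have:

import java.io.*;

public class Partitioner {

    // Split one input file into n bucket files by (document index % n).
    // fieldIndex is 0 for the date file and 2 for the term file.
    static void partition(String inputPath, String outputPrefix, int fieldIndex, int n) throws IOException {
        // All bucket writers stay open at once, so keep n below the OS open-file limit.
        BufferedWriter[] buckets = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            buckets[i] = new BufferedWriter(new FileWriter(outputPrefix + "_" + i));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Split on a comma with optional surrounding spaces,
                // since the sample data mixes ", " and "," separators.
                String[] fields = line.split("\\s*,\\s*");
                long key = Long.parseLong(fields[fieldIndex].trim());
                int bucket = (int) (key % n);
                buckets[bucket].write(line);
                buckets[bucket].newLine();
            }
        } finally {
            for (BufferedWriter w : buckets) {
                w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int n = 64; // the answer suggests up to 1024, limited by open file handles
        partition("fichierDates", "dates_bucket", 0, n);   // hypothetical input names
        partition("fichierTermes", "termes_bucket", 2, n);
    }
}

After partitioning, a line of A(i) can only ever match a line of B(i), so each merge works on roughly 1/n of the data and the lookup table for a single bucket fits comfortably in memory.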

avianey