I have a function that reads a tab-delimited file, puts each column in a list, and returns a list of lists with all the values from each column. This works fine for the small test file I used, with 1 column and 1850 rows, but I am now trying it with ~30k columns and it has been running for a few hours and still has not finished.

How can I modify the code below to make it faster? If reading a file of 30k rows with 1850 columns is faster, I can also transpose the input files.

public static List<List<String>> readTabDelimited(String filepath) {
    List<List<String>> allColumns = new ArrayList<List<String>>();
    try {
        BufferedReader buf = new BufferedReader(new FileReader(filepath));
        String lineJustFetched = null;
        for (;;) {
            lineJustFetched = buf.readLine();
            if (lineJustFetched == null) {
                break;
            }
            lineJustFetched = lineJustFetched.replace("\n", "").replace("\r", "");
            for (int i = 0; i < lineJustFetched.split("\t").length; i++) {
                try {
                    allColumns.get(i).add(lineJustFetched.split("\t")[i]);
                } catch (IndexOutOfBoundsException e) {
                    List<String> newColumn = new ArrayList<String>();
                    newColumn.add(lineJustFetched.split("\t")[i]);
                    allColumns.add(newColumn);
                }
            }
        }
        buf.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return allColumns;
}
Niek de Klein
  • How big is this file? More importantly, use a [CSV reader](http://opencsv.sourceforge.net/). – Boris the Spider Feb 21 '16 at 15:05
  • Why are you calling `split` three times? Just call it once, store the result and reuse it. – StepTNT Feb 21 '16 at 15:06
  • Do you need to store the full content of the file in memory? – Guillaume Feb 21 '16 at 15:06
  • Explaining what @StepTNT says: instead of splitting the line each time to get the count or the i-th item, add two new variables just after replacing \n and \r: `String[] lineParts = lineJustFetched.split("\t");` and `int count = lineParts.length;`. Then use these variables in the for loop and to read the values, rather than calling `split()` every time. – Yazan Feb 21 '16 at 15:29
  • @BoristheSpider the file is 913 megabytes, I will look into CSV reader. StepTNT and Yazan thanks for the tip, I changed the code to only do split once – Niek de Klein Feb 21 '16 at 15:39
  • @NiekdeKlein that's going to be **a lot** of data. I recommend avoiding storing the whole lot in memory; but if you have to, use a real CSV reader that's been optimised and, if you can, pre-size the `List`. Finally, use an actual object and map your data using bean mapping; using `List<List<String>>` is a definite sign of object phobia... – Boris the Spider Feb 21 '16 at 15:43
  • @BoristheSpider thanks for the tips – Niek de Klein Feb 21 '16 at 15:46

1 Answer

I'm not sure how often your catch block is triggered (I assume about 30k times, once per new column), but throwing and catching an exception is expensive and should not be used for normal control flow.

try {
    allColumns.get(i).add(lineJustFetched.split("\t")[i]);
} catch (IndexOutOfBoundsException e) {
    List<String> newColumn = new ArrayList<String>();
    newColumn.add(lineJustFetched.split("\t")[i]);
    allColumns.add(newColumn);
}

Your catch clause does different work from the try block, but I think a plain bounds check like

if (i<allColumns.size()){    }
else    {   }

would save you a lot of resources in the cases where the exception is currently thrown.

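Next, you should call lineJustFetched.split("\t") only once per line. I guess this is the main problem: the loop condition and the loop body each call it on every iteration, so a line with n columns is split roughly 2n times. Storing the result removes all of those repeated calls: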

String[] tempList = lineJustFetched.split("\t");
for (int i = 0; i < tempList.length; i++) {
    if (allColumns.size() > i) {
        allColumns.get(i).add(tempList[i]);
    } else {
        List<String> newColumn = new ArrayList<String>();
        newColumn.add(tempList[i]);
        allColumns.add(newColumn);
    }
}
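Putting both suggestions together, the whole method might look roughly like the sketch below. The class name `TabReader` and the method name `readColumns` are made up for illustration, and it takes a `Reader` instead of a file path so it can be tried on a short string; for a real file you would pass `new FileReader(filepath)`. It also uses try-with-resources so the reader is closed even if reading fails.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original question.
class TabReader {

    // Reads tab-delimited text and returns one list per column.
    static List<List<String>> readColumns(Reader source) {
        List<List<String>> allColumns = new ArrayList<>();
        try (BufferedReader buf = new BufferedReader(source)) {
            String line;
            while ((line = buf.readLine()) != null) {
                // readLine() already strips the line terminator,
                // so the replace("\n", "") / replace("\r", "") calls can go.
                // Note: split("\t") drops trailing empty fields;
                // split("\t", -1) would keep them.
                String[] parts = line.split("\t");
                for (int i = 0; i < parts.length; i++) {
                    // Bounds check instead of catching IndexOutOfBoundsException.
                    if (i >= allColumns.size()) {
                        allColumns.add(new ArrayList<>());
                    }
                    allColumns.get(i).add(parts[i]);
                }
            }
        } catch (IOException e) {
            // Rethrow instead of printing and returning partial data.
            throw new UncheckedIOException(e);
        }
        return allColumns;
    }

    public static void main(String[] args) {
        List<List<String>> cols = readColumns(new StringReader("a\tb\tc\n1\t2\t3\n"));
        System.out.println(cols); // prints [[a, 1], [b, 2], [c, 3]]
    }
}
```

This still keeps the whole file in memory, so for a 913 MB input the heap size may remain a limit regardless of how fast the parsing is.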

With the split-once loop, I reduced the computing time from 1153229 nanoseconds to 354714 nanoseconds (roughly 3 times faster) for the input String

String lineJustFetched=" 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 5";
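
A comparison like this can be reproduced with a rough timing harness along the following lines. This is only a sketch: the class and method names are invented here, the input line is a shortened stand-in, and `System.nanoTime()` loops like this are sensitive to JIT warm-up, so the numbers will differ from run to run and from the figures above. A benchmarking tool such as JMH would give more trustworthy results.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical timing harness, not the code used for the figures above.
class SplitTiming {

    // Mirrors the question's loop: split() runs in the condition and again per field.
    static List<String> splitEveryTime(String line) {
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < line.split("\t").length; i++) {
            fields.add(line.split("\t")[i]);
        }
        return fields;
    }

    // Mirrors the answer's loop: split() runs once and the array is reused.
    static List<String> splitOnce(String line) {
        String[] parts = line.split("\t");
        List<String> fields = new ArrayList<>(parts.length);
        for (String part : parts) {
            fields.add(part);
        }
        return fields;
    }

    // Times `rounds` calls of one variant and returns the elapsed nanoseconds.
    static long time(boolean once, String line, int rounds) {
        long checksum = 0;
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            checksum += (once ? splitOnce(line) : splitEveryTime(line)).size();
        }
        long elapsed = System.nanoTime() - start;
        // Use the checksum so the loop body cannot be optimised away entirely.
        if (checksum < 0) throw new IllegalStateException();
        return elapsed;
    }

    public static void main(String[] args) {
        String line = " 1 \t 2 \t 3 \t 1 \t 2 \t 3 \t 5";
        int rounds = 50_000;
        // Warm-up runs so both variants are likely compiled before measuring.
        time(false, line, rounds);
        time(true, line, rounds);
        System.out.println("split every time: " + time(false, line, rounds) + " ns");
        System.out.println("split once:       " + time(true, line, rounds) + " ns");
    }
}
```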
Jodn