I have a CSV file of nearly 2 million rows with 3 columns (item, rating, user). I am able to load the data into a 2D String array or list. However, my issue arises when I loop through the arrays to build the data for new CSV files: the application stops responding, and I do not know how long I should expect to wait for the program to finish.

Basically, my end goal is to parse a large CSV file and create a matrix in which each distinct item is a row and each distinct user is a column, with the rating at the intersection of the user and item. From this matrix, I then create a cosine similarity matrix whose rows and columns both represent items, with the cosine similarity of two distinct items at their intersection.

I already know how to create CSV files, but my issue falls within the large loop structures when creating other arrays for the purposes of comparison.

Is there a better way to be able to process and calculate large amounts of data so that my application doesn't freeze?

My current program does the following:

  1. Take large CSV file
  2. Parse through large CSV file
  3. Create 2D array resembling original CSV file
  4. Create list of distinct items (each distinct item being represented by an index number)
  5. Create list of distinct users (each distinct user being represented by an index number)
  6. Create 2D array with row indexes representing items and column indexes representing users, resulting in array[row][column] = rating
  7. Calculate cosine similarity of two matrices
  8. Create 2D array with both row and column indexes representing items, resulting in array[row][column] = cosine similarity

I noticed that my program freezes when it reaches steps 4 and 5. If I remove steps 4 and 5, it still freezes at step 6.

I have attached that portion of my code

      FileInputStream stream = null;
      Scanner scanner = null;

      try{
         stream = new FileInputStream(fileName);
         scanner = new Scanner(stream, "UTF-8");
         while (scanner.hasNextLine()){
             String line = scanner.nextLine();
             if (!line.equals("")){
                String[] elems = line.split(",");
                if (itemList.isEmpty()){
                  itemList.add(elems[0]);
                }
                else{
                  if (!itemList.contains(elems[0]))
                     itemList.add(elems[0]);
                }
                if (nameList.isEmpty()){
                  nameList.add(elems[2]);
                }
                else{
                  if (!nameList.contains(elems[2]))
                     nameList.add(elems[2]);
                }
                for (int i = 0; i < elems.length; i++){
                   if (i == 1){
                     if (elems[1].equals("")){
                        list.add("0");
                      }
                      else{
                        list.add(elems[1]);
                      }
                   }
                   else{
                     list.add(elems[i]);
                   }
                }
             }
         } 
         if (scanner.ioException() != null){
            throw scanner.ioException();
         }
      }
      catch (IOException e){
         System.out.println(e);
      }
      finally{
         try{
            if (stream != null){
               stream.close();
            }
         }
         catch (IOException e){
            System.out.println(e);
         }
         if (scanner != null){
            scanner.close();
         }
      }
Nevin
mm321
  • Please add a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example). – samabcde Oct 02 '19 at 05:23
  • IMHO you have a memory issue here. The garbage collector tries to free memory to keep going but cannot reclaim any. So increase your heap space using the -Xmx Java parameter, or lower the memory footprint of your data structure. Another possibility would be to put your CSV into a database (H2, Derby, or the big ones) and do your data queries there. – wumpz Oct 02 '19 at 07:16

1 Answer

You can try setting -Xms and -Xmx. If you're using default values, it's possible you just need more memory allocated to the JVM.
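To check whether the settings actually took effect, a small program can print the heap limits the JVM received. This is only a minimal sketch: the class name `HeapCheck` and the sizes in the launch command are placeholders, not values recommended for this data set.

```java
// Prints the heap limits of the running JVM.
// Example launch (sizes are placeholders): java -Xms512m -Xmx4g HeapCheck
class HeapCheck {
    // Converts a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the heap may grow to (governed by -Xmx).
        System.out.println("max heap (MB):   " + toMegabytes(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved for the heap.
        System.out.println("total heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory() is the unused part of the currently reserved heap.
        System.out.println("free heap (MB):  " + toMegabytes(rt.freeMemory()));
    }
}
```

If the reported maximum is far below what the data needs, raising -Xmx is the first thing to try.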

In addition to that, you could modify your code so it doesn't treat everything as String. For the score column (which is presumably numeric), you should be able to parse that as a numeric value and store that instead of the string representation. Why? Strings use a lot more memory than numeric values. Even an empty string uses around 40 bytes (the exact figure depends on the JVM), whereas a single numeric value can be as little as one byte.

If a single byte could work (numeric range is -128 to 127), then for roughly 2 million rows you could replace ~80MB of memory usage (2 million × 40 bytes) with ~2MB (2 million × 1 byte). Even using int (4 bytes) would be a huge improvement over String. If there are any other numeric (or boolean) values present in the data, you could make further reductions.
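As a rough sketch of what that could look like, the class below stores each row as two int indexes plus one byte instead of three Strings. It assumes ratings are whole numbers that fit in a byte (a float[] would be needed for decimal ratings), and the class and field names are hypothetical. It also goes one step beyond the point above: the distinct items and users are tracked in a HashMap rather than with List.contains, which scans the whole list on every row.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Compact in-memory form of the CSV: one (item index, user index, rating) triple per row.
class CompactRatings {
    final Map<String, Integer> itemIndex = new HashMap<>(); // item name -> row index
    final Map<String, Integer> userIndex = new HashMap<>(); // user name -> column index
    int[] items = new int[1024];
    int[] users = new int[1024];
    byte[] ratings = new byte[1024]; // assumption: ratings are integers in -128..127
    int size = 0;

    // Returns the index already assigned to key, or assigns the next free one.
    private static int indexOf(Map<String, Integer> index, String key) {
        Integer id = index.get(key);
        if (id == null) {
            id = index.size();
            index.put(key, id);
        }
        return id;
    }

    // Parses one "item,rating,user" line; blank or short lines are skipped.
    void add(String line) {
        if (line.isEmpty()) return;
        String[] elems = line.split(",", -1); // -1 keeps empty fields
        if (elems.length < 3) return;
        if (size == ratings.length) { // grow all three arrays together
            items = Arrays.copyOf(items, size * 2);
            users = Arrays.copyOf(users, size * 2);
            ratings = Arrays.copyOf(ratings, size * 2);
        }
        String rating = elems[1].trim();
        items[size] = indexOf(itemIndex, elems[0]);
        users[size] = indexOf(userIndex, elems[2]);
        // A missing rating is stored as 0, as in the question's code.
        ratings[size] = rating.isEmpty() ? 0 : Byte.parseByte(rating);
        size++;
    }

    // Reads every line from the reader; the caller opens and closes the file.
    void load(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            add(line);
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory sample standing in for the real file.
        CompactRatings data = new CompactRatings();
        data.load(new BufferedReader(new StringReader("a,5,u1\nb,,u1\na,3,u2\n")));
        System.out.println(data.size + " rows, " + data.itemIndex.size()
                + " items, " + data.userIndex.size() + " users");
    }
}
```

For the real file, the reader would wrap the file instead of the sample string. Note that a dense items-by-users array may still be too large if there are many distinct items and users, so it can help to keep only the ratings that exist, as this triple layout does.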

Kaan