0

I need to parse a CSV file at work. Each line in the file is not very long, only a few hundred characters. I used the following code to read the file into memory:

def lines = []
new File( fileName ).eachLine { line -> lines.add( line ) }

When the number of lines is 10,000, the code works just fine. However, when I increase the number of lines to 100,000, I get this error:

java.lang.OutOfMemoryError: Java heap space

For 10,000 lines, the file size is about 7 MB, and ~70 MB for 100,000 lines. So, how would you solve this problem? I know increasing the heap size is a workaround, but are there any other solutions? Thank you in advance.

JBT
  • 3
    You could process each line instead of saving all the lines in memory; or do you really need them all in memory for some reason? – gurbieta Aug 26 '13 at 18:13
  • I am not familiar with Groovy, but is `new File( fileName ).eachLine { line -> lines.add( line ) }` creating a new object every time it reads a line from the CSV file? I personally did the same thing with Python and never got any error. – Prateek Aug 26 '13 at 18:26
  • 1
    You are adding every line to a list in memory. The list `lines` is getting fat and causing OOM – Will Aug 26 '13 at 18:40
  • What are you doing with the lines array/list you define on line one? What problem are you solving by having each line stored in an array? – Brian Aug 26 '13 at 18:42
  • @Brian: It is a CSV file. Each line contains multiple fields. Basically I need to break a line into those fields. – JBT Aug 26 '13 at 20:03
  • @JBT so you don't need all the lines in memory. You can process one line at a time, insert it into the database, and move on to the next line. Get rid of the `lines.add()` and just implement the logic you want to perform on each line in the closure. – Brian Aug 29 '13 at 14:17
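For what it's worth, the streaming approach these comments describe looks like this in Groovy. This is a self-contained sketch: the sample file and the counter are stand-ins for the real file and whatever per-line work you need to do.

```groovy
// Self-contained sketch: build a tiny sample CSV, then process it
// one line at a time. Only the current line's fields are in memory,
// so heap use stays flat no matter how many lines the file has.
def csv = File.createTempFile("sample", ".csv")
csv.text = "1,Ada,Lovelace\n2,Alan,Turing\n"

def count = 0
csv.splitEachLine(",") { fields ->
    // replace this with the real per-line work (e.g. a database insert)
    assert fields.size() == 3
    count++
}
assert count == 2
csv.delete()
```

Because `splitEachLine` hands the closure one line's fields at a time and discards them afterwards, the heap never has to hold the whole file.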

2 Answers

1
def lines = []

In Groovy, this creates an `ArrayList<E>` with size 0 and no preallocation of its internal `Object[]`.

When you add items and capacity is reached, a larger backing array is allocated and the old contents are copied into it. The larger the list, the more time is spent copying to accommodate new entries, and during each copy both the old and the new array are live on the heap. I suspect that's where your memory issue occurs; although I'm not exactly sure how `ArrayList` sizes the new array, if you're getting OOM for a relatively small data set, that's where I'd look first. Starting from an empty `ArrayList` and assuming a growth factor of 1.5, you reallocate the backing array roughly 23 times on the way to 100,000 entries.
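The reallocation count can be sanity-checked with a quick simulation, assuming `java.util.ArrayList`'s default initial capacity of 10 and its grow-by-half rule:

```groovy
int capacity = 10               // ArrayList's default initial capacity
int resizes  = 0
while (capacity < 100_000) {
    capacity += capacity >> 1   // grow by ~1.5x, as java.util.ArrayList does
    resizes++
}
println resizes                 // 23 reallocations (and copies) to reach 100,000
```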

If you have a general idea of how large the list needs to be, just set the initial capacity; doing so avoids all that reallocation nonsense. See if this works:

def lines = new ArrayList<String>(100000)
raffian
0

Assuming you are trying to load the CSV file into a database, you can do something like this. The key Groovy feature is `splitEachLine(yourDelimiter)` together with the `fields` array passed to the closure.

import groovy.sql.*

def sql = Sql.newInstance("jdbc:oracle:thin:@localhost:1521:ORCL",
    "scott", "tiger", "oracle.jdbc.driver.OracleDriver")

// define a variable that maps to a table definition (a JDBC DataSet)
def student = sql.dataSet("TEMP_DATA")
// now iterate over the csv file, splitting each line on commas, and load it into the table
new File("C:/temp/file.csv").splitEachLine(",") { fields ->
    // insert each column we have into the temp table
    student.add(
        STUDENT_ID: fields[0],
        FIRST_NAME: fields[1],
        LAST_NAME: fields[2]
    )
}
// yes, the magic has happened: the data is now in the staging table TEMP_DATA
println "Number of Records  " + sql.firstRow("select count(*) from TEMP_DATA")
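For a large file, the same load can also be grouped into JDBC batches so the driver sends many inserts per round trip. A hedged sketch using `groovy.sql.Sql.withBatch`, reusing the `sql` connection, table, and column names from the answer above (not run against a real database here):

```groovy
// Same load as above, but grouped into JDBC batches of 500 inserts.
// withBatch(int, String) prepares the statement once and reuses it,
// flushing a batch to the database every 500 rows.
sql.withBatch(500,
        "insert into TEMP_DATA (STUDENT_ID, FIRST_NAME, LAST_NAME) values (?, ?, ?)") { ps ->
    new File("C:/temp/file.csv").splitEachLine(",") { fields ->
        ps.addBatch([fields[0], fields[1], fields[2]])
    }
}
```

As with the answer's version, only one line's fields are in memory at a time, so this also sidesteps the original OOM.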
Brian