
I am trying to read a .csv file with 20k+ lines, and each line has ~300 fields.

I am using my own code to read it line by line, splitting each line into fields and converting the fields to their corresponding data types (integer, double, etc.). These values are then passed to class objects via their constructors.
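A simplified sketch of that flow, with a hypothetical two-field `Record` class standing in for my real one (which has ~300 fields):

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the real class.
struct Record {
    int id;
    double value;
    Record(int id_, double value_) : id(id_), value(value_) {}
};

std::vector<Record> readCsv(const std::string& path)
{
    std::vector<Record> records;
    std::ifstream in(path);
    std::string line;
    while (std::getline(in, line)) {           // read one line at a time
        std::istringstream ss(line);
        std::string field;
        std::getline(ss, field, ',');          // split off each field
        int id = std::stoi(field);             // convert to its type
        std::getline(ss, field, ',');
        double value = std::stod(field);
        records.emplace_back(id, value);       // construct the object
    }
    return records;
}
```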

However, I found it is not very efficient: it takes about 1 minute to read the 20k+ lines and create the 20k+ objects.

I've googled for fast CSV parsers and found there are many options. I've tried some of them, but wasn't satisfied with their time performance.

Does anyone have a better method to read large .csv files? Many thanks in advance.

ChangeMyName
  • If you post the code for what you're doing, we can look for optimizations. – mark Aug 14 '13 at 14:55
  • What are your requirements? "CSV" is rather ambiguous. I once had the same problem with my own parser. It handled quoted strings, escaped quotes and text cells with newlines, and was compatible with Excel's understanding of CSV. Another performance hit was using standard C++ `stringstream`s for conversions from VS 2008, which are pretty slow and introduce a global lock. Did you take a look at the proposals in http://stackoverflow.com/questions/1120140/csv-parser-in-c ? – mkluwe Aug 14 '13 at 15:20
  • How do you know that it is _parsing_ that is taking all the time and not, for example, the construction of the 20k+ objects? – dhavenith Aug 14 '13 at 15:27
  • @dhavenith I ran a test, which shows that the reading process took 90% of the total computational time. – ChangeMyName Aug 14 '13 at 16:20
  • If 90% of the processing time is spent on reading, then you need to examine which I/O package you are using and why it is slow. If the problem is getting the data off disk and into memory, then you need better disks (or local disk instead of network-mounted disk). If the problem is slowness of parsing the input after it is off disk, then you need to look at the I/O library and the layers you're using. – Jonathan Leffler Aug 14 '13 at 16:24

1 Answer


An efficient method for parsing (or, for that matter, any processing) of files is to read as much of the file into memory as you can before you start parsing.

File I/O has been, since the dawn of computers, one of the slower parts of a computer system. For example, parsing your data may take 1 microsecond. Reading the data from a hard drive may take 1 millisecond == 1000 microseconds.

I've made programs faster by allocating a large buffer for the data, then reading the data into it. Next I process the data in the buffer and repeat until the entire file has been processed.
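A minimal sketch of that idea, assuming the whole file fits comfortably in memory (error handling omitted):

```cpp
#include <fstream>
#include <string>

// Slurp the entire file into one buffer, then parse from memory.
std::string slurp(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    in.seekg(0, std::ios::end);                    // find the file size
    std::string buffer(static_cast<std::size_t>(in.tellg()), '\0');
    in.seekg(0, std::ios::beg);
    in.read(&buffer[0], static_cast<std::streamsize>(buffer.size()));
    return buffer;   // parse fields directly out of this buffer
}
```

Once the data is in a single buffer, you can also scan fields with pointer arithmetic and `strtol`/`strtod` instead of `stringstream`s, which avoids the per-field stream overhead mentioned in the comments.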

Another technique is called memory mapping, where the OS handles reading the file into memory as needed.
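A minimal POSIX sketch (Windows has an equivalent in `CreateFileMapping`/`MapViewOfFile`); `data.csv` is a placeholder name and error handling is omitted:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("data.csv", O_RDONLY);
    struct stat st;
    fstat(fd, &st);                                // get the file size
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    // The OS pages the file in on demand; parse `data` as one big
    // in-memory string of st.st_size bytes.
    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}
```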

Please edit your post to show the code where the bottleneck is.

Thomas Matthews