4

I have to parse a huge list (hundreds) of big .csv files (>1 GB each) to extract slices that match given criteria. The criteria might change over time, so the process should be reproducible.

I'm considering three different approaches:

  • Good old cat|grep. A last resort, and not feasible over time if I want to automate the whole process.
  • Load and iterate over each file with pandas' CSV reading functions, keeping only the matching rows in a new CSV file (see the sketch after this list). Example
  • Import every row into a database and query on demand; preferably SQLite, but it could be MS SQL. Example
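
For reference, a minimal sketch of what I have in mind for the pandas approach, with a hypothetical column value and a simple threshold standing in for the real criteria:

    import pandas as pd

    # Stream each big CSV in chunks and keep only the rows matching the
    # (hypothetical) criterion value > 100, appending them to one output file.
    INPUT_FILES = ["data_001.csv", "data_002.csv"]  # placeholder file names
    OUTPUT = "filtered.csv"

    write_header = True
    for path in INPUT_FILES:
        for chunk in pd.read_csv(path, chunksize=200_000):
            matched = chunk[chunk["value"] > 100]  # the criterion lives here
            if not matched.empty:
                matched.to_csv(OUTPUT, mode="a", header=write_header, index=False)
                write_header = False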

Since there's no way to avoid reading the files row by row, which of those three methods is best in terms of performance? Is there any better option?

JMartinez
  • 93
  • 2
  • 10

4 Answers

4

After taking some measurements, I decided to use a mixed approach:

Facts:

  • Using the same hardware and data set, pandas is about 20% faster at filtering big files than the sqlite3 and MS SQL database engines are at querying the same data, and the gap widens quickly as the data set grows.

  • On the other hand, storing the output in a database is about 50% faster and requires less space on disk.

So I ended up using pandas' read_csv to process the input (manually adjusting the chunk size until I found the best size/performance ratio, around 200k lines per chunk) and SQLAlchemy to store only the relevant data in sqlite3. Future refinements can then be performed either with queries to the database or with pandas again.
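
A rough sketch of that pipeline (the column name value and the threshold are placeholders for the real filtering criteria):

    import pandas as pd
    from sqlalchemy import create_engine

    # Read the input in ~200k-line chunks and append only the relevant rows
    # to a SQLite table via SQLAlchemy.
    engine = create_engine("sqlite:///slices.db")

    for chunk in pd.read_csv("big_file.csv", chunksize=200_000):
        relevant = chunk[chunk["value"] > 100]  # placeholder criterion
        relevant.to_sql("slices", engine, if_exists="append", index=False)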

JMartinez
  • 93
  • 2
  • 10
3

In the beginning, choose whichever approach gets you to a result fastest.

Based on your question I assume you're familiar with pandas; if that's the case you'll be able to implement it quickly, and it will also perform well.

Building a long-term solution means balancing the need to reprocess the raw data (how often does that happen?) against the need to query the 'output', and considering what proportion of the results ends up in that output. It makes sense to extract your slices and store the result for re-use, but only if that re-use actually happens. If there is little re-use of those slices, and the criteria change frequently, then you may simply want to reprocess every time you query the data.

Kirk Broadhurst
  • 27,836
  • 16
  • 104
  • 169
  • Spot on! Actually, I'm not sure how many times I will have to re-process the set before I reach the conclusions of my PhD. What I know for sure is that I won't have more incoming data, so for as long as the project continues I will have the same data set. I'm also considering filtering the data with pandas and storing only the relevant sets in a database. Thank you for your advice! – JMartinez Mar 28 '18 at 20:13
0

Using a SQL database will "slow" your queries, since the data does not reside in RAM, but it will be more scalable as the data grows. It will also allow you to host the data on a different computer, so you don't have to share computing power and can use cloud services.

Narshe
  • 427
  • 8
  • 22
0

Since there's no way to avoid reading the files row-by-row

Although that is true at some level, it may not be necessary for your code to process the source files RBAR (row by agonizing row). You could conceivably use something like SQL Server's bcp utility (or a SQLite equivalent) to dump the rows into a database and then extract the "slices" you need into pandas for analysis.
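
A rough sketch of that idea with SQLite, using placeholder table and column names (rows, value):

    import sqlite3
    import pandas as pd

    # Bulk-load step (done once, outside Python), roughly the SQLite analogue
    # of SQL Server's bcp:
    #   sqlite3 big.db ".mode csv" ".import data_001.csv rows"
    #
    # Then pull only the slice you need into pandas:
    conn = sqlite3.connect("big.db")
    slice_df = pd.read_sql_query("SELECT * FROM rows WHERE value > 100", conn)
    conn.close()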

Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • Thank you so much for your idea. Unfortunately, query performance drops when the size of the database reaches some threshold (probably the available RAM). See my own answer for more details. Kind regards! – JMartinez Mar 29 '18 at 16:39