I have to parse a huge list (hundreds) of big .csv files (>1 GB each) to extract slices matching certain criteria. The criteria might change over time, so the process should be reproducible.
I'm considering three different approaches:
- Good old `cat | grep`. A last resort, and not feasible over time if I want to automate the whole process.
- Load and iterate over each file with pandas' CSV reading functions, keeping only the matching rows in a new CSV file (see the pandas sketch after this list).
- Import every row into a database and query it on demand, preferably SQLite, but it could be MS SQL (see the SQLite sketch after this list).
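
A minimal sketch of the pandas approach, assuming hypothetical file names and a hypothetical criterion on a `status` column; reading in chunks means a >1 GB file never has to fit in memory:

```python
import pandas as pd

INPUT_FILE = "big_file.csv"    # hypothetical input path
OUTPUT_FILE = "filtered.csv"   # hypothetical output path

def matches(chunk):
    # Hypothetical criterion: keep rows where "status" equals "ERROR".
    return chunk[chunk["status"] == "ERROR"]

first = True
for chunk in pd.read_csv(INPUT_FILE, chunksize=100_000):
    filtered = matches(chunk)
    # Append the filtered rows, writing the header only once.
    filtered.to_csv(OUTPUT_FILE, mode="w" if first else "a",
                    header=first, index=False)
    first = False
```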
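
And a minimal sketch of the SQLite approach, again with hypothetical names and criterion; the filter lives as a SQL query, which keeps it reproducible when the criteria change:

```python
import sqlite3
import pandas as pd

DB_PATH = "slices.db"          # hypothetical database file
INPUT_FILE = "big_file.csv"    # hypothetical input path

conn = sqlite3.connect(DB_PATH)

# Load the CSV in chunks and append each chunk to a single table.
for chunk in pd.read_csv(INPUT_FILE, chunksize=100_000):
    chunk.to_sql("rows", conn, if_exists="append", index=False)

# The criteria are then just SQL (hypothetical column and value).
result = pd.read_sql_query(
    "SELECT * FROM rows WHERE status = 'ERROR'", conn
)
result.to_csv("filtered.csv", index=False)
conn.close()
```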
Since there's no way to avoid reading the files row by row, which of these three methods is best in terms of performance? Is there a better option?