
I have large CSV files, each containing more than 315 million rows and a single column. I have to process more than 50 such files at a time to get the results.

When I read more than 10 of them with a csv reader, it takes more than 12 GB of RAM and is painfully slow. I could read only a chunk of each file at a time to save memory, but then I would spend more time reading, as the whole file gets read every time.
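For reference, the chunk-at-a-time pattern I mean is roughly this (the chunk size, filename, and per-chunk work are simplified placeholders):

```python
import csv

def read_chunks(path, chunk_size=1_000_000):
    """Yield lists of floats from a single-column CSV, one chunk at a time."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        chunk = []
        for row in reader:
            chunk.append(float(row[0]))
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

for chunk in read_chunks("file1.csv"):  # "file1.csv" is a placeholder name
    pass  # multiply/add over the chunk, store results, move to the next chunk
```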

I have thought about loading the files into a database and querying the data from there. However, I am not sure if this approach would help in any way. Can anyone tell me the most efficient way of handling such scenarios in Python?

Noman Bashir
  • Depends what type of manipulation you are doing with them. Database is one option, EC2 is another, CLI is a third, but then this all depends on what you are doing. – gold_cy Mar 27 '17 at 14:37
  • I take the values, perform simple multiplication and addition operations, store the results, and move forward to the next chunk of the file. – Noman Bashir Mar 27 '17 at 15:11
  • Your problem is underspecified. How do the files relate to each other? Why do they need to be processed in parallel rather than sequentially? How many rows do you need in memory at a time to do a calculation? (For example, a cumulative sum would only need one row at a time.) – Steven Rumbalski Mar 27 '17 at 15:32
  • I'm voting to close this as "unclear what you are asking" because you have not responded to clarify your question. – Steven Rumbalski Mar 27 '17 at 17:20

2 Answers


You will find a solution here: Lazy Method for Reading Big File in Python?
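In short, the linked answer's idea is a generator that yields the file piece by piece, so only one piece is ever held in memory. A minimal sketch (the filename and block size are placeholders):

```python
def read_in_blocks(file_object, block_size=1024 * 1024):
    """Generator: lazily yield the file in fixed-size blocks."""
    while True:
        data = file_object.read(block_size)
        if not data:
            break
        yield data

with open("big_file.csv") as f:      # placeholder filename
    for block in read_in_blocks(f):
        pass                         # process each block here
```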

Additionally, if you have a longer processing pipeline, you can look into Section 4.13, Creating Data Processing Pipelines, in the book Python Cookbook, 3rd edition, by Beazley and Jones.
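The recipe's approach, roughly, is to chain generators so each stage pulls rows lazily from the one before it. A sketch for your single-column case (the glob pattern and the multiply/add step are assumptions):

```python
import glob

def gen_lines(paths):
    """Yield lines from each file without loading any file fully."""
    for path in paths:
        with open(path) as f:
            yield from f

def gen_values(lines):
    """Parse each single-column line into a float."""
    for line in lines:
        yield float(line)

# Chain the stages; nothing is read until the final loop pulls values.
paths = glob.iglob("data/*.csv")         # assumed file layout
values = gen_values(gen_lines(paths))
result = sum(v * 2 + 1 for v in values)  # placeholder multiply/add
```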

Elan

Check out ETLyte, a tool I've just open-sourced. It's .NET, but you can call out to the EXE from Python. It's still a work in progress, but I think it'll work for your situation.

With ETLyte, here would be the steps:

  1. Place the files in the Flatfiles folder, or whichever folder you specify in config.json.
  2. Describe them with a JSON schema and place it in the Schemas folder, or whichever folder you specify. (Note: since they all have the same schema [you said it's all a single column], just change the flatfile field in the schema to a regex that matches all of your files.)
  3. When it comes to performing the addition/multiplication, create derived columns that perform that calculation.
  4. Run ETLyteExe.exe and let the data flow in, as sketched below.
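Since ETLyte is .NET, the call-out from Python can just be a subprocess invocation. A sketch (the install path is an assumption, and I'm not showing any ETLyte-specific flags):

```python
import subprocess

# Paths are illustrative; point cwd at the folder holding config.json.
result = subprocess.run(
    [r"C:\tools\ETLyte\ETLyteExe.exe"],
    cwd=r"C:\tools\ETLyte",
    capture_output=True,
    text=True,
)
print(result.stdout)
result.check_returncode()  # raises CalledProcessError if the run failed
```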

ETLyte is just getting started, but it has a lot of features and a lot more on the roadmap. It also comes with an interactive REPL with word completion that wraps the SQLite DLL, so you can interrogate the data without installing sqlite3. For an overview of the tool, look here.

sorrell