I'm working with a very large CSV (700 GB). As part of processing this file, I'd like to break it into smaller chunks and process each chunk individually.
My primary goal is to avoid reading the entire file upfront! I want the processing to happen gradually as I move through the file, so I don't have to wait for the entire 700 GB to be read (and loaded into memory) before I can start processing the data.
If I'm understanding it right, and based on my testing, both `Get-Content | Select -Skip` and `StreamReader.ReadLine()` require iterating over every row of the file, starting from the beginning (rough sketches of both patterns follow below). So if I were breaking a 10-row CSV into 3-row chunks, PowerShell reads the rows like this:
          | PASS 1 | PASS 2
----------+--------+----------
Line 1    | READ   | CONTINUE
Line 2    | READ   | CONTINUE
Line 3    | READ   | CONTINUE
Line 4    | BREAK  | READ
Line 5    | --     | READ
Line 6    | --     | READ
Line 7    | --     | BREAK
Line 8    | --     | --
Line 9    | --     | --
Line 10   | --     | --
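For reference, this is roughly the `Get-Content | Select -Skip` pattern I've been testing; the path and variable names below are just placeholders, and the chunk size is shrunk to match the 10-row example:

```powershell
# Placeholder values -- the real file is ~700 GB and the chunks are far larger.
$path       = 'C:\data\huge.csv'
$chunkSize  = 3
$chunkIndex = 1   # second chunk, i.e. PASS 2 in the table above

# Every pass starts reading at line 1; -Skip just discards the rows that were
# already handled in earlier passes, which is where all the CONTINUEs come from.
$chunk = Get-Content -Path $path |
    Select-Object -Skip ($chunkIndex * $chunkSize) -First $chunkSize

# ...process $chunk...
```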
Because of this, as you get into the hundreds of millions of rows, the runtime is significantly impacted by all of those CONTINUEs. It would be far more efficient if I could pop a few hundred thousand rows off the top of the CSV and remove them from the original file, just as you would pop elements off the front of an array. Is this possible?
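For completeness, the `StreamReader.ReadLine()` version I tested pays the same cost, because every skipped row still takes one `ReadLine()` call (again, the path and variable names are placeholders):

```powershell
$path       = 'C:\data\huge.csv'
$chunkSize  = 3
$chunkIndex = 1

$reader = [System.IO.StreamReader]::new($path)
try {
    # Skip the rows handled in earlier passes -- one ReadLine() per skipped row.
    for ($i = 0; $i -lt ($chunkIndex * $chunkSize) -and -not $reader.EndOfStream; $i++) {
        $null = $reader.ReadLine()
    }

    # Read only the rows that belong to this chunk.
    for ($i = 0; $i -lt $chunkSize -and -not $reader.EndOfStream; $i++) {
        $line = $reader.ReadLine()
        # ...process $line...
    }
}
finally {
    $reader.Dispose()
}
```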
What I do not want to do:
- I don't want to read through the entire CSV and split off chunks as I go, giving me one single pass through the rows. That's a huge upfront cost in terms of runtime, but also in terms of storage space. That's 1.4 TB of data before I even process it!
- I don't want to write a copy of the CSV to another file minus the first 100,000 rows; doing that for every chunk would mean re-reading (and rewriting) the entire remaining file many times over.