
I'm working with a very large CSV (700 GB). As a part of processing this file, I'd like to break it into smaller chunks and process each chunk individually.

My primary goal is to avoid reading the entire file upfront! I want processing to happen gradually as I work through the file, rather than waiting for the entire 700 GB file to be read (and loaded into memory) before I can start processing the data.

If I'm understanding it right, and based on my testing, both `Get-Content | Select -Skip` and `StreamReader.ReadLine()` require iterating over each row of the file, starting from the beginning. So if I were breaking a 10-row CSV into 3-row chunks, PowerShell reads the rows like this:

        PASS 1  |PASS 2
----------------+----------------
Line 1  READ    |CONTINUE
Line 2  READ    |CONTINUE
Line 3  READ    |CONTINUE
Line 4  BREAK   |READ
Line 5  --      |READ
Line 6  --      |READ
Line 7  --      |BREAK
Line 8  --      |--
Line 9  --      |--
Line 10 --      |--

Because of this, as you get into hundreds of millions of rows, the runtime is significantly impacted by all the CONTINUEs (a rough sketch of this skip-based pattern is below). It would be much more efficient if I could pop a few hundred thousand rows off the top of the CSV and remove them from the original file, just as you would pop elements off the front of an array. Is this possible?
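
For context, here is roughly the kind of skip-based pattern I mean (path and chunk size are made up for illustration); every pass has to re-read and skip all the rows that earlier passes already consumed:

```powershell
# Made-up path and chunk size, for illustration only.
$path      = 'C:\data\huge.csv'
$chunkSize = 100000

$chunkIndex = 0
while ($true) {
    # Each pass starts reading at line 1 again and skips everything already processed,
    # so pass N skips roughly (N * $chunkSize) rows before any useful work begins.
    $chunk = Get-Content -Path $path |
        Select-Object -Skip ($chunkIndex * $chunkSize) -First $chunkSize
    if (-not $chunk) { break }

    # ... process $chunk here ...

    $chunkIndex++
}
```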

What I do not want to do:

  • I don't want to read through the entire CSV and split off chunk files as I go, even though that would give me a single pass through the rows. That's a huge upfront cost in terms of runtime, but also in terms of storage space. That's 1.4 TB of data before I even process it!
  • I don't want to repeatedly write a copy of the CSV to another file minus the first 100,000 rows; every removal would mean reading (and rewriting) the rest of the file, over and over.
  • Here you have a function to chunk your CSV into smaller pieces: https://stackoverflow.com/a/73547526/15339544 – Santiago Squarzon Oct 11 '22 at 18:12
  • I don't really understand your requirement *"I don't want to read through the entire CSV"*. If you want to split a file into *n* smaller files, then you *have* to read through the entire input file. There is no way to break up a file without reading it first. – zett42 Oct 11 '22 at 18:33
  • Using the same logic as in that answer, you can process each chunk in memory instead of writing it to a chunk file. In any case, your question has many requirements and very little code. – Santiago Squarzon Oct 11 '22 at 18:34
    _"pop off a few hundred thousand rows from the top of the CSV and remove them from the original CSV"_ -- no, this isn't possible without rewriting the original file. You can't just set a new "start pointer" for a file. The only fast operations that affect the file size are at the end of the file (e. g. trimming a file at the end is O(1) complexity). – zett42 Oct 11 '22 at 18:43
  • It would be easier to help you if you describe the _processing_ that you want to do with the file. You seem to have already settled on an attempted solution, while there might be an easier way that doesn't require to split the input file. – zett42 Oct 11 '22 at 18:45
  • @SantiagoSquarzon Thank you for the CSV splitting code, but I only have one requirement for my question: **Can you remove the first X lines from a CSV without reading the entire file?** (e.g. "Is there a way to use a similar construction to StreamReader, StreamWriter, etc. to *remove* data, not just add it?") I do think processing the data in memory rather than writing a file is going to help, though! – Inflorescence Oct 11 '22 at 19:34
  • @zett42 I honestly think that this might be a complete enough answer for me. If you can't delete the beginning of a file, then the answer is simply that there's no way to get around reading the entire file upfront. And if I'm going to do that anyway, I might as well do all of the splitting while I'm at it, a la the code in Santiago's link. – Inflorescence Oct 11 '22 at 19:38
  • __Can you remove the first X lines from a CSV without reading the entire file?__ No. If you want to do this, you must read the entire contents of the file, excluding the first X lines. StreamReader makes it memory friendly, though, because the file is read line by line instead of being loaded into memory as a whole. You can, however, read X lines with the StreamReader, run your processing logic, and then continue with the next X lines for the next chunk, all in memory (a sketch of that single-pass approach follows these comments). My function is meant to split the CSV into files, though it can also be used to read the file in chunks in memory. – Santiago Squarzon Oct 11 '22 at 19:41
  • "*...That's a huge upfront cost in terms of runtime, but also in terms of storage space. That's 1.4 TB of data before I even process it!*" is a wrong assumption if you correctly use the [PowerShell PipelIne](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_pipelines), it shouldn't matter. See also: [Read CSV sheet in batches using Powershell](https://stackoverflow.com/a/73774733/1701026) – iRon Oct 12 '22 at 07:29

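And the pipeline approach iRon points to could be sketched like this (again, the path, batch size, and processing steps are placeholders); `Import-Csv` emits one row at a time, so only the current batch is ever held in memory:

```powershell
# Hypothetical path and batch size; the commented lines mark where your processing would go.
$path      = 'C:\data\huge.csv'
$batchSize = 100000
$batch     = [System.Collections.Generic.List[object]]::new($batchSize)

Import-Csv -Path $path | ForEach-Object -Process {
    $batch.Add($_)                      # rows stream in one at a time
    if ($batch.Count -eq $batchSize) {
        # ... process the current batch of $batchSize rows here ...
        $batch.Clear()
    }
} -End {
    if ($batch.Count -gt 0) {
        # ... process the final, partial batch here ...
    }
}
```
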
0 Answers