
Update: Thanks to Keith Miller's suggestion below, I was able to use StreamReader.ReadBlock() to pull out a block of characters, then search through that string for the last row delimiter using String.LastIndexOf(). Fortunately, in this case the row delimiter was `r`n (CRLF), while the line breaks inside cells were only `n (LF). I'm still not at a final solution for this file because of another issue, but I will provide another update once I've found one!
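For reference, here's a minimal sketch of that approach. The file path, the 64 KB block size, and the variable names are my own placeholders, and the actual CSV conversion is left as a comment since the header handling depends on the file:

$reader = [System.IO.StreamReader]::new('path\to\file.csv')
$buffer = [char[]]::new(65536)    # 64 KB per block; tune to taste
$carry  = ''

while (($read = $reader.ReadBlock($buffer, 0, $buffer.Length)) -gt 0) {
    $block = $carry + [string]::new($buffer, 0, $read)

    # Rows end in CRLF; line breaks inside cells are bare LF,
    # so the last CRLF marks the end of the last complete row.
    $cut = $block.LastIndexOf("`r`n")
    if ($cut -lt 0) { $carry = $block; continue }    # no complete row yet

    $complete = $block.Substring(0, $cut)
    $carry    = $block.Substring($cut + 2)           # start of the next row

    # $complete now holds only whole rows and could be parsed, e.g.:
    # $complete | ConvertFrom-Csv -Header $columnNames
}
# Any remainder in $carry is the final (possibly unterminated) row
$reader.Close()

Note that the first block will contain the header row, which you'd want to split off before converting.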

Original post:

Suppose I have a CSV that looks like this:

ID   | Message
-----+------------------
1    | Buy milk
2    | Don't forget
     | to eat
     | candy
3    | Also pizza

or, as it would be saved in the text file:

ID,Message
1,"Buy milk"
2,"Don't forget
to eat
candy"
3,"Also pizza"

How could I import the records for ID 1 and 2 without the record for ID 3?

  • `Get-Content -Head` would require knowing in advance how many line breaks are in each row.
  • `StreamReader.ReadLine()` would return physical lines like `to eat`, once again requiring knowledge of the line breaks in each row.
  • Stripping the line breaks from the file would also strip the row-ending line breaks, collapsing everything into a one-row, many-column table.

For additional context: I'm trying to import this CSV in chunks because it is so large, so if there is a read-line-by-line (or even many-lines-at-a-time) solution, I would really appreciate it, as opposed to "import the huge CSV and then pull out the rows you need."

Inflorescence
  • That is not a valid CSV format. – Doug Maurer Oct 14 '22 at 16:34
  • _"Import the huge CSV and then pull out the rows you need."_ What's the condition that determines which rows you need? As Doug pointed out above, this is not a CSV and would require parsing. Do the rows you need always start with an integer? – Santiago Squarzon Oct 14 '22 at 17:35
  • @SantiagoSquarzon The condition is ordinal; simply the "first X rows" is fine. I agree that it requires parsing, but this is a text file that I inherited and I can't like...go create a new one the right way. No, the rows do not start with an integer, and in fact there are 671 columns, further complicating matters. I wouldn't even be able to tell you which columns *could* contain line breaks, due to the aforementioned parsing issues. Is it a nightmare to deal with this file? Yes absolutely. But it is also my job, and I'd like to be able to deliver something. – Inflorescence Oct 14 '22 at 17:47
  • Each row does seem to start with `ObjectID()`, so it's likely that I'd be able to do something like `$streamReader.ReadLine()` with a conditional that says "if this doesn't start with ObjectID, it's part of the same row," which would at least be a way to identify the beginning of each row (see the sketch after these comments). I could probably insert a pipe character in the raw text and use it as a row delimiter, if there's some cmdlet that lets you specify a row delimiter. – Inflorescence Oct 14 '22 at 18:00
  • You know what, I lied: your file is a proper CSV, it's just that the `Message` column has multi-line strings. So you could definitely use `Import-Csv`, and it's your best option for excluding rows where `ID` is not what you want. But this will require you either to process the CSV in chunks or to read the CSV until you reach all desired values and then break the pipeline. The problem is that if you break the pipeline, then the next time you read a chunk you would need to exclude all the rows already consumed in the first pass, which is why I recommended my `Split-Csv` function before. – Santiago Squarzon Oct 14 '22 at 18:09
  • Have you examined the non-printing characters in the file, on the off-chance that you can differentiate the line breaks within the `Message` column from the newlines that separate rows in your CSV? – Keith Miller Oct 14 '22 at 18:29
  • @KeithMiller It looks like they might be; at least in the first 10 lines, the non-row-breaking newline characters are LF, while the row-breaking ones are CRLF. – Inflorescence Oct 14 '22 at 20:23
  • So you could read your source a block at a time using `StreamReader.ReadBlock()`, then use `$Block.LastIndexOf("`r`n")` to find the end of the last complete row in that block. Everything before that would be complete rows, and `LastIndexOf + 2` is the start of your next block. The blocks of complete rows could be piped to `ConvertFrom-Csv`... – Keith Miller Oct 14 '22 at 21:58
  • @KeithMiller I think you've hit it! Thank you so much for your help! – Inflorescence Oct 18 '22 at 13:53
  • You're welcome. I don't have real-world experience dealing with large files, so no clue as to block size, the largest string searchable with a regex, etc. So hopefully you or somebody else knows that. Please update your question with progress, and when you get it solved, please answer your own question if the answer wasn't provided here. – Keith Miller Oct 19 '22 at 17:46
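For comparison, here is a rough sketch of the line-by-line idea floated in the comments above. It assumes every logical row starts with the literal prefix ObjectID( as described there; the prefix test, the $maxRows cap, and the naive header split are all illustrative, not a vetted implementation:

# Sketch: accumulate physical lines into logical rows, starting a new
# row whenever a line begins with the per-row prefix.
$reader  = [System.IO.StreamReader]::new('path\to\file.csv')
$header  = $reader.ReadLine()
$rows    = [System.Collections.Generic.List[string]]::new()
$current = $null
$maxRows = 10                          # how many logical rows to import

while ($null -ne ($line = $reader.ReadLine())) {
    if ($line.StartsWith('ObjectID(')) {
        if ($null -ne $current) { $rows.Add($current) }
        if ($rows.Count -ge $maxRows) { break }
        $current = $line
    }
    else {
        # Continuation of a multi-line cell in the current row
        $current += "`n" + $line
    }
}
if ($null -ne $current -and $rows.Count -lt $maxRows) { $rows.Add($current) }
$reader.Close()

# Naive header split; good enough only if no header name contains a comma
$records = $rows | ConvertFrom-Csv -Header ($header -split ',')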

1 Answer


For additional context, I'm trying to import this CSV in chunks because it is so large, so if there is a read-line-by-line (or even many-lines-at-a-time) solution, I would really appreciate it. As opposed to "Import the huge CSV and then pull out the rows you need."

Import-Csv already works like this - it reads the header (if present) and then starts outputting records row by row as it goes along; it won't wait until the whole file has been read before it emits row data.

So to get the first 10 rows, simply do:

Import-Csv path\to\file.csv | Select-Object -First 10

Select-Object will interrupt Import-Csv as soon as it has received the first 10 input objects; the remaining CSV rows will remain "unread". Because Import-Csv parses quoted multi-line fields correctly, each output object corresponds to one logical row, no matter how many physical lines it spans.
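And if you need successive chunks rather than just the head of the file, the same pattern extends with -Skip (the chunk size of 10 here is just an example):

# Second chunk of 10 logical rows: skip the ones already processed
Import-Csv path\to\file.csv | Select-Object -Skip 10 -First 10

Note that each such call re-reads and re-parses the file from the top up to the end of the requested chunk, so for very large files a single streaming pass may still be preferable.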

Mathias R. Jessen