2

I have a set of CSV files that I need to import data from, the issue I'm running into is that the number of garbage rows above the header line, and their content, is always different. The header rows themselves are consistent so i could use that to detect what the starting point should be.

I'm not quite sure where to start, the files are structured as below.

Here there be garbage.
So much garbage, between 12 and 25 lines of it.
Header1,Header2,Header3,Header4,Header5
Data1,Data2,Data3,Data4,Data5

My assumption on the best method to do this would be to do something that checks for the line number of the header row and then a get-content function specifying the starting line number be the result of the preceding check.

Any guidance would be most appreciated.

2 Answers2

3

If the header line is as you say consistent, you could do something like this:

$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line

$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1] |
        ConvertFrom-Csv -Header $($header -split ',')

As per your comment you really only wanted to do a clean-up on these files instead of importing data from it (your question says "I need to import data"), all you have to do is append this line of code:

$data | Export-Csv -Path 'D:\theFile.csv' -NoTypeInformation

The line ConvertFrom-Csv -Header $($header -split ',') parses the data into an array of objects (re)using the headerline that was taken off by the split.


A pure textual approach (without parsing of the data) still needs to write out the headerline, because by splitting the file content of this removed it from the resulting array:

$header = 'Header1,Header2,Header3,Header4,Header5'
# read the file as single multiline string
# and split on the escaped header line

$data = ((Get-Content -Path 'D:\theFile.csv' -Raw) -split [regex]::Escape($header), 2)[1]

# rewrite the file with just the header line
$header | Set-Content -Path 'D:\theFile.csv'
# then write all data lines we captured in variable $data
$data | Add-Content -Path 'D:\theFile.csv'
Theo
  • 57,719
  • 8
  • 24
  • 41
  • Theo, this is an amazing start, though when I try to put the cleaned data back into a CSV file I get a data formatted as: ` @{Header1=Data1; Header2=Data2; Header3=Data3, Header4=Data4 } @{Header1=Data1; Header2=Data2; Header3=Data3, Header4=Data4 } @{Header1=Data1; Header2=Data2; Header3=Data3, Header4=Data4 } ` – Rough-Coding Jul 14 '21 at 19:49
  • @Rough-Coding, if you just want to save the cleaned-up text back to the file, don't use `ConvertFrom-Csv` at all and pipe directly to `Set-Content`. – mklement0 Jul 14 '21 at 19:54
  • @Rough-Coding Oh, I thought you wanted to import the csv data just skip the garbage at the top. To save as cleaned up csv either do what mklement0 commented, or save with `$data | Export-Csv -Path 'cleaned.csv' -NoTypeInformation` – Theo Jul 14 '21 at 20:28
  • One more and then I think I'm good, writing it back to the file, I'm left without headers. I used Mklement0's method to output. – Rough-Coding Jul 14 '21 at 21:26
  • @Rough-Coding: Pipe to `ForEach-Object { Set-Content -NoNewLine file.csv -Value $header, $_ }` – mklement0 Jul 14 '21 at 22:08
  • @Rough-Coding You will only lose the headers if you indeed remove the `ConvertFrom-Csv` line (including the **important** `-Header $($header -split ',')` part) and replace that by just Set-Content. If you leave my code intact and save the data (including the header) using `$data | Export-Csv -Path 'cleaned.csv' -NoTypeInformation` as I suggested, you will end up with a valid CSV file, _with_ headers. – Theo Jul 15 '21 at 07:48
  • @Rough-Coding sorry for not responding earlier, but it was getting very late in my part of the world.. Please see my edits to explain the code – Theo Jul 15 '21 at 07:59
  • Thank you very much gents, this proved a far more elegant solution compared to the hack job I'd come up with. – Rough-Coding Jul 15 '21 at 13:10
3

To offer a slightly more concise (and marginally more efficient) alternative to Theo's helpful answer, using the -replace operator:

If you want to import the malformed CSV file directly:

(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
  ConvertFrom-Csv 

If you want to save the cleaned-up data back to the original file (adjust -Encoding as needed):

(Get-Content -Raw file.csv) -replace '(?sm)\A.*(?=^Header1,Header2,Header3,Header4,Header5$)' |
  Set-Content -NoNewLine -Encoding utf8 file.csv

Explanation of the regex:

  • (?sm) sets the following regex options: single-line (s: make . match newlines too) and multi-line (m: make ^ and $ also match the start and end of individual lines inside a multi-line string).

  • \A.* matches any (possibly empty) text (.*) from the very start (\A) of the input string.

  • (?=...) is a positive lookahead assertion that matches the enclosed subexpression (symbolized by ... here) without consuming it (making it part of what the regex considers the matching part of the string).

    • ^Header1,Header2,Header3,Header4,Header5$ matches the header line of interest, as a full line.
mklement0
  • 382,024
  • 64
  • 607
  • 775