
All, I'm very new to PowerShell and am hoping someone can get me going on what I think should be a simple script.

I need to parse a text file, capture certain lines from it, and save those lines as a csv file.

For example, each alert is in its own text file. Each file is similar to this:

--start of file ---

Name John Smith
Dept Accounting
Codes bas-2349,cav-3928,deg-3942
            iye-2830,tel-3890
Urls hxxp://blah.com
        hxxp://foo.com, hxxp://foo2.com
Some text I dont care about
More text i dont care about
Comments
---------
"here is a multi line
comment I need
to capture"
Some text I dont care about
More text i dont care about
Date 3/12/2013

---END of file---

For each text file I want to write only the Name, Codes, and Urls fields to a CSV file. Could someone help me get going on this?

I'm more of a Perl guy, so I know I could write a regex to capture a single line beginning with Name. However, I am completely lost on how to read the "Codes" field when it might be one line or X lines long before I run into the Urls field.

Any help would be greatly appreciated!

J. S.
  • How much data are you looking to process? PS may not be the best choice unless you are constrained otherwise. [This answer](http://stackoverflow.com/a/4192419/326543) discusses a performance benchmark on PS text processing – Srikanth Venugopalan Mar 13 '13 at 03:38

5 Answers


If the file is not too big to be processed in memory, the simple way is to read it as an array of strings. (What "too big" means depends on your system; anything sub-gigabyte should work without much of a hiccup.)

After you've read the file, set up head and tail counters pointing to element zero. Move the tail pointer forward row by row until you find the date row (you can match rows with regexes). Now you know the start and end of a single record. For the next record, set the head counter to tail+1, the tail to tail+2, and start scanning rows again. Lather, rinse, repeat until the end of the array is reached.

When a record is matched, you can extract the name with a regex. Codes and Urls are a bit trickier: match the Codes row with a regex, then extract it and all the following rows as long as they match the code pattern. The same goes for the Urls data. If the continuation rows for Codes and Urls are always whitespace-padded, you can also match the leading whitespace with a regex to pick up the data rows.
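A minimal sketch of that head/tail scan (the file path, the `Date` terminator, and the indented-continuation layout are assumptions based on the sample in the question):

```powershell
# Read the whole file into an array of lines (fine for sub-gigabyte files)
$lines = Get-Content 'C:\Test\alert.txt'

$head = 0
for ($tail = 0; $tail -lt $lines.Count; $tail++) {
    if ($lines[$tail] -notmatch '^Date ') { continue }   # Date row ends a record

    $record = $lines[$head..$tail]

    # Name is always a single row; strip the keyword to get the value
    $name = ($record -match '^Name ')[0] -replace '^Name '

    # Codes: take the keyword row, then keep taking rows while they
    # are indented continuation lines
    $codes = @()
    $capturing = $false
    foreach ($row in $record) {
        if ($row -match '^Codes (.*)') { $codes += $Matches[1]; $capturing = $true }
        elseif ($capturing -and $row -match '^\s+(\S.*)') { $codes += $Matches[1] }
        else { $capturing = $false }
    }
    # ...repeat the same loop for Urls...

    $head = $tail + 1   # the next record starts after the Date row
}
```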

vonPryz

Maybe something like this would do it:

# Indented continuation lines match neither pattern, so they keep
# the previous value of $Capture.
foreach ($Line in Get-Content file.txt) {
    switch -regex ($Line) {
        '^(Name|Dept|Codes|Urls)' {
            # Line starts with a wanted keyword: start capturing
            $Capture = $true
            break
        }
        '^[A-Za-z0-9_-]+' {
            # Any other non-indented line: stop capturing
            $Capture = $false
            break
        }
    }
    if ($Capture) {
        $Line
    }
}

If you want the end result as a CSV file, you can use the Export-Csv cmdlet.
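For instance, once the wanted lines have been collected into fields, they could be wrapped in objects and piped out (the property names and paths here are assumptions, not part of the snippet above):

```powershell
# Hypothetical sketch: one object per parsed alert file
$rows = foreach ($file in Get-ChildItem 'C:\Test' -Filter *.txt) {
    # ...parse $file here to fill $name, $codes, $urls...
    New-Object PSObject -Property @{ Name = $name; Codes = $codes; Urls = $urls }
}
$rows | Export-Csv 'C:\Test\out.csv' -NoTypeInformation
```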

Anders Zommarin

If all files have the same structure you could do something like this:

$srcdir  = "C:\Test"
$outfile = "$srcdir\out.csv"

$re = '^Name (.*(?:\r\n .*)*)\r\n' +
      'Dept .*(?:\r\n .*)*\r\n' +
      'Codes (.*(?:\r\n .*)*)\r\n' +
      'Urls (.*(?:\r\n .*)*)' +
      '[\s\S]*$'

Get-ChildItem $srcdir -Filter *.txt | % {
  [io.file]::ReadAllText($_.FullName)
} | Select-String $re | % {
  $f = $_.Matches | % { $_.Groups } | ? { $_.Index -gt 0 }
  New-Object -TypeName PSObject -Prop @{
      'Name'  = $f[0].Value;
      'Codes' = $f[1].Value;
      'Urls'  = $f[2].Value;
    }
} | Export-Csv $outfile -NoTypeInformation
Ansgar Wiechers

Text parsing usually means regex. With regex, sometimes you need anchors to know when to stop a match and that can make you care about text you otherwise wouldn't. If you can specify that first line of "Some text I don't care about" you can use that to "anchor" your match of the URLs so you know when to stop matching.

$regex = @'
(?ms)Name (.+)?
 Dept .+?
 Codes (.+)?
 Urls (.+)?
 Some text I dont care about.+
 Comments
 ---------
 (.+)?
 Some text I dont care about 
'@

$file = 'c:\somedir\somefile.txt'
[IO.File]::ReadAllText($file) -match $regex
if ([IO.File]::ReadAllText($file) -match $regex)
  {
   $Name = $matches[1]
   $Codes = $matches[2] -replace '\s+',','
   $Urls = $matches[3] -replace '\s+',','
   $comment = $matches[4] -replace '\s+',' '
  }

$Name
$Codes
$Urls
$comment
mjolinor
  • The OP was specifically asking for help with continued lines. – Ansgar Wiechers Mar 13 '13 at 13:44
  • The file read method was incorrect (corrected now). Other than that this is a multi-line regex - i.e. it is for matching and capturing data from continued lines. – mjolinor Mar 13 '13 at 23:14
  • I see, you tailored the regexp to the literal text from the question while I was assuming that the keywords would be at the beginning of the line. The OP will probably have to clarify that. However, your regexp relies on knowledge about text that the OP says he doesn't care about. That text may differ from file to file, which would be a problem. – Ansgar Wiechers Mar 14 '13 at 08:12
  • The regex is easily adjustable for the spacing if it isn't exactly as presented. The OP doesn't care about the text that follows because it doesn't contain data he wants to capture. I used the literal text provided as an example, following the OP's lead. As long as it's predictable, substituting the actual text to match will work. That's only a problem if that text is not predictable. Given the context of the surrounding text, I think that's unlikely. – mjolinor Mar 14 '13 at 10:30
  • mjolinor, I like the direction you are taking this. However I can not get that code to output anything. I have pointed it to a file on my local filesystem using the test data I posted above. It reads the file fine, because if i give it an invalid file name powershell complains to me. However it runs but it does not display anything. I put it in a file ending in .ps1 and executed via powershell using .\my-script.ps1 What am I doing wrong? – J. S. Mar 16 '13 at 15:47
  • I don't know what you're doing wrong. If I run the posted script against a file containing the test data you've posted, I get back the output displayed. Same result whether I run it from the ISE or call it from a PS console. I made an update to the script that will make it output either True or False depending on whether it found a match. Try that. If it outputs "False" then we aren't working with the same data. – mjolinor Mar 16 '13 at 16:48
  • OK I found out what it was. My real file is actually left justified. So "Names" is all the way to the left. But when "Codes" runs over multiple lines it is indented. The good news is I got your code to work and I updated my sample file above to try and reflect this better. One thing I can not figure out is this. If I need to skip down multiple lines and then match another field that is multi lines how do I do that? Take for instance my "Comments" field. I would love to capture those lines and if possible ignore the line beginning with "-----". Is that easy to do here? – J. S. Mar 16 '13 at 18:51

Given that c:\temp\file.txt contains:

Name John Smith
Dept Accounting
Codes bas-2349,cav-3928,deg-3942
      iye-2830,tel-3890
Urls hxxp://blah.com
     hxxp://foo.com
     hxxp://foo2.com
Some text I dont care about
More text i dont care about
.
.
Date 3/12/2013

You can use regular expressions like this:

$a = Get-Content C:\temp\file.txt
# $a is an array of lines; [regex]::match coerces it to a single
# space-joined string, which lets the pattern match across lines
$b = [regex]::match($a, "^.*Codes (.*)Urls (.*)Some.*$", "Multiline")
# Collapse the runs of spaces left by the indentation into commas
$codes = $b.groups[1].value -replace '[ ]{2,}',','
$urls = $b.groups[2].value -replace '[ ]{2,}',','
JPBlanc