
I've saved the HTML source from a page full of membership information. There are a half-dozen bits of info for each of 100 members scattered systematically (oxymoron, right?) through the HTML code. I've analyzed the structure of the page and I've used Just Great Software's RegexBuddy to puzzle out a series of regex replace operations that leave me with a tab-delimited list of member names, cities, etc. That works just fine.

Now I'd like to script that series of regexes in PowerShell. In RegexBuddy I used .NET-style regexes; so I figured that they'd transfer over to PowerShell without any trouble.

I put together a single PowerShell command that starts with a Get-Content cmdlet and pipes it into a series of Foreach-Object processes. There are ELEVEN Foreach-Object processes in a bucket brigade, concluding with a Set-Content cmdlet to write the output to another text file. Like this:

(Get-Content "C:\Temp\inputfile.txt") 
| Foreach-Object { $_ -replace '<search string 1>', '<replace string 1>' } 
| Foreach-Object { $_ -replace '<search string 2>', '<replace string 2>' } 
| ... rinse and repeat ... | Set-Content "C:\Temp\outputfile.txt"

(All of the above code is in one line.) What I'm seeing is that this script stops after the first -replace.

Is piping a dozen times too much? Should I perhaps save the output of each Foreach-Object cmdlet into another variable and feed that into the next Foreach-Object, each a separate PowerShell command?
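In other words, something like this sketch — placeholders, not the real patterns — where each stage could be inspected on its own:

```powershell
# Sketch of the intermediate-variable approach I'm asking about.
# Paths and search/replace strings are placeholders.
$step0 = Get-Content "C:\Temp\inputfile.txt"
$step1 = $step0 | ForEach-Object { $_ -replace '<search string 1>', '<replace string 1>' }
$step2 = $step1 | ForEach-Object { $_ -replace '<search string 2>', '<replace string 2>' }
# ... rinse and repeat ...
$step2 | Set-Content "C:\Temp\outputfile.txt"
```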

EDIT: I should have added a sample of the HTML I'm trying to parse with PowerShell. In this sample I'm looking to extract the items demarcated with %%. Again, I've puzzled out the regexes with RegexBuddy, and I can get what I want with it as long as I'm willing to copy-and-paste each intermediate result back into the main search window.

<div id="DisplayNamePanel_%%13%%" class="member-name">
<a id="DisplayName_13" biokey="70fc74b3-bf94-4f5f-8474-c9c7d680a4bc" 
    href="%%/engage/member-directory/members/profile/?UserKey=70fc74b3-bf94-
    4f5f-8474-c9c7d680a4bc%%">
%%Amy Armstrong%%</a>
</div>
</td>
<td>
<div id="CompanyNamePanel_13" class="company-name">
%%Acme Computers Inc%%
</div>
<div id="CompanyTitlePanel_13" class="company-title">
%%Sales Coordinator/Customer Service Manager%%
</div>
<div id="Addr1Panel_13">
<div class="map-icon">
<a id="Addr1MapLink_13" 
    href="http://maps.google.com/?q=East+Middleton, WI, 54444" 
    target="_blank">
    <img id="MapIcon_13" src="https://d2x5ku95GoogleMapIcon1.png" 
    alt="Google Map Icon" /></a>
</div>
<div class="list-address-panel">
<div id="Contacts_Addr1Panel4_13">
%%East Middleton, WI%%
</div>
<div id="MainCopy_ctl33_FindContacts_Contacts_Addr1Panel5_13">
United States
</div>
<div id="MainCopy_ctl33_FindContacts_Contacts_Addr1PhonePanel_13">
.
.
.
</tr>
<tr>
<td>
<table id="ImageContainer_14" cellpadding="0" cellspacing="0">
<tr>
<td style="border:none;">
etc.... for a hundred blocks of HTML code.

I've abbreviated the HTML code somewhat to strip out the additional .NET element flags; but this is an accurate picture of the code I'm parsing.

My regexes find the bits of information highlighted above and make them stand out from the rest of the HTML code by adding "MEMBID: " to the beginning of the lines that contain the city and state, company name, title, etc. Then additional regexes look for those flagged lines and string them together. At the end I clean up and get rid of all the lines that don't begin with MEMBID.

Here are my regexes (the sample code has already had the first one run against it to remove beginning-of-line spaces... lots of those!):

Foreach-Object { $_ -replace '^\s+','' };
Foreach-Object { $_ -replace '^[^\n]*?DisplayNamePanel_(?<membid>\d+)
    " class="member-name">\n^[^\n]*?href="(?<href>[^\n]*?)">(?<name>[^\n]*?)
    <[^\n]*?$'
    , 'MEMBID: ${membid}    HREF: ${href}   NAME: ${name}' };
Foreach-Object { $_ -replace '^<div[^\n]*NamePanel_(?<id>\d+)[^\n]*?
    class="company-name">\n^(?<company>[^\n]*?)$'
    , 'MEMBID: ${id}    COMPANY: ${company}' };
Foreach-Object { $_ -replace '^<div[^\n]*?Addr1Panel4_(?<id>\d+)[^\n]*?
    $\n^(?<cityst>[^\n]*?)$'
    , 'MEMBID: ${id}    CITYSTATE: ${cityst}' };
Foreach-Object { $_ -replace '^<div[^\n]*TitlePanel_(?<id>\d+)[^\n]*?
    class="company-title">\n^(?<title>[^\n]*?)$'
    , 'MEMBID: ${id}    TITLE: ${title}' };
Foreach-Object { $_ -replace '^(?<memb>MEMBID: \d+)\t(?<href>HREF: 
    [^\n]*?)UserKey=(?<user>[^\n]*?)\t(?<name>[^\n]*?)$'
    , '${memb}  ${href}UserKey=${user}  ${name}<<<HRT>>>
    ${memb} USERKEY: ${user}' };
Foreach-Object { $_ -replace '^[^(MEMB)].*?$\n', '' };
Foreach-Object { $_ -replace '^MEMBID: ', '' };
Foreach-Object { $_ -replace '^(?<id>\d+)\tHREF: (?<href>[^\n]*?)\tNAME: (?
    <name>[^\n]*?$)\n\d+\tUSERKEY: (?<user>[^\n]*?)'
    , '${id}    ${href} ${name} ${user}' };
Foreach-Object { $_ -replace '^(?<name>\d+\t/[^\n]*?)\n\d+\tCOMPANY: (?<co>
    [^\n]*?)$\n\d+\tCITYSTATE: (?<city>[^\n]*?)$'
    , '${name}  ${co}       ${city}' };
Foreach-Object { $_ -replace '^(?<name>\d+\t/[^\n]*?)\n\d+\tCOMPANY: (?<co>
    [^\n]*?)$\n\d+\tTITLE: (?<title>[^\n]*?)$\n\d+\tCITYSTATE: (?<city>
    [^\n]*?)$'
    , '${name}  ${co}   ${title}    ${city}' };

The HTML code is almost completely consistent. Towards the end of my regexes I have to allow for the fact that about half of the members don't have Titles; and the HTML code doesn't have an empty 'class="company-title"' at those spots. Oh, well!
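For what it's worth, here's a sketch of how the missing-title case could be folded into one pattern with an optional group; this is illustrative, not the exact regex I ran, and it assumes the flagged file is held in a variable `$text` as one string:

```powershell
# Sketch only: make the TITLE line optional with (?: ... )? so members
# without a company-title block still match. Assumes $text holds the
# whole flagged file as a single string; (?m) turns on multiline mode.
$pattern = '(?m)^(?<name>\d+\t/[^\n]*?)\n\d+\tCOMPANY: (?<co>[^\n]*?)$' +
           '(?:\n\d+\tTITLE: (?<title>[^\n]*?)$)?' +
           '\n\d+\tCITYSTATE: (?<city>[^\n]*?)$'
$text -replace $pattern, '${name}  ${co}  ${title}  ${city}'
```

When the optional group doesn't participate in a match, `${title}` expands to an empty string, so the members without titles just get an empty column.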

There is one spot where a Hard Return is part of the REPLACE string. I've designated that with <<<HRT>>>.

Again, I apologize for not giving the extra detail.

serbach
  • It's hard to troubleshoot without sample text, regexes, output and expected output. E.g. maybe the changes from your first regex modify the text so the second doesn't match anymore. You can combine all replace operations in one foreach: `$_ -replace 'foo', 'bar' -replace 'bar', 'foo'` – Frode F. Mar 18 '16 at 21:21
  • [Parsing HTML with regular expressions ...](http://stackoverflow.com/a/1732454/1630171) – Ansgar Wiechers Mar 18 '16 at 21:22
  • HTML + regex = No. Remote HTML (website) + `Invoke-WebRequest` = Yes. Local HTML (saved file) + HTML Agility Pack (http://htmlagilitypack.codeplex.com/) or `InternetExplorer.Application` COM object = Yes – Frode F. Mar 18 '16 at 22:11
  • Whenever I have to parse HTML with PowerShell I import the HTML as XML so I get easy-to-work-with objects. After that I do the stuff that's needed (inject stuff or extract data) – bluuf Mar 20 '16 at 17:53
  • Ansgar, I read that link with pleasure! Thanks for posting that. Proverbs 26:11 certainly describes me! – serbach Mar 21 '16 at 10:37
  • bluuf, That's a good idea. I usually approach this sort of thing the hard way... I'll try your suggestion. Thank you. – serbach Mar 21 '16 at 10:46
  • Frode F., I will look into your suggestions. Thank you. I'm also looking into screen/web scraping apps. – serbach Mar 21 '16 at 10:48

2 Answers


EDIT: Try this:

(Get-Content "C:\Temp\inputfile.txt")
| Foreach-Object { $_ -replace '<search string 1>', '<replace string 1>' `
-replace '<search string 2>', '<replace string 2>' `
-replace '<search string 3>', '<replace string 3>' `
... rinse and repeat ... } | Set-Content "C:\Temp\outputfile.txt"

This combines all of the replace operations into one ForEach-Object. Not sure if it will help, but it might.

I ran some tests in PowerShell and, based on your script example, it should all work. The only thing I can think of is that there is a problem parsing the input file.

Original Post:

Can you post a sample of the input file? Make sure to anonymize any user data like usernames and passwords. Here is what I tested with:

PS C:\Users\bmcnab> Write-Output "Hello
>> Goodbye" > test.txt

PS C:\Users\bmcnab> Get-Content test.txt
Hello
Goodbye

PS C:\Users\bmcnab> (Get-Content test.txt) | foreach { $_ -replace 'H','h' } | foreach { $_ -replace 'G','g' } | Set-Content "test1.txt"

PS C:\Users\bmcnab> Get-Content .\test1.txt
hello
goodbye

I do not believe that piping 11 times would cause any problems. There may certainly be more efficient ways to do it, but these operations are not CPU intensive, so it really shouldn't be a problem.
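One more thing worth checking: by default Get-Content emits the file one line at a time, so a `-replace` pattern that contains `\n` has nothing to match — no single pipeline element ever contains a newline. A sketch of reading the whole file as one string instead (paths are placeholders; `-Raw` requires PowerShell 3.0+):

```powershell
# Get-Content normally emits one string per line, so a -replace pattern
# containing \n can never match. -Raw (PowerShell 3.0+) reads the whole
# file as a single string instead.
$text = Get-Content "C:\Temp\inputfile.txt" -Raw
# With one big string, use (?m) so ^ and $ anchor at each line:
$text = $text -replace '(?m)^[ \t]+', ''   # strip leading whitespace per line
# ... run the remaining multi-line replaces against $text ...
Set-Content "C:\Temp\outputfile.txt" -Value $text
```

On PowerShell 2.0, `[System.IO.File]::ReadAllText($path)` does the same job as `-Raw`.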

  • You get overhead for every foreach-call. Combining the replace operations inside one foreach is better and doesn't require any modifications to the regex. If the input follows the same pattern you could also make a single regex that does it all. :-) – Frode F. Mar 18 '16 at 21:28
  • Good idea @FrodeF. . I didn't think about that. If there is something going on where his foreach is getting stuck _after_ the first one, then this may solve it. –  Mar 18 '16 at 21:38

After further review, and confirming that the HTML code for the pages I'm attempting to scrape is formatted consistently enough, I cut down the number of regexes from 11 or 12 to 4. I haven't stuffed them all into a PowerShell script yet (still using RegexBuddy). I feel sure that it's doable but since it isn't an everyday thing, I may stick with RegexBuddy.

I will take Frode F.'s suggestion to download the HTML Agility Pack. I'm also looking at Tidy for conversion of HTML to XML.
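For the record, a rough sketch of what the Agility Pack route might look like from PowerShell — the DLL path is a placeholder and the XPath is an assumption based on the sample HTML above, not tested code:

```powershell
# Sketch: HTML Agility Pack from PowerShell. The DLL path is a
# placeholder; the XPath assumes the member-name markup shown above.
Add-Type -Path "C:\Libs\HtmlAgilityPack.dll"
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.Load("C:\Temp\inputfile.html")
$nameLinks = $doc.DocumentNode.SelectNodes('//div[@class="member-name"]/a')
$nameLinks | ForEach-Object { $_.InnerText.Trim() }
```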

Thank you all for your comments. I've printed bobince's HTML-parsing-with-regexes rant for the education and amusement of our IT department.

serbach