I've saved the HTML source from a page full of membership information. There are a half-dozen bits of info for each of 100 members scattered systematically (oxymoron, right?) through the HTML code. I've analyzed the structure of the page and I've used Just Great Software's RegexBuddy to puzzle out a series of regex replace operations that leave me with a tab-delimited list of member names, cities, etc. That works just fine.
Now I'd like to script that series of regexes in PowerShell. In RegexBuddy I used .NET-style regexes, so I figured they'd transfer over to PowerShell without any trouble.
I put together a single PowerShell command that starts with a Get-Content cmdlet and pipes it into a series of Foreach-Object processes. There are ELEVEN Foreach-Object processes in a bucket brigade, concluding with a Set-Content cmdlet to write the output to another text file. Like this:
(Get-Content "C:\Temp\inputfile.txt")
| Foreach-Object { $_ -replace '<search string 1>', '<replace string 1>' }
| Foreach-Object { $_ -replace '<search string 2>', '<replace string 2>' }
| ... rinse and repeat ... | Set-Content "C:\Temp\outputfile.txt"
(All of the above code is in one line.) What I'm seeing is that this script stops after the first -replace.
Is piping a dozen times too much? Should I perhaps save the output of each Foreach-Object cmdlet into another variable and feed that into the next Foreach-Object, each a separate PowerShell command?
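One thing that may be worth ruling out before blaming the number of pipes: Get-Content without -Raw emits the file one line at a time, so a pattern containing \n never sees two lines together. A minimal sketch, assuming a hypothetical two-line sample in place of the real file (the variable names are illustrative only):

```powershell
# Hypothetical two-line sample standing in for the saved HTML.
$lines = @(
    '<div id="CompanyNamePanel_13" class="company-name">'
    'Acme Computers Inc'
)

# A pattern that spans two lines; (?m) lets ^ and $ match at line breaks.
$pattern     = '(?m)^<div[^\n]*NamePanel_(?<id>\d+)[^\n]*class="company-name">\n^(?<company>[^\n]*?)$'
$replacement = 'MEMBID: ${id} COMPANY: ${company}'

# Line by line, as Get-Content feeds the pipeline by default:
# no single line contains \n, so nothing is replaced.
$perLine = $lines | ForEach-Object { $_ -replace $pattern, $replacement }

# The whole text as one string, which is what Get-Content -Raw
# (PowerShell 3.0 and later) would return for a file.
$whole = ($lines -join "`n") -replace $pattern, $replacement
```

If the file has Windows line endings, the raw text contains \r\n, so the patterns would probably need \r?\n in place of \n. In .NET, $ in multiline mode matches before \n only, so a stray \r can also end up inside a capture.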
~~~~~~~~~~~~~~~~~~~~~~~~
I should have added a sample of the HTML I'm trying to parse with PowerShell. In this sample I'm looking to extract the items demarcated with %%. Again, I've puzzled out the regexes with RegexBuddy and I can get what I want with it as long as I'm willing to copy-and-paste each intermediate result back into the main search window.
<div id="DisplayNamePanel_%%13%%" class="member-name">
<a id="DisplayName_13" biokey="70fc74b3-bf94-4f5f-8474-c9c7d680a4bc"
href="%%/engage/member-directory/members/profile/?UserKey=70fc74b3-bf94-
4f5f-8474-c9c7d680a4bc%%">
%%Amy Armstrong%%</a>
</div>
</td>
<td>
<div id="CompanyNamePanel_13" class="company-name">
%%Acme Computers Inc%%
</div>
<div id="CompanyTitlePanel_13" class="company-title">
%%Sales Coordinator/Customer Service Manager%%
</div>
<div id="Addr1Panel_13">
<div class="map-icon">
<a id="Addr1MapLink_13"
href="http://maps.google.com/?q=East+Middleton, WI, 54444"
target="_blank">
<img id="MapIcon_13" src="https://d2x5ku95GoogleMapIcon1.png"
alt="Google Map Icon" /></a>
</div>
<div class="list-address-panel">
<div id="Contacts_Addr1Panel4_13">
%%East Middleton, WI%%
</div>
<div id="MainCopy_ctl33_FindContacts_Contacts_Addr1Panel5_13">
United States
</div>
<div id="MainCopy_ctl33_FindContacts_Contacts_Addr1PhonePanel_13">
.
.
.
</tr>
<tr>
<td>
<table id="ImageContainer_14" cellpadding="0" cellspacing="0">
<tr>
<td style="border:none;">
etc.... for a hundred blocks of HTML code.
I've abbreviated the HTML code somewhat to strip out the additional .NET element flags, but this is an accurate picture of the code I'm parsing.
My regexes find bits of the information highlighted above and make them stand out from the rest of the HTML code by adding "MEMBID: " to the beginning of the lines that contain the City and State, Company name, Title, etc. Then I look for those flagged lines and additional regexes string all of those lines together. At the end I clean up and get rid of all the lines that don't begin with MEMBID.
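The flag-then-filter idea described above can be sketched on a single string. This is only an illustration under assumptions: the sample text is a hypothetical, already-trimmed excerpt, the two patterns are simplified versions of the real ones, and the final cleanup is done with Where-Object rather than a regex.

```powershell
# Hypothetical trimmed excerpt; the real text would come from the saved HTML file.
$text = @(
    '<div id="CompanyNamePanel_13" class="company-name">'
    'Acme Computers Inc'
    '</div>'
    '<div id="Contacts_Addr1Panel4_13">'
    'East Middleton, WI'
    '</div>'
) -join "`n"

# Simplified flagging patterns (assumptions, not the full set).
$companyPattern = '(?m)^<div[^\n]*NamePanel_(?<id>\d+)[^\n]*class="company-name">\n(?<company>[^\n]*)$'
$cityPattern    = '(?m)^<div[^\n]*Addr1Panel4_(?<id>\d+)[^\n]*\n(?<cityst>[^\n]*)$'

# -replace operators can be chained on one string; no ForEach-Object is needed.
$flagged = $text -replace $companyPattern, 'MEMBID: ${id} COMPANY: ${company}' -replace $cityPattern, 'MEMBID: ${id} CITYSTATE: ${cityst}'

# Keep only the lines that were flagged.
$kept = $flagged -split "`n" | Where-Object { $_ -like 'MEMBID:*' }
```

Here $kept holds the two flagged lines and the leftover </div> lines are dropped.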
Here are my regexes (the sample code has already had the first one run against it to remove beginning-of-line spaces... lots of those!):
Foreach-Object { $_ -replace '^\s+','' };
Foreach-Object { $_ -replace '^[^\n]*?DisplayNamePanel_(?<membid>\d+)" class="member-name">\n^[^\n]*?href="(?<href>[^\n]*?)">(?<name>[^\n]*?)<[^\n]*?$',
    'MEMBID: ${membid} HREF: ${href} NAME: ${name}' };
Foreach-Object { $_ -replace '^<div[^\n]*NamePanel_(?<id>\d+)[^\n]*? class="company-name">\n^(?<company>[^\n]*?)$',
    'MEMBID: ${id} COMPANY: ${company}' };
Foreach-Object { $_ -replace '^<div[^\n]*?Addr1Panel4_(?<id>\d+)[^\n]*?$\n^(?<cityst>[^\n]*?)$',
    'MEMBID: ${id} CITYSTATE: ${cityst}' };
Foreach-Object { $_ -replace '^<div[^\n]*TitlePanel_(?<id>\d+)[^\n]*? class="company-title">\n^(?<title>[^\n]*?)$',
    'MEMBID: ${id} TITLE: ${title}' };
Foreach-Object { $_ -replace '^(?<memb>MEMBID: \d+)\t(?<href>HREF: [^\n]*?)UserKey=(?<user>[^\n]*?)\t(?<name>[^\n]*?)$',
    '${memb} ${href}UserKey=${user} ${name}<<<HRT>>>${memb} USERKEY: ${user}' };
Foreach-Object { $_ -replace '^(?!MEMBID).*?$\n', '' };
Foreach-Object { $_ -replace '^MEMBID: ', '' };
Foreach-Object { $_ -replace '^(?<id>\d+)\tHREF: (?<href>[^\n]*?)\tNAME: (?<name>[^\n]*?$)\n\d+\tUSERKEY: (?<user>[^\n]*?)',
    '${id} ${href} ${name} ${user}' };
Foreach-Object { $_ -replace '^(?<name>\d+\t/[^\n]*?)\n\d+\tCOMPANY: (?<co>[^\n]*?)$\n\d+\tCITYSTATE: (?<city>[^\n]*?)$',
    '${name} ${co} ${city}' };
Foreach-Object { $_ -replace '^(?<name>\d+\t/[^\n]*?)\n\d+\tCOMPANY: (?<co>[^\n]*?)$\n\d+\tTITLE: (?<title>[^\n]*?)$\n\d+\tCITYSTATE: (?<city>[^\n]*?)$',
    '${name} ${co} ${title} ${city}' };
The HTML code is almost completely consistent. Towards the end of my regexes I have to allow for the fact that about half of the members don't have Titles, and the HTML code doesn't have an empty 'class="company-title"' element at those spots. Oh, well!
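The missing-Title case might also be handled in one pass by making the TITLE line an optional group. The sketch below is an assumption, not the regexes used above: the sample rows are hypothetical (member 14 and its values are invented for illustration), and the pattern is a simplified merge of the two final steps.

```powershell
# Hypothetical flagged lines: member 13 has a TITLE line, member 14 does not.
$flagged = @(
    "13`t/profile/?UserKey=aaa`tAmy Armstrong`taaa"
    "13`tCOMPANY: Acme Computers Inc"
    "13`tTITLE: Sales Coordinator"
    "13`tCITYSTATE: East Middleton, WI"
    "14`t/profile/?UserKey=bbb`tBob Example`tbbb"
    "14`tCOMPANY: Example Corp"
    "14`tCITYSTATE: Madison, WI"
) -join "`n"

# (?: ... )? makes the TITLE line optional; (?m) lets ^ and $ match at line breaks.
$pattern = '(?m)^(?<name>\d+\t/[^\n]*)\n\d+\tCOMPANY: (?<co>[^\n]*)(?:\n\d+\tTITLE: (?<title>[^\n]*))?\n\d+\tCITYSTATE: (?<city>[^\n]*)$'

# Join the group references with real tab characters.
$replacement = '${name}', '${co}', '${title}', '${city}' -join "`t"

# An unmatched optional group is replaced with an empty string,
# so members without a Title keep the same number of columns.
$rows = $flagged -replace $pattern, $replacement
```

With this layout every output row has the same column count, and the Title column is simply empty for members who don't have one.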
There is one spot where a hard return is part of the replacement string. I've designated that with <<<HRT>>>.
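For the hard return itself, a single-quoted PowerShell string leaves `n and \n untouched, and a .NET replacement string does not expand \n either, so the line break has to be a real character in the string. A minimal sketch, assuming a hypothetical flagged line and tab-separated fields:

```powershell
# Hypothetical flagged line with tab-separated fields.
$line = "MEMBID: 13`tHREF: /profile/?UserKey=aaa`tNAME: Amy Armstrong"

$pattern = '^(?<memb>MEMBID: \d+)\t(?<href>HREF: [^\n]*?)UserKey=(?<user>[^\n]*?)\t(?<name>[^\n]*?)$'

# Single-quoted pieces keep ${...} literal for the regex engine;
# double-quoted "`t" and "`n" supply the real tab and newline characters.
$replacement = '${memb}' + "`t" + '${href}UserKey=${user}' + "`t" + '${name}' + "`n" + '${memb}' + "`t" + 'USERKEY: ${user}'

$result = $line -replace $pattern, $replacement
```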
Again, I apologize for not giving the extra detail.