I'm still a beginner in PowerShell, so I decided to ask after trying without success.

I have HTML like the sample below. I need to extract the word "Chile", which sits in the header cell of the tr row, plus all values in the td cells, and export them to a .txt file.

The code below works perfectly, BUT it depends on the font color:

$result = [regex]::Matches($content, 'style="color&#58;black;".*?>(.*?)</span>')
$result | select { ($_.Groups[1].Value -replace '&#160;', '' -replace '​', '').Trim().Trim(',')} | Out-file $outfile -Encoding ascii

As you can see in the HTML, some columns (TD) do not have that pattern.

How can I get these values in PowerShell? I've tried the options below, but no luck:

$result = [regex]::Matches($content, 'style="windowtext;".*?>(.*?)</td>')
$result | select { ($_.Groups[1].Value -replace '&#160;', '').Trim().Trim(',')} | Out-file $outfile

$result = [regex]::Matches($content, '<td.*?>(.+)</td>')

$result = [regex]::Matches($content, '<td.*?>(.*?)</td>') | % { $_.Captures[0].Groups[1].value} | Out-file $outfile

Again, I need to extract the word "Chile" from the TR row and all values in the TD cells, and export them to a .txt file.

<tr class="ms-rteFontSize-1 ms-rteTableOddRow-1" dir="rtl" style="height&#58;15pt;"><th class="ms-rteTableFirstCol-1" rowspan="1" colspan="1" style="border-width&#58;medium 1pt 1pt;border-style&#58;none solid solid;padding&#58;0in 5.4pt;width&#58;100px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;border-left-color&#58;windowtext;"><div><b><span style="color&#58;black;">Chile</span></b></div></th>
<td width="64" class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;">2</td>
<td class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;66px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;">&#160;</td>
<td class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;81px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;">&#160;</td>
<td width="64" class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;">14,19</td>
<td width="64" class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">1</span></div></td>
<td width="64" class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">26</span></div></td>
<td width="64" class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;">&#160;</td>
<td width="64" class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">15</span></div></td>
<td class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;80px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">18,19</span></div></td>
<td width="64" class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;48pt;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">9,27</span></div></td>
<td class="ms-rteTableOddCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;80px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">1</span></div></td>
<td class="ms-rteTableEvenCol-1" valign="bottom" style="border-width&#58;medium 1pt 1pt medium;border-style&#58;none solid solid none;padding&#58;0in 5.4pt;width&#58;80px;height&#58;15pt;border-right-color&#58;windowtext;border-bottom-color&#58;windowtext;"><div><span style="color&#58;black;">8,25</span></div></td></tr>
DevHawk
  • You need to post the relevant lines from the file you are attempting to parse. While there is no doubt what `<tr>` and `<td>` tags are, your individual use and *style* or *other format* specifiers may make a difference. See: [**How to create a Minimal, Complete, and Verifiable example**](http://stackoverflow.com/help/mcve). – David C. Rankin Jul 14 '17 at 00:13
  • Is this a complete HTML document you're parsing, or only a portion of one? Alas, you [can't parse HTML with regex](https://stackoverflow.com/a/1732454/1324345), so hopefully you've got a full HTML document (because then you can get at the DOM tree and walk down that). – alroc Jul 14 '17 at 01:45
  • In fact, just the portion I'm working on, only the part I need to extract data from. – DevHawk Jul 14 '17 at 15:33

1 Answer

I have to make some assumptions here to provide you with an answer. I'm assuming that you are working with a complete HTML document. If you are not, please update your requirements, as it might be easier to just treat your document as XML.

Retrieve that document with Invoke-WebRequest:

$html = invoke-webrequest "http://www.yourpath.here"

Now I am going to assume that the page contains only one table. The line below gets the first table in the returned document. If you do not want the first table, you can either change the index or use a Where-Object filter to select the table you want based on some criteria.

$table = $html.parsedHtml.getElementsByTagName("table")[0]

Because I don't know the entire contents of your table, I'm going to assume that "Chile" does not appear anywhere else inside it. This needs to be true because I am taking the simple approach of matching against the whole text of each TR and ignoring the markup inside it. If that is not the case, you will need additional logic to check that you are only reading the TH inside the TR.

$TR = $table.getElementsByTagName("tr") | where { $_.innerText -like "*Chile*" }
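If "Chile" could also show up in a data cell, the extra check mentioned above might look roughly like this. This is only a sketch: to keep it runnable without a web request, it parses a small made-up table with the [xml] cast instead of ParsedHtml, and the table contents and the variable names `$sample`, `$loose` and `$strict` are my own assumptions.

```powershell
# Hypothetical two-row table; "Chile" also appears in a data cell of the second row.
$sample = [xml]@'
<table>
<tr><th>Chile</th><td>2</td><td>14,19</td></tr>
<tr><th>Peru</th><td>Chile</td><td>7</td></tr>
</table>
'@

# Matching on the whole row text picks up both rows.
$loose = @($sample.SelectNodes('//tr') | Where-Object { $_.InnerText -like '*Chile*' })

# Matching on the header cell only picks up the intended row.
$strict = @($sample.SelectNodes('//tr') | Where-Object {
    $_.SelectSingleNode('th').InnerText.Trim() -eq 'Chile'
})

# Print the data cells of the row whose header is exactly "Chile".
$strict[0].SelectNodes('td') | ForEach-Object { $_.InnerText }
```

The sketch assumes every row has a TH; a row without one would need a null check before calling Trim().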

Next we can grab all of the TD elements:

$TD = $TR.getElementsByTagName("td")

At this point you have all of the TD objects in an array. You can dump their contents with:

$TD | foreach { $_.innerText }

Oddly, just doing $TD.innerText will not yield this output.
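If you only have the fragment from the question rather than a full page, the XML route mentioned at the top of this answer could look something like the sketch below. It assumes the fragment is well-formed XML, which the posted row appears to be. The shortened sample row and the names `$fragment`, `$row`, `$country`, `$values` and `$lines` are made up for illustration; `$outfile` is the variable from the question.

```powershell
# Hypothetical, shortened copy of the row from the question.
$fragment = @'
<tr class="ms-rteTableOddRow-1"><th class="ms-rteTableFirstCol-1"><div><b><span style="color&#58;black;">Chile</span></b></div></th>
<td class="ms-rteTableOddCol-1">2</td>
<td class="ms-rteTableEvenCol-1">&#160;</td>
<td class="ms-rteTableOddCol-1">14,19</td>
<td class="ms-rteTableEvenCol-1"><div><span style="color&#58;black;">26</span></div></td></tr>
'@

# The [xml] cast parses the string; it throws if the markup is not well-formed.
$row = [xml]$fragment

# InnerText concatenates all nested text, so the wrapping div/span tags do not matter.
$country = $row.SelectSingleNode('//th').InnerText.Trim()
$values  = @($row.SelectNodes('//td') | ForEach-Object { $_.InnerText.Trim() })

# One entry per cell; cells holding only &#160; end up as empty strings after Trim().
$lines = @($country) + $values
$lines

# To write the result to a file instead:
# $lines | Set-Content -Path $outfile
```

Because this reads the text of every cell, it does not depend on the font color or on whether a value is wrapped in a span.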

Ty Savercool
  • Hi Ty, thank you very much for explaining the logic. Today I found some time to review it, and it works great on version 5 but not on version 4, which is what I currently have on the server. I'm trying to figure out how to make it work on version 4 now. – DevHawk Jul 25 '17 at 20:46
  • You are referring to your PowerShell version, correct? I know Invoke-WebRequest has been around since 2.0 but I'll check tomorrow if the parsedHTML functionality was introduced later and if I can find an alternative. – Ty Savercool Jul 25 '17 at 21:45
  • Yep! Correct, a PS version issue. The main goal for me is to be able to get the table content from the HTML file and work with the data. I'm using a complete HTML file. – DevHawk Jul 26 '17 at 16:36
  • Correction, Invoke-WebRequest was introduced in 3.0. That being said, I tested this in 3 and 4 and it is working for me. Can you post the details of your updated code and the error message(s)? – Ty Savercool Jul 26 '17 at 17:25
  • I believe I've found what the problem is: my server network does not allow me to download content from web pages, so it gets blocked every time I try to access it from the server. From my local machine it's fine, but I've managed to download it using another approach and it's working well now! :) Thanks a lot, Ty Savercool, for your patience and help, you rock! – DevHawk Jul 27 '17 at 00:04