How to extract certain data from HTML using RegEx?

Question

I've got the following code:

<tr class="even">
            <td>
                Title1
            </td>
            <td>
                Name1
            </td>
            <td>
                Email1
            </td>
            <td>
                Postcode1
            </td>

I want to use RegEx to output the data between the tags like so:

Title1 Name1 Email1 Postcode1 Title2 Name2 Email2 Postcode2 ...

[Dare I say it?](http://stackoverflow.com/a/1732454/102937) – Robert Harvey, Sep 03 '14 at 15:21

score 1 · Answer 1 · answered Sep 03 '14 at 15:23

You shouldn't use a regex to parse HTML; use an HTML parser instead.

Anyway, if you really want a regex you can use this one:

>\s+<|>\s*(.*?)\s*<

Match information:

MATCH 1
1.  [51-57] `Title1`
MATCH 2
1.  [109-114] `Name1`
MATCH 3
1.  [166-172] `Email1`
MATCH 4
1.  [224-233] `Postcode1`
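For what it's worth, here is a minimal sketch of how that pattern could be applied from PowerShell, the language the other answers use. The first alternative (`>\s+<`) just swallows whitespace-only gaps between tags, so only matches where group 1 holds text are kept. The variable names (`$html`, `$pattern`, `$values`) are mine, and this assumes simple cells with no nested tags:

```powershell
# Sample input mirroring the markup from the question
$html = @'
<tr class="even">
    <td>
        Title1
    </td>
    <td>
        Name1
    </td>
    <td>
        Email1
    </td>
    <td>
        Postcode1
    </td>
'@

# Alternative 1 consumes whitespace-only gaps between tags (no capture);
# alternative 2 captures the trimmed text between two tags in group 1
$pattern = '>\s+<|>\s*(.*?)\s*<'

# Keep only the matches whose group 1 is non-empty
$values = [regex]::Matches($html, $pattern) |
    Where-Object { $_.Groups[1].Value } |
    ForEach-Object { $_.Groups[1].Value }

# Join the captured cell texts with single spaces
$values -join ' '
```

With the sample above this should print `Title1 Name1 Email1 Postcode1`. It is likely to break as soon as a cell contains markup of its own, which is the usual argument for a real parser.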

score 1 · Answer 2 · answered Sep 03 '14 at 15:53

This should strip out the tags themselves (along with the whitespace around them) and output the remaining text, space-separated:

$text = @'
<tr class="even">
            <td>
                Title1
            </td>
            <td>
                Name1
            </td>
            <td>
                Email1
            </td>
            <td>
                Postcode1
            </td>
'@

$text -split '\s*<.+?>\s*' -match '\S' -as [string]

Title1 Name1 Email1 Postcode1

score 0 · Answer 3 · edited May 23 '17 at 12:12

Don't use a regex. HTML isn't a regular language, so it can't be properly parsed with a regex. It will succeed most of the time, but other times will fail. Spectacularly.

Use the Internet Explorer COM object to read your HTML from a file:

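$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $false
$ie.Navigate("F:\BuildOutput\rt.html")
# Navigate() returns immediately, so wait until the page has finished loading
while ($ie.Busy) { Start-Sleep -Milliseconds 100 }
$document = $ie.Document

# This will return all the tables
$document.getElementsByTagName('table')

# This will return the element with a specific ID (here, a table)
$document.getElementById('employees')

# Close the hidden browser instance when finished
$ie.Quit()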

Here's the MSDN reference for the document class.
