I've got the following code:

<tr class="even">
            <td>
                Title1
            </td>
            <td>
                Name1
            </td>
            <td>
                Email1
            </td>
            <td>
                Postcode1
            </td>

I want to use a regex to output the data between the tags, like so:

Title1 Name1 Email1 Postcode1 Title2 Name2 Email2 Postcode2 ...

Hoyesic

3 Answers

You shouldn't use a regex to parse HTML; use an HTML parser instead.

Anyway, if you really want a regex you can use this one:

>\s+<|>\s*(.*?)\s*<

Match information:

MATCH 1
1.  [51-57]     `Title1`
MATCH 2
1.  [109-114]   `Name1`
MATCH 3
1.  [166-172]   `Email1`
MATCH 4
1.  [224-233]   `Postcode1`
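
To apply this pattern from PowerShell, one option is to call the .NET `[regex]::Matches` method and keep only the matches where capture group 1 holds text. This is a minimal sketch rather than a definitive implementation: the shortened sample fragment and the variable names `$html`, `$pattern` and `$values` are assumptions made for illustration.

```powershell
# Shortened sample input (assumption: same shape as the HTML in the question)
$html = @'
<tr class="even">
    <td>
        Title1
    </td>
    <td>
        Name1
    </td>
</tr>
'@

# The first alternative swallows whitespace-only gaps between tags,
# so only the second alternative fills capture group 1.
$pattern = '>\s+<|>\s*(.*?)\s*<'

# Keep only the matches whose group 1 is non-empty, then take its value
$values = [regex]::Matches($html, $pattern) |
    Where-Object { $_.Groups[1].Value } |
    ForEach-Object { $_.Groups[1].Value }

# Join the cell values with single spaces
$values -join ' '
```

For the two-cell fragment above this should yield `Title1 Name1`.
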
Federico Piazza

This should strip out the tags themselves, along with the whitespace around them, and output the remaining text space separated:

$text = 
@'
<tr class="even">
            <td>
                Title1
            </td>
            <td>
                Name1
            </td>
            <td>
                Email1
            </td>
            <td>
                Postcode1
            </td>
'@

$text -split '\s*<.+?>\s*' -match '\S' -as [string]

Title1 Name1 Email1 Postcode1
mjolinor

Don't use a regex. HTML isn't a regular language, so it can't be properly parsed with a regex. It will succeed most of the time, but other times will fail. Spectacularly.

Use the Internet Explorer COM object to read your HTML from a file:

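$ie = New-Object -ComObject "InternetExplorer.Application"
$ie.Visible = $false
$ie.Navigate("F:\BuildOutput\rt.html")
# Navigate() returns immediately, so wait until the page has finished loading
# (ReadyState 4 means "complete")
while ($ie.Busy -or $ie.ReadyState -ne 4) { Start-Sleep -Milliseconds 100 }
$document = $ie.Document
# This will return all the tables
$document.getElementsByTagName('table')

# This will return a table with a specific ID
$document.getElementById('employees')

# Close the hidden browser instance when done
$ie.Quit()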

Here's the MSDN reference for the document class.
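
If Internet Explorer isn't available, and only if the fragment happens to be well-formed XML (every tag closed, attributes quoted), PowerShell's built-in `[xml]` cast can stand in as a parser. Note that this is a different technique from the COM approach above, and real-world HTML often won't load this way. The sketch below assumes a closed `</tr>` tag, which the fragment in the question lacks; the names `$fragment`, `$row` and `$cells` are illustrative.

```powershell
# Assumption: the fragment is well-formed XML, including the closing </tr>
$fragment = @'
<tr class="even">
    <td>
        Title1
    </td>
    <td>
        Name1
    </td>
</tr>
'@

# Parse the fragment into an XmlDocument
$row = [xml]$fragment

# Select every <td> element and trim the whitespace around its text
$cells = $row.SelectNodes('//td') | ForEach-Object { $_.InnerText.Trim() }

# Join the cell values with single spaces
$cells -join ' '
```

For this fragment the result should be `Title1 Name1`. If the cast throws, the input isn't valid XML and a real HTML parser is the safer route.
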

Aaron Jensen