Regex match Improvement

Question

I have this text:

<td class="devices-user-name">devicename</td>
            <td>192.168.133.221</td>
            <td>Storage Sync</td>
            <td>10.3.3.335</td>
            <td>Active</td>
            <td>7/26/2016 8:39PM</td>
            <td class="devices-details-button"><a class="btn btn-mini" href="#settings/devices/1/239a9cd0-d6c9-4e7d-9918-0cd686a57aac">Details</a></td>

I want to capture everything between the <td> </td> tags as well as between the <td class=...> </td> tags.

What I achieved is this regex:

<td.*>(.*?)<\/td>(\n(.*<td>(.*?)<\/td>))(\n(.*<td>(.*?)<\/td>))(\n(.*<td>(.*?)<\/td>))(\n(.*<td>(.*?)<\/td>))(\n(.*<td>(.*?)<\/td>))(\n(.*<td.*href="(.*?)"))

After that I still need to exclude all the <td> matches:

$MatchResult = $Matches.GetEnumerator() | ? {$_.Value -notmatch 'td'} | Sort Name

Finally I get these results:

Name                           Value
----                           -----
1                              devicename
4                              192.168.133.221
7                              Storage Sync
10                             10.3.3.335
13                             Active
16                             7/26/2016 8:39PM
19                             #settings/devices/1/239a9cd0-d6c9-4e7d-9918-0cd686a57aac

But I'm quite sure there's a better way than duplicating the groups and then excluding the unwanted matches. I'd be happy to learn other or better techniques.

What is your suggestion?

See http://stackoverflow.com/a/11656434/3832970 for an alternative method. — Wiktor Stribiżew, Jul 27 '16 at 07:06
Concerning parsing HTML with RegEx, [please read this first](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Joram van den Boezem, Jul 27 '16 at 07:21

Martin Brandl · Answer 1 · 2016-07-27T07:18:31.933

You can use [regex]::Matches to get multiple matches (instead of using \n):

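# Read the file as one string (-Raw) so the regex sees the whole document
$content = Get-Content 'your-File' -Raw
# Group 1 captures the text between each <td ...> and </td>
[regex]::Matches($content, '<td.*?>(.+?)<\/td>') | ForEach-Object {
    $_.Groups[1].Value
}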

Regex:

<td.*?>(.+?)<\/td>

Output:

devicename
192.168.133.221
Storage Sync
10.3.3.335
Active
7/26/2016 8:39PM
<a class="btn btn-mini" href="#settings/devices/1/239a9cd0-d6c9-4e7d-9918-0cd686a57aac">Details</a>

Note: You probably want to extract the href in another step or by adjusting the regex, but your question was about catching everything between <td>...
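The "another step" mentioned in the note could look roughly like the sketch below. It is only one possible approach, not part of the original answer: the variable names ($cellPattern, $hrefPattern, $values) are made up for illustration, the sample text is the row from the question, and the href pattern assumes the attribute value is double-quoted.

```powershell
# Sample row from the question; in practice $content would come from Get-Content -Raw.
$content = @'
<td class="devices-user-name">devicename</td>
            <td>192.168.133.221</td>
            <td>Storage Sync</td>
            <td>10.3.3.335</td>
            <td>Active</td>
            <td>7/26/2016 8:39PM</td>
            <td class="devices-details-button"><a class="btn btn-mini" href="#settings/devices/1/239a9cd0-d6c9-4e7d-9918-0cd686a57aac">Details</a></td>
'@

$cellPattern = '<td.*?>(.+?)<\/td>'
$hrefPattern = 'href="([^"]*)"'   # assumption: href values are double-quoted

$values = [regex]::Matches($content, $cellPattern) | ForEach-Object {
    $cell = $_.Groups[1].Value
    $href = [regex]::Match($cell, $hrefPattern)
    if ($href.Success) {
        $href.Groups[1].Value     # cell contains a link: keep only its target
    } else {
        $cell                     # plain cell: keep the text as-is
    }
}

$values
```

With the sample row this should list the six plain cell values followed by the link target, with no filtering on 'td' afterwards.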

edited Jul 27 '16 at 07:18

answered Jul 27 '16 at 07:13

Martin Brandl


`<td[^>]*>(.+)<\/td>` For the example provided, this works in about 1/3rd the number of steps, as lazy evaluations are notoriously slow. It will work as long as each `<td>` is on its own line, as `.` won't normally consume newlines. Just depends on the source being parsed. – TemporalWolf Jul 27 '16 at 07:32
Well mentioned. If he wants to stick with regex to parse his HTML, he probably has to add `[System.Text.RegularExpressions.RegexOptions]`... – Martin Brandl Jul 27 '16 at 07:44
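A small sketch of what the two comments above are getting at, assuming the option meant is `Singleline` (the comment does not name one). The input is hypothetical: the second cell is deliberately split across two lines. The capture is kept lazy here, because a greedy `(.+)` combined with `Singleline` would run on to the last `</td>` in the text.

```powershell
# Hypothetical input: the second cell spans two lines.
$content = "<td>Active</td>`n<td class=`"x`">line one`nline two</td>"

# [^>]* skips any attributes inside the opening tag.
$pattern = '<td[^>]*>(.+?)<\/td>'

# Default options: '.' stops at a newline, so the two-line cell is not matched.
$default = [regex]::Matches($content, $pattern)

# Singleline: '.' also matches newlines, so both cells are found.
$options = [System.Text.RegularExpressions.RegexOptions]::Singleline
$singleline = [regex]::Matches($content, $pattern, $options)

$default.Count      # 1
$singleline.Count   # 2
```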
Can I use something like `'<td.*?>(.+?)<\/td>'{3}` for 3 times for example? – JustCurious Jul 28 '16 at 09:15
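On the repetition question: a quantifier such as `{3}` applies to a group inside the pattern, not to the quoted string as a whole. The sketch below is one way this might be written, not something taken from the thread. The pattern and the variable names ($pattern, $match, $cells) are illustrative, and the `\s*` is an assumption about the whitespace between cells. It relies on .NET keeping every capture of a repeated group in `Groups[1].Captures`.

```powershell
# Three cells from the question, separated by newlines and indentation.
$content = "<td>Storage Sync</td>`n            <td>10.3.3.335</td>`n            <td>Active</td>"

# (?: ... ){3} repeats the whole cell pattern exactly three times;
# \s* absorbs the whitespace between cells.
$pattern = '(?:<td[^>]*>(.+?)<\/td>\s*){3}'

$match = [regex]::Match($content, $pattern)

# .NET keeps every capture of a repeated group, not only the last one.
$cells = $match.Groups[1].Captures | ForEach-Object { $_.Value }
$cells
```

If the text contains fewer than three consecutive cells, the match simply fails, so `[regex]::Matches` with the single-cell pattern is usually the more forgiving choice.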
