
I'm trying to use this code to scrape a web page with regex in PowerShell:

$webClient = New-Object System.Net.WebClient
$data = $webClient.downloadstring($url)
$h1Tag = [regex] '(?i)(?<=<h1 class="mb-0 mb-lg-1 svelte-jcq9ad">)([\S\s]*?)(?=<\/h1>)'
$h1 = $h1Tag.Match($data).value.trim()

Sample text to search:

 <div>
     <h1 class="mb-0 mb-lg-1 svelte-jcq9ad">AdBlock — best ad blocker</h1>
     <h2 class="mb-2 svelte-jcq9ad">Block ads and pop-ups on YouTube, Facebook, Twitch, and your favorite websites.</h2>
  </div>
</div>

The regex correctly returns AdBlock - best ad blocker when I test it on a couple of regex test sites, but in PowerShell $h1 is always empty. What am I missing?

Edit: I updated $title to $h1 in my question. $title was a typo on my part - $h1 is what I should have said.

  • You could treat that as XML if the extra `</div>` weren't there. `[xml]$xml = cat file.html; $xml.div.h1.'#text'` – js2010 Apr 06 '23 at 00:30
  • Using a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to peek and poke in a structured string might give unexpected and greedy results. As [suggested before](https://stackoverflow.com/a/71855426/1701026), it is generally [a bad idea to attempt to parse HTML with regular expressions](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/). Instead, use a dedicated HTML parser such as the [**HtmlDocument** class](https://learn.microsoft.com/en-us/dotnet/api/system.windows.forms.htmldocument?view=windowsdesktop-6.0); see also: https://stackoverflow.com/a/72507549/1701026 – iRon Apr 06 '23 at 06:01
  • Thanks for fixing the `$title` typo, but the other point still applies: with the sample input given, your code works as expected. Try to find a [mcve]. – mklement0 Apr 06 '23 at 13:33

2 Answers


Try this:

$data = '<div>
     <h1 class="mb-0 mb-lg-1 svelte-jcq9ad">AdBlock — best ad blocker</h1>
     <h2 class="mb-2 svelte-jcq9ad">Block ads and pop-ups on YouTube, Facebook, Twitch, and your favorite websites.</h2>
  </div>
</div>'

$null = $data -match("AdBlock — best ad blocker")
$h1 = $Matches.Values
StephenSo
  • This doesn't use the OP's regex and is otherwise just a reformulation of the OP's own attempt based on using the `-match` operator instead of the `Regex.Match()` .NET API. While showing `-match` as a PowerShell-idiomatic alternative is a good idea in general, it can't be expected to make a difference here. Also, it's best to avoid `(...)` (pseudo method syntax) with PowerShell operators, and it's better to use _single_-quoted strings for regexes. -> `-match 'AdBlock — best ad blocker'` – mklement0 Apr 06 '23 at 14:07

First things first:

  • It's best to use a dedicated HTML parser if possible: that enables a more robust solution than a regex-based one, and regex-based HTML extraction is invariably brittle - see iRon's comment on the question.
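
As a rough illustration of structure-aware extraction, here is a minimal sketch along the lines of js2010's comment. It is not an HTML parser: it casts the text to [xml], which only works when the markup happens to be well-formed XML, so the unmatched trailing </div> from the sample has been removed here. Real-world pages often fail this cast. The variable names $fragment and $h1Text are made up for this example.

```powershell
# Assumption: a well-formed fragment (the sample markup without its unmatched trailing </div>).
$fragment = '<div>
     <h1 class="mb-0 mb-lg-1 svelte-jcq9ad">AdBlock — best ad blocker</h1>
     <h2 class="mb-2 svelte-jcq9ad">Block ads and pop-ups on YouTube, Facebook, Twitch, and your favorite websites.</h2>
  </div>'

# The [xml] cast throws if the text is not well-formed XML.
$xml = [xml] $fragment

# Select the <h1> element by its exact class attribute value via XPath and read its text.
$h1Text = $xml.SelectSingleNode('//h1[@class="mb-0 mb-lg-1 svelte-jcq9ad"]').InnerText.Trim()

$h1Text
```

If the cast throws on the actual page, that is a sign a real HTML parser (or, failing that, the regex approach below) is needed.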

As noted, your regex does work with your sample input, implying that the sample input isn't representative of your actual problem.
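
One way to find out how the real input differs is to list every <h1> opening tag exactly as it appears in the downloaded text. The sketch below uses a made-up stand-in string for $data (the class value shown is hypothetical); in your script, $data would come from the DownloadString() call. The name $h1OpenTags is likewise just for illustration.

```powershell
# Hypothetical stand-in for the downloaded page; replace with the real $data.
$data = '<main><h1 class="mb-0 svelte-x1y2z3">AdBlock</h1></main>'

# Collect each <h1 ...> opening tag verbatim, ignoring case.
$h1OpenTags = [regex]::Matches($data, '(?i)<h1\b[^>]*>') | ForEach-Object Value

# Show what is actually in the markup.
$h1OpenTags
```

If this prints nothing, the downloaded source may not contain an <h1> at all (for example, if the page builds it with client-side script). If it prints a tag whose attributes differ from the ones hard-coded in your regex, that difference would explain the empty result.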

The following may solve your problem, because it uses a more flexible reformulation of your regex, and it also showcases the idiomatic way to perform a single regex match in PowerShell, using the -match operator:

$data = '<div>
     <h1 class="mb-0 mb-lg-1 svelte-jcq9ad">AdBlock — best ad blocker</h1>
     <h2 class="mb-2 svelte-jcq9ad">Block ads and pop-ups on YouTube, Facebook, Twitch, and your favorite websites.</h2>
  </div>
</div>'

$h1 = 
  if ($data -match '(?s)(?<=<h1\s+class=[''"]mb-0\s+mb-lg-1\s+svelte-jcq9ad[''"]\s*>)(.*?)(?=</\s*h1>)') {
    # Output the trimmed form of the match, which is stored in entry 0
    # of the automatic $Matches variable.
    $Matches[0].Trim()
  }

# Output the result.
$h1

Note:

  • -match is case-insensitive by default (as are all text-relevant PowerShell operators), so there's no need for the (?i) inline option.

  • However, the inline option (?s) was added to allow . to match newlines too, obviating the need for the [\s\S] workaround.

  • / never needs escaping (as \/) in PowerShell, given that regexes are specified as normal string literals (inside of which / has no special meaning).

  • The regex has been made more flexible with respect to whitespace and quoting: mandatory whitespace is represented as \s+, optional whitespace as \s*, and both quoting characters (' and ") are matched.

  • For a detailed explanation and the ability to experiment with the regex, see this regex101.com page.
    Note: The linked page uses C# string syntax, but the string content is identical to the one above (and both PowerShell and C# use the .NET regex engine).
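
To see the points above in action, here is a small sketch that runs the same regex against a made-up variant of the tag: uppercase tag name, single-quoted attribute, extra whitespace, and content spread over several lines. The $variant input and the variable names are hypothetical and not taken from the actual page.

```powershell
$pattern = '(?s)(?<=<h1\s+class=[''"]mb-0\s+mb-lg-1\s+svelte-jcq9ad[''"]\s*>)(.*?)(?=</\s*h1>)'

# Hypothetical variant of the markup (single-quoted here-string, taken literally).
$variant = @'
<H1  class='mb-0  mb-lg-1 svelte-jcq9ad' >
  AdBlock
</H1>
'@

# -match ignores case, and (?s) lets . span the line breaks inside the element.
$result = if ($variant -match $pattern) { $Matches[0].Trim() }

# -cmatch is the case-sensitive counterpart; it fails here because of the uppercase tag name.
$caseSensitive = $variant -cmatch $pattern

$result
$caseSensitive
```

With this input, $result should be the trimmed element text and $caseSensitive should be $false, which shows why the (?i) inline option is unnecessary with -match.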

mklement0