
I'm trying to pull data from an HTML source, to create a list of books and authors.

As each book has its own HTML page, I'm using a regex to extract the information I need.

With the following code, I can successfully use $regexp to return the book title (e.g. 'My First Cook Book'):

>> $regexp = '<title>listing for - (?<title>.*) \[.*\]'
>> $name = ($url | select-string $regexp -allmatches).matches
>> $name.groups[1].value
My First Cook Book

However, I cannot retrieve the author using a similar method. I'm assuming this is because the HTML is spread across multiple lines, or because it includes non-textual characters.

>> $regex1 = '<td class="tboldc" width="170">&nbsp; Author:</td>
>> <td class="tnormg" width="*">&nbsp;(?<author>.*)</td>'

>> $name1 = ($url | select-string $regex1 -allmatches).matches
>> $name1.groups[1].value
Cannot index into a null array.
At line:1 char:1
+ $name1.groups[1].value     

I would like to retrieve the author's name (in this case 'D Atherton')

Where am I going wrong?

I've tried placing double quotes around the & characters ( "&" ) and placing my (?<author>.*) at different positions in the pattern. That gives varying results, but only when the pattern covers a single line of the source. [I'm assuming I need both lines, so that the regex can anchor on the ' Author:' part in the first line and capture the desired result from the second.]
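One detail of the pattern is worth checking separately from the multi-line question: `*` is a regex quantifier, so `width="*"` in the second line of $regex1 does not match a literal asterisk unless it is escaped. A minimal sketch, assuming the page contains a line like the hypothetical `$line` below (reconstructed from the pattern, not copied from the real page):

```powershell
# Hypothetical sample line, reconstructed from the pattern in the question
$line = '<td class="tnormg" width="*">&nbsp;D Atherton</td>'

# Unescaped: "* means "zero or more double quotes", so the literal * is never matched
$unescaped = $line -match '<td class="tnormg" width="*">&nbsp;(?<author>.*)</td>'

# Escaped: \* matches the literal asterisk
$escaped = $line -match '<td class="tnormg" width="\*">&nbsp;(?<author>.*)</td>'

# $Matches is only populated by the successful match
$author = $Matches['author']

$unescaped   # False
$escaped     # True
$author      # D Atherton
```

This only covers the single-line case; matching across the line break still needs the input to be one multi-line string.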

[Solved]

Thank you to all who suggested alternative ways of solving this one. I can finally say, however, that I think I've solved it while sticking with PowerShell regex.

I replaced the $regex1 line with

$regex1 = '(?s) Author:<\/td>(?<author>.*?)<\/td' 

And used the following line to give me my required Author name as a result:

$author = $name1.groups[1].value -creplace '^[^\;]*\;', '' 

Phew!
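For reference, a minimal self-contained sketch of the two lines above, run against a hypothetical two-line sample instead of the real page (`$sample` and its contents are assumptions based on the pattern in the question):

```powershell
# Hypothetical stand-in for the page content held in $url
$sample = @'
<td class="tboldc" width="170">&nbsp; Author:</td>
<td class="tnormg" width="*">&nbsp;D Atherton</td>
'@

# (?s) lets . match newlines, so the lazy .*? can run from the first </td> to the next one
$regex1 = '(?s) Author:<\/td>(?<author>.*?)<\/td'
$name1  = ($sample | Select-String $regex1 -AllMatches).Matches

# The capture still contains the second <td ...>&nbsp; prefix;
# -creplace strips everything up to and including the first semicolon
$author = $name1.Groups[1].Value -creplace '^[^\;]*\;', ''
$author   # D Atherton
```

The -creplace step relies on `&nbsp;` being the first semicolon in the captured text, so it would need adjusting if the cell contained other entities before the name.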

CJ Perry
  • Why would you use regex instead of a DOM parser? – Santiago Squarzon Mar 25 '23 at 16:57
  • See e.g.: [Powershell regex multiple match per line](https://stackoverflow.com/a/72507549/1701026) – iRon Mar 25 '23 at 17:46
  • You're not showing how `$url` is populated. Unless it is a _multi-line_ string, you won't be able to match _across lines_. – mklement0 Mar 25 '23 at 21:35
  • `$url = Invoke-RestMethod -uri "https://samplewebpage.com/bookid"`. The other regex lines work OK, but this one has me stumped! – CJ Perry Mar 25 '23 at 22:31
  • Regex stands for Regular Expression, and HTML is not a regular language. Once the HTML is nested, a regex will not work because of the recursion. Use an HTML parser library instead. – jdweng Mar 26 '23 at 08:50
  • Thank you to all who suggested alternative ways of solving this one. I can finally say, however, that I think I've solved it while sticking with PowerShell regex. I replaced the $regex1 line with `'(?s) Author:<\/td>(?<author>.*?)<\/td'` and used the following line to give me the required author name: `$author = $name1.groups[1].value -creplace '^[^\;]*\;', ''` Phew! – CJ Perry Mar 26 '23 at 13:42
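
The point raised in the comments about multi-line strings can be illustrated with a small sketch: Select-String tests each input string separately, so a pattern that spans a line break can only match when the HTML arrives as one string. The sample lines below are hypothetical, reconstructed from the pattern in the question.

```powershell
# Hypothetical sample: the same HTML as an array of lines and as one string
$lines = @(
    '<td class="tboldc" width="170">&nbsp; Author:</td>'
    '<td class="tnormg" width="*">&nbsp;D Atherton</td>'
)
$text = $lines -join "`n"

$pattern = '(?s) Author:</td>(?<author>.*?)</td'

# Each array element is matched on its own, so nothing spans the line break
$perLine = @($lines | Select-String -Pattern $pattern).Count

# A single multi-line string lets (?s) carry the match across the newline
$whole = @($text | Select-String -Pattern $pattern).Count

$perLine   # 0
$whole     # 1
```

When the source is a file, `Get-Content -Raw` returns the content as a single string, whereas plain `Get-Content` returns an array of lines.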
