I'm trying to pull data from an HTML source, to create a list of books and authors.
As each book has its own HTML page, I'm using a regex to get the information I require.
With the following code, $regexp successfully returns the book title (e.g. 'My First Cook Book'):
>> $regexp = '<title>listing for - (?<title>.*) \[.*\]'
>> $name = ($url | select-string $regexp -allmatches).matches
>> $name.groups[1].value
My First Cook Book
However, I cannot retrieve the Author using a similar method, and I'm assuming it must be due to the code being spread across multiple lines, or to the inclusion of non-textual characters.
>> $regex1 = '<td class="tboldc" width="170"> Author:</td>
>> <td class="tnormg" width="*"> (?<author>.*)</td>'
>> $name1 = ($url | select-string $regex1 -allmatches).matches
>> $name1.groups[1].value
Cannot index into a null array.
At line:1 char:1
+ $name1.groups[1].value
I would like to retrieve the author's name (in this case 'D Atherton')
Where am I going wrong?
I've tried placing double quotes around the & characters ("&") and moving my (?<author>.*) group to different positions in the pattern (which gets varying results, but only when a single line of the source is used). I'm assuming I need both lines, so that the regex can anchor on the ' Author:' cell and take the desired result from the second line.
[Solved]
Thank you to all who suggested alternative ways of solving this one. I think I've finally solved it while sticking with PowerShell regex.
I replaced the $regex1 line with
$regex1 = '(?s) Author:<\/td>(?<author>.*?)<\/td'
And used the following line to give me my required Author name as a result:
$author = $name1.groups[1].value -creplace '^[^\;]*\;', ''
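For anyone following along, here is a minimal sketch of those two steps run against a made-up fragment. The $html sample is hypothetical: I'm assuming the two cells sit on separate lines and that each value is preceded by an &nbsp; entity, which is what the ';' in the -creplace pattern strips. The leading space in the pattern is dropped here to fit the sample, and the capture is read by name instead of by index.

```powershell
# Hypothetical fragment standing in for the real page; the &nbsp; entities
# and the line break between the two cells are assumptions.
$html = @'
<td class="tboldc" width="170">&nbsp;Author:</td>
<td class="tnormg" width="*">&nbsp;D Atherton</td>
'@

# (?s) lets . match newlines, so the lazy group can span the line break.
$regex1 = '(?s)Author:<\/td>(?<author>.*?)<\/td'

# $html is a single string, so Select-String matches against all of it at once.
$name1 = ($html | Select-String $regex1 -AllMatches).Matches

# The named group still holds the second cell's opening tag and the &nbsp;.
$raw = $name1[0].Groups['author'].Value

# Drop everything up to and including the first ';' (the end of &nbsp;).
$author = $raw -creplace '^[^\;]*\;', ''
$author
```

With this sample, $author should come out as D Atherton. If the real page has no entity before the name, the -creplace step finds no ';' and leaves the opening tag in place, so the clean-up pattern would need adjusting.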
Phew!