I'm doing a little web scraping with PowerShell. The page contains an element like this:
<h1 class="">1001 Nights <span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span> </h1>
I want to extract only this text:
1001 Nights
but not this text:
<span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span>
The CSS selector for this element on the site is something like:
"#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
After some searching on Stack Overflow, I found the following code for the job:
$Result = Invoke-WebRequest -Uri "https://www.imdb.com/title/tt0062940/?ref_=ttls_li_tt"
$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
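# Select every node matching the selector (returns a COM node list)
$NodeList = $Result.ParsedHtml.querySelectorAll($movieTitleSelector)
# Copy the COM node list into a PowerShell array
$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) {
    $PsNodeList += $NodeList.item($i)
}
# Print the text of each matched element
$PsNodeList | ForEach-Object {
    $_.InnerText
}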
And the result is:
1001 Nights (1968)
"1001 Nights" is the movie title and "1968" is the release year, which sits inside the <span> element. I just want the title, not the release year. I found a suggestion on Stack Overflow saying I can select only the text in the <h1> tag, excluding the <span>, by changing the selector above to:
$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1 :not(span)"
But when I run the code, it throws the
Invalid argument.
At line:1 char:1
+ $NodeList = $Result.ParsedHtml.querySelectorAll( "#title-overview-wi ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (:) [], ArgumentException
+ FullyQualifiedErrorId : System.ArgumentException
error. I think the error occurs because of the colon in the $movieTitleSelector string, but I'm not sure. Can anyone tell me how to get the title text in the <h1> element without the text inside the <span> tag?
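To make the goal concrete, here is a minimal sketch that works only on the <h1> snippet shown above, not on the live page. It assumes the snippet happens to be well-formed XML (the full page probably is not), and $html, $doc and $title are names made up for this illustration. It keeps only the text nodes that are direct children of <h1>, which is the behaviour I'd like to get from the real page:

```powershell
# The <h1> snippet from the page, copied verbatim (single-quoted so nothing is expanded)
$html = '<h1 class="">1001 Nights <span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span> </h1>'

# Assumption: this fragment is well-formed XML, so the [xml] cast can parse it
$doc = [xml]$html

# Keep only text nodes that are direct children of <h1>; the year lives inside <span>
$title = ($doc.DocumentElement.ChildNodes |
    Where-Object { $_.NodeType -eq [System.Xml.XmlNodeType]::Text } |
    ForEach-Object { $_.Value }) -join ''

# Remove the trailing space left before the <span>
$title.Trim()
```

Something similar may be possible against ParsedHtml by inspecting the child nodes of the <h1>, but that is an untested assumption.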
Thank you.