
I'm doing a little web scraping with PowerShell. One element on the page has markup like this:

<h1 class="">1001 Nights&nbsp;<span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span>            </h1>

I want to extract only this text:

1001 Nights

but not this text:

<span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span>

The CSS selector for that element on the site is something like:

 "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"

After some searching on Stack Overflow, I found the following code for the job:

$Result =  Invoke-WebRequest -Uri "https://www.imdb.com/title/tt0062940/?ref_=ttls_li_tt"
$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
$NodeList = $Result.ParsedHtml.querySelectorAll( $movieTitleSelector)
$PsNodeList = @()
for ($i = 0; $i -lt $NodeList.Length; $i++) { 
   $PsNodeList += $NodeList.item($i)
}
$PsNodeList | ForEach-Object {
   $_.InnerText
}

And the result is:

1001 Nights (1968)

Here "1001 Nights" is the movie title and "1968" is the release year, which sits inside the <span></span>. I want only the title, not the release year. I found some code on Stack Overflow that says I can select only the text in the <h1> tag, and not the text inside the <span>, by changing the selector above to:

$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1 :not(span)"

But when I run the code, it throws this error:

Invalid argument.
At line:1 char:1
+ $NodeList = $Result.ParsedHtml.querySelectorAll( "#title-overview-wi ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], ArgumentException
    + FullyQualifiedErrorId : System.ArgumentException

I think the error occurs because of the colon in the $movieTitleSelector string, but I'm not sure. Can anyone tell me how to get the title text in the <h1> element without the text inside the <span> tag? Thank you.

1 Answer


Why not just remove the year, or whatever string you want, from the extracted text with a regex? The -replace belongs on the text you get back, not on the selector string.

'1001 Nights (1968)' -replace '\s\(\d{4}\)'
<#
# Results

1001 Nights
#>
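
As a minimal sketch of the same idea, the replace can be wrapped in a small helper. Remove-TitleYear is a hypothetical name used only for illustration, not a built-in cmdlet. The pattern is anchored to the end of the string so that only a trailing year in parentheses is removed, and \s also covers the non-breaking space that &nbsp; becomes in the extracted text.

```powershell
# Hypothetical helper (not a built-in cmdlet): strips a trailing "(YYYY)" from a title.
function Remove-TitleYear {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string] $Title
    )
    process {
        # \s* matches ordinary spaces and the non-breaking space (U+00A0);
        # $ anchors the match to the end so only a trailing year is removed.
        ($Title -replace '\s*\(\d{4}\)\s*$').Trim()
    }
}

# Plain space before the year
'1001 Nights (1968)' | Remove-TitleYear

# Non-breaking space before the year, as produced by &nbsp;
"1001 Nights$([char]0x00A0)(1968)" | Remove-TitleYear
```

Both calls should print 1001 Nights. A title with no trailing year passes through unchanged apart from trimming.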

Update

Based on your response, try this:

$Result =  Invoke-WebRequest -Uri "https://www.imdb.com/title/tt0062940/?ref_=ttls_li_tt"
$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1"
$NodeList = $Result.ParsedHtml.querySelectorAll( $movieTitleSelector)

$PsNodeList = @()

for ($i = 0; $i -lt $NodeList.Length; $i++) { 
   $PsNodeList += $NodeList.item($i)
}
$PsNodeList | 
ForEach-Object {
   $_.InnerText -replace '\s\(\d{4}\)' 
}
postanote
  • Yes, that's an option, but I hope there's a way to do it with PowerShell. I'd like to learn how it works. – mymicrosoftmylife Mar 03 '20 at 04:12
  • But regex is a first-class citizen in any PowerShell string work. For example, Select-String is just a wrapper for regex string matching. I tweaked my answer based on your response; this kind of string manipulation is also why the -replace operator exists in PowerShell. – postanote Mar 03 '20 at 04:27
  • OK, I know a little about regular expressions, but I'm wondering why my CSS selector does not work, this one: `$movieTitleSelector = "#title-overview-widget > div.vital > div.title_block > div > div.titleBar > div.title_wrapper > h1 :not(span)"`. I hope you have the answer. – mymicrosoftmylife Mar 03 '20 at 04:41
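
On the selector question in the last comment: a CSS selector always matches elements, never bare text, so "h1 :not(span)" could at best return the descendant elements of the <h1> that are not spans (here the <a>), not the title text. A likely cause of the "Invalid argument" error, which I have not confirmed for this page, is that the engine behind ParsedHtml does not accept CSS3 selectors such as :not() in the document mode it runs in. The title lives in a text node directly under the <h1>, so one option is to keep only those text nodes. The sketch below shows the idea offline on the markup from the question; treating the fragment as XML, and writing &nbsp; as &#160; so that it parses, are assumptions made only for the demonstration.

```powershell
# Markup copied from the question; &nbsp; is written as &#160; so the fragment is well-formed XML.
# Assumption: parsing this one fragment as XML is for demonstration only; real pages rarely parse as XML.
$fragment = '<h1 class="">1001 Nights&#160;<span id="titleYear">(<a href="/year/1968/?ref_=tt_ov_inf">1968</a>)</span>            </h1>'
$h1 = ([xml]$fragment).DocumentElement

# Keep only the text nodes that are direct children of <h1>;
# the <span> is an element node, so it and everything inside it are skipped.
$textNodes = $h1.ChildNodes | Where-Object { $_.NodeType -eq [System.Xml.XmlNodeType]::Text }

# Join the pieces and trim surrounding whitespace, including the non-breaking space.
$title = (-join ($textNodes | ForEach-Object { $_.Value })).Trim()
$title
```

This should print 1001 Nights. With ParsedHtml the equivalent would be to loop over the childNodes of the <h1> element and keep the nodes whose nodeType is 3 (text), though I have not tested that against the live page.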