I am trying to retrieve the data from the list on the left of this website (http://www.legal500.com/c/london-bar).

The data is structured like this:

<ul class="sections nice_sel">
    ...
    <li class="">
        <a href="/c/london-bar/set-overviews-england-and-wales">IMPORTANT_DATA</a>
    </li>
    ...
</ul>

I need to retrieve the inner HTML (IMPORTANT_DATA) of each item in the list.

I tried following this question to get the code:

$url = "http://www.legal500.com/c/london-bar"
$html = Invoke-WebRequest $url


$thelist = $html.ParsedHtml.body.getElementsByTagName('ul') | 
    Where {$_.getAttributeNode('class').Value -eq 'sections nice_sel'}

But I'm not sure how to get the child (<li>) elements from this.

I also considered using XPath, but I can't seem to pass my $html variable into -Path:

Select-XML -Path $html -XPath "//*[contains(@class, 'sections nice_sel')]"

Select-XML : Cannot find drive. A drive with the name 'PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http' does not exist.
At line:1 char:1
+ Select-XML -Path $html -XPath "//*[contains(@class, 'Test')]"
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (html ...rict//EN" "http:String) [Select-Xml], DriveNotFoundException
    + FullyQualifiedErrorId : DriveNotFound,Microsoft.PowerShell.Commands.SelectXmlCommand
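
For reference, Select-Xml also has a -Content parameter that takes the markup as a string rather than a file path (with the live page, that string would come from $html.Content). Below is a minimal sketch, assuming the markup is well-formed XML; real pages often are not, in which case the XML parser will reject them. The $fragment string, including its second list item, is made-up sample data and not taken from the site:

```powershell
# Hypothetical, well-formed sample markup standing in for the real page
$fragment = @'
<ul class="sections nice_sel">
    <li class=""><a href="/c/london-bar/set-overviews-england-and-wales">IMPORTANT_DATA</a></li>
    <li class=""><a href="/c/london-bar/hypothetical-second-section">MORE_DATA</a></li>
</ul>
'@

# -Content accepts the XML text itself; -Path would expect a file path
$linkText = Select-Xml -Content $fragment -XPath "//ul[@class='sections nice_sel']/li/a" |
    ForEach-Object { $_.Node.InnerText }

# Print the text of each matched <a> element
$linkText
```

Each object returned by Select-Xml wraps the matched element in its Node property, so InnerText gives the text inside the corresponding a element.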


I have also tried:

$url = "http://www.legal500.com/c/london-bar"
$html = Invoke-WebRequest $url

$thelist = $html.ParsedHtml.body.getElementsByTagName('a') | 
    Where {$_.getAttributeNode('href').Value -contains '/c/london-bar/'}

But for some reason this returns nothing ($thelist is empty).
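
A likely cause is the operator: -contains checks whether a collection holds an exact element and does not do substring matching. Applied to a single string, it is only true when the whole string equals the right-hand value. A wildcard comparison with -like is one way to test for a substring. Here is a small sketch, using made-up sample values in $hrefs rather than the real page data:

```powershell
# Hypothetical sample values standing in for the href attributes on the page
$hrefs = @(
    '/c/london-bar/set-overviews-england-and-wales',
    '/c/london-bar',
    '/about-us'
)

# -contains tests membership, so a single string only "contains" itself
$containsResult = '/c/london-bar/set-overviews-england-and-wales' -contains '/c/london-bar/'

# -like does wildcard matching, which is what a substring test needs
$sectionLinks = @($hrefs | Where-Object { $_ -like '*/c/london-bar/*' })

"contains result: $containsResult"
"like matches:    $($sectionLinks -join ', ')"
```

Only the first sample value includes the trailing slash after london-bar, so it is the only one the -like filter keeps.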

Bassie
  • The parameter `-Path` specifies the path and file names of the XML files. Not the content. – vonPryz Jun 03 '16 at 09:31
  • @vonPryz Yes I see that now - after saving the file as `html` I tried again with the filepath, but got a `Select-XML : The file 'C:\users\Desktop\mylisthtml.html' can not be read: Name cannot begin with the '' character, hexadecimal value 0x0D` – Bassie Jun 03 '16 at 09:34
  • Take a look: `$thelist | ForEach-Object {$_.InnerHtml ; pause}` – JosefZ Jun 03 '16 at 09:37
  • @JosefZ I managed to get that, but then I am left with a long string which is HTML, but it doesn't let me perform any HTML operations on it (e.g. .InnerHtml) – Bassie Jun 03 '16 at 09:40