My end goal is to get the last part of the href, i.e. the company or brand name.

Using the examples below, how can I select the href that contains the string 'brand' or 'business'?

<a class="nocolorchange" href="/guides/brand/6429-ArmHammer">
<a class="nocolorchange" href="/guides/business/5928-ChurchDwightCoInc">

I have tried:

//a[matches(@href, 'business')] # or brand

with no luck. Thanks.

sophocles

2 Answers

2

First, I don't know whether you are using the Scrapy shell, but it is useful for testing this kind of thing.

Since matches() is only available in XPath 2.0, and Scrapy's selectors support only XPath 1.0, you can try:

//a[starts-with(@href, '/guides/business/')]
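To check the expression outside a spider, here is a minimal sketch that uses lxml as a stand-in for Scrapy's selectors (which are built on top of it). The two links come from the question; the third link and all link text are hypothetical additions for the demo. It also shows contains(), another XPath 1.0 function, for matching either path segment.

```python
from lxml import html

# Markup from the question, plus one hypothetical non-matching link.
SAMPLE = """
<div>
  <a class="nocolorchange" href="/guides/brand/6429-ArmHammer">brand link</a>
  <a class="nocolorchange" href="/guides/business/5928-ChurchDwightCoInc">business link</a>
  <a class="nocolorchange" href="/guides/other/1-Unrelated">other link</a>
</div>
"""

tree = html.fromstring(SAMPLE)

# starts-with() anchors the match at the beginning of the attribute value.
business = tree.xpath("//a[starts-with(@href, '/guides/business/')]/@href")

# contains() matches the substring anywhere in the attribute value.
either = tree.xpath(
    "//a[contains(@href, '/brand/') or contains(@href, '/business/')]/@href"
)

print(business)
print(either)
```

Matching on '/brand/' with the surrounding slashes, rather than just 'brand', avoids picking up unrelated URLs that merely contain the word somewhere else.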
MetallimaX
1
hrefs_xpath = "//a/@href[contains(., 'business') or contains(., 'brand')]"

# with scrapy, you extract this xpath pattern
hrefs = response.xpath(hrefs_xpath).extract()

# then extract company names (the last path segment of each href)
companies = [href.rpartition('/')[-1] for href in hrefs]
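
Note that rpartition('/') returns the whole last segment, e.g. '6429-ArmHammer'. If only the name is wanted, something like the sketch below might help. It assumes the leading digits are an ID and not part of the name, which the question does not confirm; company_name is a hypothetical helper, not part of Scrapy.

```python
# hrefs as they would come back from the XPath above (values from the question)
hrefs = [
    "/guides/brand/6429-ArmHammer",
    "/guides/business/5928-ChurchDwightCoInc",
]


def company_name(href):
    """Return the last path segment, without a leading numeric ID if present."""
    last_segment = href.rstrip("/").rpartition("/")[-1]
    prefix, sep, rest = last_segment.partition("-")
    # Assumption: a purely numeric prefix such as "6429-" is an ID, so drop it.
    if sep and prefix.isdigit():
        return rest
    return last_segment


names = [company_name(href) for href in hrefs]
print(names)  # ['ArmHammer', 'ChurchDwightCoInc']
```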