
So let's say I have the following base URL: http://example.com/Stuff/preview/v/{id}/fl/1/t/. There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.

I can use XPath to match just a part of the template, //a[contains(@href, 'preview/v')], or just use regexes, but I was wondering if anyone knew a more elegant way to match the entire template using XPath and regexes, so it's fast and the matches are definitely correct.
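To illustrate the two-step idea described above, here is a minimal sketch (the sample HTML and link values are made up for illustration): narrow the candidates with a partial XPath match, then verify each candidate against the full template with a regex.

```python
import re
import lxml.html

# Made-up sample page: one link matching the full template, one decoy
# that contains "preview/v" but does not match the template.
data = ('<a href="http://example.com/Stuff/preview/v/42/fl/1/t/">yes</a>'
        '<a href="http://example.com/Stuff/preview/v/abc/fl/1/t/">no</a>')
tree = lxml.html.fromstring(data)

# Step 1: cheap XPath pre-filter on a fragment of the template.
candidates = tree.xpath("//a[contains(@href, 'preview/v')]/@href")

# Step 2: regex check against the whole template.
pattern = re.compile(r"http://example\.com/Stuff/preview/v/\d+/fl/1/t/$")
links = [href for href in candidates if pattern.match(href)]
print(links)
```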

Thanks.

Edit: I timed it on a sample page. With my internet connection and 100 trials, the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.

Also, if you have Scrapy, one can use Selectors:

  from requests import get  # assuming the requests library, as implied by get(url)
  from scrapy.selector import Selector

  data = get(url).text
  sel = Selector(text=data, type="html")
  a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href').extract()

The average time for this is also 0.467 seconds.

alecxe
Artii
1 Answer


You cannot use regexes in XPath expressions with lxml, since lxml supports XPath 1.0, and XPath 1.0 doesn't support regular expression search.

Instead, you can find all the links on a page using iterlinks(), iterate over them, and check the href attribute value:

import re
import lxml.html

data = "your html"
tree = lxml.html.fromstring(data)

pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
for element, attribute, link, pos in tree.iterlinks():
    if not pattern.match(link):
        continue
    print(link)

An alternative option would be to use BeautifulSoup html parser:

import re
from bs4 import BeautifulSoup

data = "your html"
soup = BeautifulSoup(data)

pattern = re.compile(r"http://example.com/Stuff/preview/v/\d+/fl/1/t/")
print(soup.find_all('a', {'href': pattern}))

To make BeautifulSoup parsing faster you can let it use lxml:

soup = BeautifulSoup(data, "lxml")

Also, you can make use of the SoupStrainer class, which lets you parse only specific parts of a web page instead of the whole page.
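Building on that suggestion, here is a minimal sketch of how SoupStrainer could be wired in; the sample HTML and the choice of the stdlib-backed "html.parser" are mine for illustration:

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Made-up sample page with one matching and one non-matching link.
data = ('<a href="http://example.com/Stuff/preview/v/123/fl/1/t/">yes</a>'
        '<a href="http://example.com/other">no</a>')

# Only <a> tags whose href matches the template are parsed at all.
pattern = re.compile(r"http://example\.com/Stuff/preview/v/\d+/fl/1/t/")
only_template_links = SoupStrainer("a", href=pattern)

soup = BeautifulSoup(data, "html.parser", parse_only=only_template_links)
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

Since non-matching tags are skipped during parsing itself, this can cut both parse time and memory on large pages.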

Hope that helps.

alecxe
  • This works, but I was more inclined to use XPath, because BeautifulSoup is fairly slow and I am doing this matching a very large number of times. Iterating might be faster, but I haven't tested that. – Artii Jun 23 '14 at 18:32
  • @Artii please see the update. I'm still working on the answer though. – alecxe Jun 23 '14 at 18:36
  • I timed it on a sample page. With my internet connection and 100 trials the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds. – Artii Jun 23 '14 at 18:55
  • @zx81 thanks, I see you like regular expressions and non-regex solutions :) – alecxe Jun 23 '14 at 22:35
  • I absolutely love regex, and I almost never give non-regex solutions to regex questions... And I know that it's often not the best tool... So I enjoy it when someone shows other ways! :) – zx81 Jun 23 '14 at 22:37