So let's say I have the following base URL: http://example.com/Stuff/preview/v/{id}/fl/1/t/
There are a number of URLs with different {id}s on the page being parsed. I want to find all the links matching this template in an HTML page.
I can use XPath to match just part of the template, e.g. //a[contains(@href, "preview/v")],
or just use regexes, but I was wondering whether anyone knows a more elegant way to match the entire template using XPath and regexes together, so that it's fast and the matches are definitely correct.
Thanks.
Edit: I timed it on a sample page. With my internet connection and 100 trials, the iteration takes 0.467 seconds on average and BeautifulSoup takes 0.669 seconds.
Also, if you have Scrapy, you can use its Selectors:
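from requests import get  # assumption: get() is requests.get
from scrapy.selector import Selector
data = get(url).text
sel = Selector(text=data, type="html")
# re:test applies an EXSLT regular expression to each href (raw string keeps \d intact)
a = sel.xpath(r'//a[re:test(@href, "/Stuff/preview/v/\d+/fl/1/t/")]/@href').extract()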
The average time for this is also 0.467 seconds.
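One thing to watch: an unanchored regex test matches the pattern anywhere inside the href, so a link with extra trailing path segments would still pass. If the matches need to be "definitely correct" against the entire template, one option is to build the regex from the template itself and require a full match. Below is a rough sketch of that idea using only the standard library (html.parser standing in for the XPath step, so it is not the Scrapy approach above). The names template_to_regex, LinkCollector and find_template_links are made up for illustration, and it assumes {id} is always numeric.

```python
import re
from html.parser import HTMLParser

# Template from the question; {id} is assumed to be a run of digits.
TEMPLATE = "http://example.com/Stuff/preview/v/{id}/fl/1/t/"


def template_to_regex(template):
    """Escape the literal parts of the template and let {id} match digits."""
    literal_parts = template.split("{id}")
    return re.compile(r"(\d+)".join(re.escape(part) for part in literal_parts))


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def find_template_links(html, template=TEMPLATE):
    pattern = template_to_regex(template)
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    # fullmatch rejects hrefs that only contain the template as a substring.
    return [href for href in collector.hrefs if pattern.fullmatch(href)]


sample = """
<a href="http://example.com/Stuff/preview/v/123/fl/1/t/">ok</a>
<a href="http://example.com/Stuff/preview/v/abc/fl/1/t/">non-numeric id</a>
<a href="http://example.com/Stuff/preview/v/456/fl/1/t/extra">trailing path</a>
"""
print(find_template_links(sample))
```

On the sample above only the first link should survive. The same anchoring idea should carry over to the XPath version by putting ^ and $ around the pattern passed to re:test, though I have not timed that variant.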