Regex for certain HTML links

Question

EDIT: You are right, bs4 is much better and I have started using it. It's much more intuitive and actually finds the links, although I'm still struggling at points, haha. Thank you all very much.

I had a look and this doesn't seem to be covered in the other posts.

I am pretty sure I can use regex for this, as the 15 links in this HTML page are pretty well defined, I think. It's an Amazon page with 15 product links and I want those links. The input is this:

<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>

I have tried

import re

links = re.findall(r'^(/n/n/n/n/n/n).(")', page)

which won't work. Any thoughts?

Edit: the Python shell shows you're right: `href="\n\n\n\n\n\n\n https://www.amazon.co.uk/Sony-DSCW800-Digital-Compact-Optical/dp/B00IK01PJC/ref=zg_bs_560836_1/277-1976309-0409436\n">Sony DSCW800 Digital Compact`. Either way, `links = re.findall(r'href="(.*?)"', page)` does not seem to work, which I thought it would, given that it is just supposed to return anything between `href="` and the next `"`. – entercaspa, Jun 06 '16 at 09:09
But for future reference: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Take_Care_, Jun 06 '16 at 09:21
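As a sketch of the parser-based route that the linked question argues for, the snippet below collects `href` values with Python's standard-library `html.parser` instead of a regex. It is only an illustration: the `LinkCollector` class name and the one-anchor `page` string are assumptions made for this example, and bs4 (mentioned in the edit at the top) would likely express the same idea more compactly.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # strip() removes the newlines and spaces padding the URL
                self.links.append(value.strip())


# Sample modelled on the question; the real page is assumed to look similar.
page = '<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>'

parser = LinkCollector()
parser.feed(page)
links = parser.links
print(links)
```

The parser decodes the `&#10;` entity to a newline before handing over the attribute value, so a single `strip()` is enough to clean the URL.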

Andriy Ivaneyko · Answer 1 · 2016-06-06T09:10:24.443

0

Use the regexp below:

import re

s = """<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>"""

re.findall('(?<=\n\n\n\n\n\n)(.*?)"', s)

Your previous regexp anchored the match at the beginning of the string with `^` (and wrote `/n` instead of `\n`), so it could not match when the newlines sit in the middle of the string, as they do in the sample.
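The `href="(.*?)"` attempt from the question's comments probably fails for a related reason: by default `.` does not match a newline, and the attribute value here starts with several. A minimal sketch, assuming the page text looks like the two samples in the question (the closing `</a>` on the second one is added here for illustration), passes `re.DOTALL` and strips the padding afterwards:

```python
import re

# Two anchors modelled on the samples in the question (real newlines inside href).
page = (
    '<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>'
    '<a href="\n\n\n\n\n\n\n https://www.amazon.co.uk/Sony-DSCW800-Digital-Compact-Optical/dp/B00IK01PJC/ref=zg_bs_560836_1/277-1976309-0409436\n">Sony DSCW800 Digital Compact</a>'
)

# Without DOTALL, "." cannot cross the newline right after href=", so nothing matches.
no_flag = re.findall(r'href="(.*?)"', page)

# With DOTALL, "." also matches newlines; the lazy ".*?" stops at the closing quote.
raw_links = re.findall(r'href="(.*?)"', page, flags=re.DOTALL)

# Drop the literal "&#10;" entity, then the surrounding whitespace.
links = [link.replace("&#10;", "").strip() for link in raw_links]
print(links)
```

Unlike the fixed-width lookbehind, this does not depend on the exact number of newlines before each URL, which differs between the two samples (six and seven).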

edited Jun 06 '16 at 09:10

answered Jun 06 '16 at 09:07


You probably want `.*?` instead of `.*` – Andrea Corbellini Jun 06 '16 at 09:10
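To illustrate the comment above, here is a small sketch of how greedy `.*` and lazy `.*?` differ when a line holds more than one quoted attribute. The sample string is made up for this example and is not taken from the Amazon page.

```python
import re

# Hypothetical single-line input with two quoted attributes.
s = '<a href="https://example.com/a" title="first">A</a>'

# Greedy: ".*" runs to the LAST double quote on the line.
greedy = re.findall(r'href="(.*)"', s)

# Lazy: ".*?" stops at the FIRST double quote after href=".
lazy = re.findall(r'href="(.*?)"', s)

print(greedy)
print(lazy)
```

Here the greedy pattern swallows the `title` attribute as well, while the lazy one returns only the URL.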
