
EDIT: you guys are right, bs4 is much better and I have started using it. It's much more intuitive and actually finds the links, although I'm still struggling at points, haha. Thank you all very much.

I had a look and this doesn't seem to be covered in the other posts.

I am pretty sure I can use regex for this, as the 15 links in this HTML page are pretty well defined, I think. It's an Amazon page with 15 product links and I want those links. The input is this:

<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>

I have tried

import re

links = re.findall(r'^(/n/n/n/n/n/n).(")', page)

which won't work. Any thoughts?

entercaspa
  • Why are you using `/n` instead of `\n`? – Andrea Corbellini Jun 06 '16 at 09:00
  • That's a weird looking href. Are you sure it's correct? – PM 2Ring Jun 06 '16 at 09:06
  • Edit: the Python shell shows you're right, it is href="\n\n\n\n\n\n\n https://www.amazon.co.uk/Sony-DSCW800-Digital-Compact-Optical/dp/B00IK01PJC/ref=zg_bs_560836_1/277-1976309-0409436\n">Sony DSCW800 Digital Compact. Either way, links = re.findall(r'href="(.*?)"', page) does not seem to work, which I thought it would, given it's just supposed to return anything between href=" and the next ". – entercaspa Jun 06 '16 at 09:09
  • But for the future: http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – Take_Care_ Jun 06 '16 at 09:21
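
On the href="(.*?)" attempt mentioned in the comments: a likely reason it returns nothing is that `.` does not match a newline by default, and the href value here starts with a run of newlines. The sketch below uses the sample link from the question as `page` (an assumption, the real page will have 15 such links) and compares the pattern with and without `re.DOTALL`. The `html.unescape` and `strip` calls are an added cleanup step, not something from the original posts.

```python
import html
import re

# Sample input copied from the question; "\n" here is a real newline character.
page = """<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>"""

# Without DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.findall(r'href="(.*?)"', page)

# With DOTALL, "." also matches newlines, so the whole attribute value is captured.
with_dotall = re.findall(r'href="(.*?)"', page, re.DOTALL)

# Decode entities such as &#10; and trim the surrounding whitespace.
links = [html.unescape(value).strip() for value in with_dotall]
print(links)
```

A negated character class such as `href="([^"]*)"` should behave the same way without the flag, since `[^"]` matches newlines too.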

1 Answer


Use the regexp below:

import re

s = """<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>"""

# Capture everything after the run of six newlines, up to the closing quote
re.findall('(?<=\n\n\n\n\n\n)(.*?)"', s)

The previous regexp looked for the \n... run at the beginning of the string (because of the ^ anchor), not in the middle of the string, where it occurs in the sample. It also spelled the newlines as /n instead of \n.
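
The lookbehind above depends on the href starting with exactly six newlines, which may not hold for every link (the second sample in the comments shows seven). As an alternative, here is a minimal sketch using the standard library's `html.parser`, so it does not rely on the newline count at all. The class name `LinkCollector` is hypothetical, and `page` is again just the sample from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Attribute values arrive with entities like &#10; already decoded,
            # so strip() removes the surrounding newlines and spaces.
            if name == "href" and value:
                self.links.append(value.strip())


# Sample input copied from the question; "\n" here is a real newline character.
page = """<a href="\n\n\n\n\n\n https://www.amazon.co.uk/Nikon-Coolpix-L340-Bridge-Camera/dp/B00THKEKEQ/ref=zg_bs_560836_2&#10;">Nikon Coolpix L340 Bridge Camera - Bl...</a>"""

collector = LinkCollector()
collector.feed(page)
collector.close()
links = collector.links
print(links)
```

On the real page this would collect every link, not only the 15 product ones, so some filtering (for example on "/dp/" in the URL) would probably still be needed.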

Andriy Ivaneyko