I'm crawling an HTML page and want to extract the src of each img tag and the href of each a tag.
On this particular site, all of these values are enclosed in double quotes.
I've tried a wide variety of regexps with no success. Assume the characters inside the double quotes will match [-\w/.] (that is, letters, digits, underscore, hyphen, / and .)
In Python:
re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)
Doesn't return anything, but
re.search(r'img\s+src="(?P[-\w[/]]+)"', line)
Returns way too much (i.e., it does not stop at the closing ").
I need help creating the right regexp. Thanks in advance!
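
For reference, here is a minimal sketch of the kind of pattern that would do this, assuming every value really is wrapped in double quotes as described above. The sample line and the names IMG_SRC, A_HREF and extract_links are made up for illustration; the real page markup may differ. Instead of listing the allowed characters, it uses a negated class, [^"]+, so the match has to stop at the closing quote.

```python
import re

# Hypothetical sample line; the actual page markup is not shown in the question.
line = '<a href="/gallery/item-1.html"><img src="/images/thumb_01.png"></a>'

# [^>]*? skips any other attributes inside the same tag.
# [^"]+ captures everything up to (but not including) the closing double quote.
# To allow only the characters listed in the question, swap [^"]+ for [-\w/.]+
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="(?P<src>[^"]+)"')
A_HREF = re.compile(r'<a\b[^>]*?\bhref="(?P<href>[^"]+)"')


def extract_links(text):
    """Return (list of img src values, list of a href values) found in text."""
    srcs = [m.group("src") for m in IMG_SRC.finditer(text)]
    hrefs = [m.group("href") for m in A_HREF.finditer(text)]
    return srcs, hrefs


srcs, hrefs = extract_links(line)
print(srcs)
print(hrefs)
```

Note that inside a character class the hyphen has to come first or last (or be escaped), otherwise Python tries to read it as a range. For anything beyond a quick crawl of one known site, an HTML parser such as the standard library's html.parser is likely to be more robust than a regexp.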