0

I'm trying to search HTML pages with Python.

I need to find all the links in a page that match a certain pattern and then return the whole URL of each one.

A link can look like this: href="http://example.com/page/subpage/unik-id-12345". I have tried to write a small regex to pull out a sample:

href\=\"(.*)\">

The problem is that it captures everything after the opening quote, and I can't work out how to match only what is inside the href attribute.

I hope that makes sense and that you can help me fix this.

What I want is to search for, e.g., example.com/page

ParisNakitaKejser
  • Two things: 1) You shouldn't use regex to parse HTML. That's the job of `BeautifulSoup` or other HTML/XML parsers. 2) What method are you using to get to the group that you want? `re.match(r'href\=\"(.*)\"', 'href="http://example.com/page/subpage/unik-id-12345"').group(1)` works just fine. – tblznbits Jan 08 '16 at 20:24
  • The problem is that regex quantifiers are greedy by default, so '*' means to match as much as possible (which will often read past the point where you want it to stop). The trick is to make the quantifier lazy, so it only reads as much as needed, and no more. – Matthew Jan 08 '16 at 20:27
  • Possible duplicate of [Getting parts of a URL (Regex)](http://stackoverflow.com/questions/27745/getting-parts-of-a-url-regex) – Jeffrey Swan Jan 08 '16 at 20:41

3 Answers

3
import re
s = 'href="http://example.com/page/subpage/unik-id-12345">'
res = re.search('href=\"(.+?)\">', s).group(1)
print(res)
# Output: http://example.com/page/subpage/unik-id-12345

By the way, it is better to use a dedicated library, such as lxml, for HTML parsing.
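To get closer to what the question asks for (several links on one page, keeping only those that contain example.com/page), a minimal sketch could look like this. The sample HTML and the `wanted` fragment are made up for illustration; a real page may quote or order attributes differently.

```python
import re

# Hypothetical sample page; only the first and third links should be kept.
html = (
    '<a href="http://example.com/page/subpage/unik-id-12345">one</a> '
    '<a href="http://example.com/other/unik-id-99999">two</a> '
    '<a href="http://example.com/page/unik-id-67890">three</a>'
)

# Lazy quantifier: each match stops at the first closing quote after href=".
hrefs = re.findall(r'href="(.+?)"', html)

# Keep only the URLs that contain the wanted fragment.
wanted = "example.com/page"
matches = [url for url in hrefs if wanted in url]
print(matches)
```

`re.findall` returns the captured group for every match, so the full URL of each matching link is available without calling `.group(1)` in a loop.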

drjackild
1
import re
regex = re.compile('<href="(.*)">')
url = '<href="https://stackoverflow.com/">'
m = regex.search(url)

Then you can get the group

>>> m.group(0)
'<href="https://stackoverflow.com/">'
>>> m.group(1)
'https://stackoverflow.com/'

PS: if you are trying to do web scraping, it would be easier to use a library specifically designed for that, like BeautifulSoup. Tutorials on how to use it are easy to find on the web.
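One caveat with `(.*)`: it is greedy, so it only behaves as shown when the string holds a single link. The sketch below, using a made-up line with two links, compares it with a negated character class `[^"]*`, which cannot run past the closing quote.

```python
import re

# Hypothetical line containing two links.
line = '<a href="https://stackoverflow.com/">SO</a> <a href="https://example.com/">Ex</a>'

# Greedy: .* runs on to the LAST "> in the string.
greedy = re.search(r'href="(.*)">', line).group(1)

# Negated class: [^"]* stops at the first double quote, one match per link.
specific = re.findall(r'href="([^"]*)"', line)

print(greedy)
print(specific)
```

Here `greedy` swallows everything between the first `href="` and the last `">`, while `specific` holds the two URLs separately.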

Digisec
  • Shouldn't that be ``? Otherwise it would match until the last "> of the string – eddy Jan 08 '16 at 20:26
  • @eddy_hunter It will if you have multiple `">` sequences in the string you're matching against, so that would make sense. It would be even better to use `[^"]`, which makes the pattern more specific. – Digisec Jan 08 '16 at 20:29
1

Are you aware of regex101.com? It's a great tool for tweaking your regexes.

If I understand your problem right, you're matching href="http://example.com/page/subpage/unik-id-12345">, and you want to just get http://example.com/page/subpage/unik-id-12345

One way would be to just grab http(s)://, followed by anything that's not a quotation mark: http(s?):\/\/[^"]*

If you have multiple links and only want the ones inside an href attribute, you'd probably have to use your own regex first and then apply another operation to extract just the URL (e.g. match.split("\"")[1]).
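As a rough illustration of the difference, the sketch below runs the bare URL pattern and an href-anchored variant over the same made-up snippet. The capture group in the second pattern is one possible alternative to splitting on quotes afterwards, not the only way to do it.

```python
import re

# Hypothetical snippet: one URL in a src attribute, one in an href attribute.
html = (
    '<img src="http://example.com/logo.png"> '
    '<a href="https://example.com/page/unik-id-12345">link</a>'
)

# Bare URL pattern: matches any http(s) URL up to the next double quote.
any_url = re.findall(r'https?://[^"]*', html)

# Anchored on href=": only URLs that are the value of an href attribute.
href_only = re.findall(r'href="(https?://[^"]*)"', html)

print(any_url)
print(href_only)
```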

Or you could just use an HTML parser like BeautifulSoup
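For a parser-based version that needs no third-party install, here is a minimal sketch using the standard library's `html.parser` in place of BeautifulSoup (a deliberate swap, so this is not BeautifulSoup's API). The `LinkCollector` class, the sample page, and the `wanted` fragment are all hypothetical names chosen for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that contain a given fragment."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.wanted in value:
                self.links.append(value)


# Hypothetical sample page.
page = (
    '<a href="http://example.com/page/subpage/unik-id-12345">one</a>'
    '<a href="http://example.com/other/unik-id-99999">two</a>'
    '<img src="http://example.com/page/logo.png">'
)

collector = LinkCollector("example.com/page")
collector.feed(page)
print(collector.links)
```

Because the parser only reports real tags and attributes, the `src` URL is ignored even though it contains the same fragment.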