My problem is that I want to match URLs in HTML code that look like href='example.com' (or the same with " quotes), but I only want to extract the actual URL. I tried matching the whole attribute and then slicing the result to get only the URL, but since the regex match is greedy, if there is more than one match, I get many extra matches that start at one ' and end at another URL's '. What regex will suit my needs?

DaniFoldi
  • So you want a regex that checks for `href=` first and then captures whatever address comes after it? Do you need to check for http, www, or anything like that? – Shan Oct 02 '18 at 17:25
  • If you Google the phrase "Python regex URL", you’ll find tutorials that can explain it much better than we can in an answer here. After that, we should see the code you're using and the *specific* problem you have. [How to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. – Prune Oct 02 '18 at 17:31
  • @Shan any URL should be matched, so a (asterisk)(dot)(asterisk) - formatting is what I want to use – DaniFoldi Oct 02 '18 at 17:31
  • I think you will get a kick out of the answer on this question: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – PixelEinstein Oct 02 '18 at 17:36
  • @DaniFoldi so... you just answered your own question here? – Shan Oct 02 '18 at 17:36

2 Answers

I would recommend NOT using regex to parse HTML. Your life will be much easier if you use an HTML parsing library such as Beautiful Soup!

It's as easy as this:

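from bs4 import BeautifulSoup  # Beautiful Soup 4 (pip install beautifulsoup4)

HTML = """<a href="https://firstwebsite.com">firstone</a><a href="https://secondwebsite.com">Ihaveurls</a>"""

# Name the parser explicitly so Beautiful Soup does not have to guess one
s = BeautifulSoup(HTML, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
for a in s.find_all('a', href=True):
    print("My URL: ", a['href'])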
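
If installing a third-party package is not an option, roughly the same idea can be sketched with the standard library's `html.parser` module. This is only a minimal sketch: `HrefCollector` and `extract_hrefs` are hypothetical names made up for this example, and the sample HTML is invented to show both quote styles from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all <a> tags in the given HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Invented sample input covering single and double quotes
sample = """<a href='example.com'>one</a><a href="https://secondwebsite.com">two</a>"""
print(extract_hrefs(sample))
```

The parser strips the surrounding quotes itself, so single-quoted and double-quoted attributes need no special handling.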
PixelEinstein

If you want to solve it with a regular expression instead of another Python library, here is a solution.

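import re
html = '<a href="https://www.abcde.com"></a>'
# Non-greedy (.*?) stops at the first closing quote, so adjacent links are not merged
pattern = r'href="(.*?)"|href=\'(.*?)\''
multiple_match_links = re.findall(pattern, html)
if len(multiple_match_links) == 0:
    print("No Link Found")
else:
    # Each match is a (double_quoted, single_quoted) tuple; only one of the two is non-empty
    print([double or single for double, single in multiple_match_links])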
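
To illustrate the greedy-matching problem described in the question, here is a minimal sketch that compares a greedy pattern with a non-greedy one on several links. The sample HTML and the names `HREF_RE` and `find_hrefs` are made up for this example; the backreference `\1` is just one possible way to require the closing quote to be the same character as the opening one.

```python
import re

# Invented sample input mixing single and double quotes
html = "<a href='first.com'>one</a><a href=\"second.com\">two</a><a href='third.com'>three</a>"

# Greedy: .* runs to the LAST single quote in the string, swallowing the links in between
greedy = re.findall(r"href='(.*)'", html)

# Non-greedy: group 1 captures the opening quote, (.*?) stops at the first
# occurrence of that same quote character (the \1 backreference)
HREF_RE = re.compile(r"""href=(["'])(.*?)\1""")


def find_hrefs(text):
    """Return every quoted href value found in text (hypothetical helper)."""
    return [match.group(2) for match in HREF_RE.finditer(text)]


print(len(greedy))       # a single oversized match instead of one per link
print(find_hrefs(html))  # one entry per link
```

This still only looks at the text, so it will also pick up `href=` inside comments or scripts; a real HTML parser avoids that.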
sadiq shah