How to match URLs with python regular expression?

Question

My problem is that I want to match URLs in HTML code that look like href='example.com' (or the same with double quotes), but I only want to extract the actual URL. I tried matching the whole attribute and then using array magic to keep only the URL, but since the regex match is greedy, if there is more than one possible match, I also get lots of extra matches that start at one URL's ' and end at another URL's '. What regex will suit my needs?

So you want a regex that checks for `href=` first, and the address after it is what you want? Any need for http, checking for www, or anything like that? — Shan, Oct 02 '18 at 17:25
If you Google the phrase "Python regex URL", you’ll find tutorials that can explain it much better than we can in an answer here. After that, we should see the code you're using and the *specific* problem you have. [How to ask](http://stackoverflow.com/help/how-to-ask), and [... the perfect question](https://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/) apply here. — Prune, Oct 02 '18 at 17:31
@Shan any URL should be matched, so a (asterisk)(dot)(asterisk) - formatting is what I want to use — DaniFoldi, Oct 02 '18 at 17:31
I think you will get a kick out of the answer on this question: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — PixelEinstein, Oct 02 '18 at 17:36

PixelEinstein · Accepted Answer · 2018-10-04T15:15:40.470

3

I would recommend NOT using regex to parse HTML. Your life will be much easier if you use something like BeautifulSoup!

It's as easy as this:

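# BeautifulSoup 4 (pip install beautifulsoup4); find_all is the bs4 method name
from bs4 import BeautifulSoup

HTML = """<a href="https://firstwebsite.com">firstone</a><a href="https://secondwebsite.com">Ihaveurls</a>"""

# Name the parser explicitly so the result does not depend on which parsers are installed
s = BeautifulSoup(HTML, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute
for a in s.find_all('a', href=True):
    print("My URL: ", a['href'])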
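If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module. This is only an illustration, not part of the original answer: `HrefCollector`, `extract_hrefs`, and `sample` are hypothetical names made up for the example, and the sample mixes single and double quotes to show that the parser strips either kind.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the quotes are already stripped
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return the href values of all <a> tags in html, in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Assumed sample input: one single-quoted and one double-quoted href
sample = (
    "<a href='https://firstwebsite.com'>firstone</a>"
    '<a href="https://secondwebsite.com">Ihaveurls</a>'
)

for url in extract_hrefs(sample):
    print("My URL:", url)
```

Because the parser tokenizes each tag on its own, a quote belonging to one link cannot be paired with a quote from another, which is the failure mode described in the question.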

edited Oct 04 '18 at 15:15

answered Oct 02 '18 at 17:33

PixelEinstein


Thanks, it really did the job! – DaniFoldi Oct 02 '18 at 18:11

score 1 · Answer 2 · answered Oct 04 '18 at 12:18

If you want to solve it with a regular expression instead of another Python library, here is a solution.

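import re

html = '<a href="https://www.abcde.com"></a>'

# (["']) captures the opening quote, the lazy (.*?) stops at the first
# closing quote, and \1 requires that closing quote to match the opening one.
pattern = r"""href=(["'])(.*?)\1"""

# findall returns one (quote, url) tuple per match; keep only the URL
multiple_match_links = [url for _quote, url in re.findall(pattern, html)]

if not multiple_match_links:
    print("No Link Found")
else:
    for link in multiple_match_links:
        print(link)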
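The greedy-match problem from the question can be reproduced in a few lines, which may make it clearer why the quantifier matters. This is a minimal sketch on made-up input: the `html` string and the `.example` URLs are assumptions for illustration, not data from the thread.

```python
import re

# Assumed sample input: three links on one line, with mixed quote styles
html = (
    "<a href='https://first.example'>one</a> "
    '<a href="https://second.example">two</a> '
    "<a href='https://third.example'>three</a>"
)

# Greedy: (.*) runs to the LAST single quote on the line, so the single
# match swallows everything between the first and the third link.
greedy = re.findall(r"href='(.*)'", html)

# Lazy: (.*?) stops at the first quote that matches the opening one (\1),
# so every href yields its own match.
lazy = re.findall(r"""href=(["'])(.*?)\1""", html)
urls = [url for _quote, url in lazy]

print(len(greedy))
print(urls)
```

A negated character class such as `[^"']*` in place of `.*?` is another common way to get the same effect, at the cost of rejecting URLs that contain the other kind of quote.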

answered Oct 04 '18 at 12:18

sadiq shah


Thanks, it is interesting to see that it is possible in just a few lines without libraries. – DaniFoldi Oct 06 '18 at 19:54
