
I'm trying to extract the image source URL from an HTML img tag.

If the HTML data looks like this:

<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>

or

<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>

what regex would extract the src value in Python?

I tried the following:

i = re.compile('(?P<src>src=[["[^"]+"][\'[^\']+\']])')
i.search(htmldata)

but I got an error:

Traceback (most recent call last):
File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
demonplus
eachone
  • Have you already tried to create regex yourself; that would help –  Nov 21 '15 at 09:13
  • The above 2 lines of code do not give you that error. –  Nov 21 '15 at 09:14
  • Possible duplicate of [Python Regex String Extraction](http://stackoverflow.com/questions/7384275/python-regex-string-extraction) – Ozan Nov 21 '15 at 10:15

2 Answers


An HTML parser such as BeautifulSoup is the way to go.

>>> from bs4 import BeautifulSoup
>>> s = '''<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>'''
>>> soup = BeautifulSoup(s, 'html.parser')
>>> img = soup.select('img')
>>> [i['src'] for i in img if i.get('src')]
[u'http://domain.com/profile.jpg']
>>> 
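If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's `html.parser` module instead. This is a swapped-in technique, not the approach shown above, and it assumes Python 3; the names `ImgSrcCollector` and `extract_img_sources` are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the parser has already
        # removed the surrounding single or double quotes.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


single = "<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>"
double = '<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>'
print(extract_img_sources(single))
print(extract_img_sources(double))
```

Because the parser handles the quoting, both quote styles from the question go through the same code path, and `<img>` tags without a `src` are skipped rather than raising an error.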
Avinash Raj
  • this is a useful answer but someone please accept the requested edits - the edit queue is apparently full people have wanted to edit this one so much lol. Remove those '>>>' chars to support copy-paste programmers :P – codeAndStuff Jun 29 '22 at 15:09

I adapted your code a little bit. Please take a look:

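import re

url = """<div> My profile <img width="300" height="300" src="http://domain.com/profile.jpg"> </div>"""
url1 = """<div> My profile <img width='300' height='300' src='http://domain.com/profile.jpg'> </div>"""

# Accept either quote style around the value. Note that (.+) is greedy, so this
# only works when src is the last quoted value on the line.
link = re.compile("""src=[\"\'](.+)[\"\']""")

links = link.finditer(url)
for l in links:
    print(l.group())   # the whole match, including src=
    print(l.groups())  # tuple of captured groups, i.e. the URL

links1 = link.finditer(url1)
for l in links1:
    print(l.groups())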

In l.groups() you can find the link.

The output is this:

src="http://domain.com/profile.jpg"
('http://domain.com/profile.jpg',)
('http://domain.com/profile.jpg',)

finditer() returns an iterator over the match objects, so it can be used directly in a for loop.
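A greedy `(.+)` runs to the last quote on the line, so it captures too much when other attributes follow `src`. Below is a minimal sketch of a stricter pattern, assuming Python 3. It uses a non-greedy group plus a backreference so the closing quote must match the opening one. The names `IMG_SRC` and `find_img_sources` are made up for illustration, and a regex like this is still fragile on unusual HTML.

```python
import re

# (["']) captures the opening quote and \1 requires the same quote to close.
# [^>]*? keeps the search inside the current tag, \s before src avoids matching
# attributes such as data-src, and .*? is non-greedy so the match stops at the
# first closing quote instead of the last one on the line.
IMG_SRC = re.compile(
    r"""<img\b[^>]*?\ssrc\s*=\s*(["'])(?P<src>.*?)\1""",
    re.IGNORECASE,
)


def find_img_sources(html):
    """Return the src value of every <img> tag matched by IMG_SRC."""
    return [m.group("src") for m in IMG_SRC.finditer(html)]


html = """<div> My profile <img width="300" src="http://domain.com/profile.jpg" alt="me"> </div>"""
print(find_img_sources(html))
```

The named group `src` means the URL can be read with `m.group("src")` without counting parentheses.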

Sources:

http://www.tutorialspoint.com/python/python_reg_expressions.htm

https://docs.python.org/2/howto/regex.html

rocksteady
  • It will not work if there are other attributes after src, and your group also fails to capture `/:-.` etc., which can be part of the URL. Here is my pattern: `src=[\"\']([a-zA-Z0-9_\.\/\-:]+)[\"\']` – Moshe Shperling May 24 '21 at 06:50
  • There is definitely room for improvement. Thanks for your input. – rocksteady May 25 '21 at 09:34