Using a Python regular expression to find an image path

Question

I have a variable like the one below:

var = '<img src="path_1"><p>Words</p><img src="path_2>'

It's a string, but it obviously contains HTML elements. How do I get only the first path (i.e. path_1) using a regex?

I am trying something like this:

match = re.match(r'src=\"[\w-]+\"', var)
print match.group(0)

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

Any help is appreciated.

`match` only matches at the beginning of the string: [`If zero or more characters at the beginning of string match the regular expression pattern`](https://docs.python.org/2/library/re.html#re.match) – rock321987, Apr 26 '16 at 15:04

score 4 · Accepted Answer · edited May 23 '17 at 12:16

You should use an HTML parser like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> var = '<img src="path_1"><p>Words</p><img src="path_2>'
>>> soup = BeautifulSoup(var, "html.parser")
>>> soup.img["src"]
'path_1'

As for the regex approach, you need the following changes to make it work:

switch to re.search(), because re.match() only matches at the beginning of the string
add a capturing group to capture the src value
drop the backslashes before the double quotes, since they do not need escaping here

Fixed version:

>>> re.search(r'src="([\w-]+)"', var).group(1)
'path_1'
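
To make the difference between the two calls concrete, here is a minimal sketch that runs both on the string from the question. The variable names `at_start`, `anywhere`, and `first_path` are only illustrative, and the `None` check is an addition so that a failed match does not raise the AttributeError from the question.

```python
import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'
pattern = r'src="([\w-]+)"'

# re.match() only succeeds if the pattern matches at position 0,
# and this string starts with "<img", not "src=".
at_start = re.match(pattern, var)

# re.search() scans the string and returns the first match anywhere.
anywhere = re.search(pattern, var)

# Guard against None instead of calling .group() unconditionally.
first_path = anywhere.group(1) if anywhere else None

print(at_start)    # None
print(first_path)  # path_1
```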

answered Apr 26 '16 at 15:08

alecxe

I would say: you ***should*** use an HTML parser. – Pedro Lobito Apr 26 '16 at 15:09
@PedroLobito definitely, made changes and referenced the famous thread. Thanks. – alecxe Apr 26 '16 at 15:13
Wow I actually did not even know this existed. This looks like a perfect fit. Thanks a lot! – darkhorse Apr 26 '16 at 15:15
One question, how do I get the second path? Since soup.img["src"] only returns the first one. – darkhorse Apr 26 '16 at 15:24
@TahmidKhanNafee sure, you can use `find_all()`. E.g. the second image: `soup.find_all("img")[1]["src"]`. Or, all `src` values of all images: `[img["src"] for img in soup.find_all("img")]`. – alecxe Apr 26 '16 at 15:25

score 2 · Answer 2 · answered Apr 26 '16 at 15:10

As suggested in the comments, use search(), since match() only tries to match your regular expression at the beginning of the string. You can also use a named capturing group to make the code more readable:

import re

var = '<img src="path_1"><p>Words</p><img src="path_2>'
# The named group "path1" captures the value of the src attribute.
match = re.search(r'src="(?P<path1>[\w-]+)"', var)
if match:
    print(match.group('path1'))

Output:

path_1
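
If every path is needed rather than just the first, the same named group can be reused with `re.finditer()`. This is only a sketch: it assumes the second tag is well-formed (the closing quote after path_2 is added here, unlike in the question's string), and the names `html` and `paths` are illustrative.

```python
import re

# Assumption: both tags are well-formed, so the closing quote after
# path_2 is present here (it is missing in the question's string).
html = '<img src="path_1"><p>Words</p><img src="path_2">'

# finditer() yields one match object per non-overlapping match, in order.
paths = [m.group('path') for m in re.finditer(r'src="(?P<path>[\w-]+)"', html)]

print(paths)  # ['path_1', 'path_2']
```

With the original string from the question, only path_1 would be returned, because the pattern requires a closing double quote.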

score 1 · Answer 3 · answered Apr 26 '16 at 15:36

Try:

path1 = re.search(r'<img\s+src="(.*?)"><p>', var).group(1)  # path_1

BeautifulSoup is convenient, but very slow.
HTMLParser is a lot faster, but painful to use.
re is the fastest option and, in my opinion, worth it for stateless use cases.

If the target text is stateful, i.e. it has lots of nesting and capturing the semantics is important, use an available parser instead of implementing a state machine (that is, a parser) yourself. I would strongly suggest lxml for parsing HTML and XML. It is a little less convenient than bs4 but comparable to re in speed.
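
For comparison, here is a rough sketch of what the HTMLParser route mentioned above might look like using only the standard library. It assumes Python 3 (where the module is `html.parser`) and well-formed input with the closing quote after path_2 present; `ImgSrcCollector` and `collect_img_sources` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def collect_img_sources(html):
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Assumption: well-formed tags, unlike the question's original string.
sources = collect_img_sources('<img src="path_1"><p>Words</p><img src="path_2">')
print(sources)  # ['path_1', 'path_2']
```

The subclass-and-callback style is more verbose than a one-line regex, which is probably what "painful" refers to, but it does not depend on the exact spacing or attribute order inside the tag.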

C Panda

It is too strong of a statement to say `BeautifulSoup` is "very slow". You can configure it to use a different parser under the hood: say `lxml`: `BeautifulSoup(data, "lxml")`. Or you can parse a part of a document via `SoupStrainer` etc. – alecxe Apr 26 '16 at 15:53
I am aware of it. Even if you use `lxml` under the hood, it is slower than `re` by an order of magnitude, because of all the object creation and lookups. – C Panda Apr 26 '16 at 16:08
