1

I have a variable like the one below:

var = '<img src="path_1"><p>Words</p><img src="path_2>'

It's a string, but it obviously contains HTML elements. How do I get only the first path (i.e. path_1) using a regex?

I am trying something like this:

match = re.match(r'src=\"[\w-]+\"', var)
print match.group(0)

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

Any help is appreciated.

darkhorse
  • `match` only matches at the beginning of the string: [`If zero or more characters at the beginning of string match the regular expression pattern`](https://docs.python.org/2/library/re.html#re.match) – rock321987 Apr 26 '16 at 15:04

3 Answers

4

You should use an HTML parser like BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> var = '<img src="path_1"><p>Words</p><img src="path_2>'
>>> soup = BeautifulSoup(var, "html.parser")
>>> soup.img["src"]
'path_1'

As for the regex-approach, you need to make the following changes to make it work:

  • switch to re.search(), because re.match() only matches at the beginning of the string
  • add a capturing group to capture the src value
  • there is no need to escape double quotes

Fixed version:

>>> re.search(r'src="([\w-]+)"', var).group(1)
'path_1'
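To make the difference between the two calls concrete, here is a minimal sketch using the string from the question. The variable names `at_start`, `anywhere`, and `first_path` are only illustrative, and Python 3 `print()` syntax is assumed.

```python
import re

# Same sample string as in the question (the second src is missing its closing quote).
var = '<img src="path_1"><p>Words</p><img src="path_2>'
pattern = r'src="([\w-]+)"'

# re.match() only tries position 0, where the string starts with '<img', so it fails.
at_start = re.match(pattern, var)

# re.search() scans forward and returns the first occurrence anywhere in the string.
anywhere = re.search(pattern, var)

# Guard against None so a non-matching input does not raise AttributeError.
first_path = anywhere.group(1) if anywhere else None

print(at_start)    # None
print(first_path)  # path_1
```

Checking the result for `None` before calling `.group()` avoids the `AttributeError` shown in the question.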
alecxe
  • 3
    I would say : You ***should*** use an HTML parser – Pedro Lobito Apr 26 '16 at 15:09
  • 2
    @PedroLobito definitely, made changes and referenced the famous thread. Thanks. – alecxe Apr 26 '16 at 15:13
  • 1
    Wow I actually did not even know this existed. This looks like a perfect fit. Thanks a lot! – darkhorse Apr 26 '16 at 15:15
  • One question, how do I get the second path? Since soup.img["src"] only returns the first one. – darkhorse Apr 26 '16 at 15:24
  • 1
    @TahmidKhanNafee sure, you can use `find_all()`. E.g. the second image: `soup.find_all("img")[1]["src"]`. Or, all `src` values of all images: `[img["src"] for img in soup.find_all("img")]`. – alecxe Apr 26 '16 at 15:25
2

As suggested in the comments, use search(), since match() only tries to match your regular expression at the beginning of the string. You can also use a named capturing group to make the code more readable:

var = '<img src="path_1"><p>Words</p><img src="path_2>'
import re
match = re.search(r'src=\"(?P<path1>[\w-]+)\"', var)
if match:
    print(match.group('path1'))

Output:

path_1
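If every path is needed rather than just the first, the same named group can be combined with `re.finditer()`. This is only a sketch: the input string `html` below is a hypothetical, well-formed variant of the question's string (both `src` values have closing quotes), and the group name `path` is chosen for illustration.

```python
import re

# Hypothetical, well-formed input: unlike the question's string,
# both src attributes have their closing quote.
html = '<img src="path_1"><p>Words</p><img src="path_2">'

# finditer() yields one match object per occurrence, scanning left to right.
paths = [m.group('path') for m in re.finditer(r'src="(?P<path>[\w-]+)"', html)]

print(paths)  # ['path_1', 'path_2']
```

With the original string from the question, only `path_1` would be returned, because the second `src` value is never closed by a double quote.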
Cyb3rFly3r
1

Try,

path1= re.search(r'<img\s+src="(.*?)"><p>',var).group(1) # path_1
  1. BeautifulSoup is convenient, but very slow.

  2. HTMLParser is a lot faster, but painful to use.

  3. re is the fastest option, and in my opinion it's worth it for stateless use cases.

If the target text is stateful, i.e. there is lots of nesting and capturing the semantics is important, use an available parser instead of implementing a state machine (e.g. a parser) yourself. I would strongly suggest lxml for parsing HTML and XML. It is a little less convenient than bs4 but comparable to re in speed.
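For comparison, this is roughly what the HTMLParser route looks like. It is a sketch under a few assumptions: it uses Python 3's `html.parser` module, the class name `ImgSrcCollector` is made up for illustration, and the input is a hypothetical, well-formed variant of the question's string.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Hypothetical, well-formed input (both src values are properly quoted).
html = '<img src="path_1"><p>Words</p><img src="path_2">'

collector = ImgSrcCollector()
collector.feed(html)
collector.close()

print(collector.sources)  # ['path_1', 'path_2']
```

The subclass and callback are the boilerplate referred to above as painful; in exchange, the parser handles attribute order and whitespace that a hand-written pattern would need to account for.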

C Panda
  • It is too strong of a statement to say `BeautifulSoup` is "very slow". You can configure it to use a different parser under the hood: say `lxml`: `BeautifulSoup(data, "lxml")`. Or you can parse a part of a document via `SoupStrainer` etc. – alecxe Apr 26 '16 at 15:53
  • I am aware of it. Even if you use `lxml` under the hood, it is slower than `re` by an order of magnitude, because of all the object creation and lookups. – C Panda Apr 26 '16 at 16:08