Python regex returns the value with parentheses

Question

I'm trying to run this code:

picture = re.search("#4F9EFF;\"><img src=\"(.+?)\" width=\"120\" height=\"90\"", data)

When I run print picture.groups(1), it returns the value, but wrapped in parentheses. Why?

Output:

('http://sample.com/img/file.jpg',)

score 4 · Answer 1 · edited May 23 '17 at 12:19

groups() returns a tuple, here containing one element. You can access the string (the first captured group) as output[0], where output is that tuple. The trailing comma after the string is what shows that it is a tuple.
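A minimal sketch of the point above, written for Python 3. The contents of data are a made-up sample shaped to fit the pattern from the question, and the name output simply mirrors the one used in this answer.

```python
import re

# Hypothetical input, invented to fit the pattern from the question.
data = ('<td style="color:#4F9EFF;"><img src="http://sample.com/img/file.jpg" '
        'width="120" height="90"></td>')

picture = re.search(r'#4F9EFF;"><img src="(.+?)" width="120" height="90"', data)

output = picture.groups()  # tuple holding every captured group (one here)
first = output[0]          # the captured string itself

print(output)  # ('http://sample.com/img/file.jpg',)
print(first)   # http://sample.com/img/file.jpg
```

The parentheses and trailing comma in the first line of output are just how Python displays a one-element tuple; they are not part of the matched text.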

BUT

DON'T PARSE HTML WITH REGEX

You should use a proper HTML parser. This will save you innumerable headaches in the future, when your regex fails to match or matches too much. Look into BeautifulSoup or lxml.
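As a rough illustration, here is a Python 3 sketch that uses the standard library's html.parser in place of BeautifulSoup or lxml, only so that it runs without installing anything. The class name ImgSrcCollector and the sample markup are assumptions made up for this example. The sample deliberately puts src after width and height, which is enough to make the pattern from the question fail.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "img":
            attr_map = dict(attrs)
            if "src" in attr_map:
                self.sources.append(attr_map["src"])


# Hypothetical markup: same image as in the question, attributes reordered.
data = ('<td style="color:#4F9EFF;"><img width="120" height="90" '
        'src="http://sample.com/img/file.jpg"></td>')

# The pattern from the question expects src directly after img.
regex_result = re.search(
    r'#4F9EFF;"><img src="(.+?)" width="120" height="90"', data)

collector = ImgSrcCollector()
collector.feed(data)
collector.close()

print(regex_result)       # None
print(collector.sources)  # ['http://sample.com/img/file.jpg']
```

BeautifulSoup and lxml offer more convenient interfaces for the same job; this sketch only shows that a parser does not depend on attribute order.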

edited May 23 '17 at 12:19

Community

answered Jul 18 '11 at 12:36

Katriel

1) user850019 doesn't want to parse an HTML file; he is searching for a string at a particular place in an HTML file. 2) Writing in BBB (big blue bold) letters doesn't explain why a regex would fail to match or would match too much in an HTML file. Without an explanation, it implicitly presents itself as an argument from authority. 3) For the same task, BeautifulSoup is approximately 10 times slower and lxml 100 times slower than a regex (from a benchmark I did once). – eyquem Jul 22 '11 at 11:36
@eyquem: 1) This question is about extracting the `src` attribute of an image tag in an HTML document. That's parsing. The argument for using regex is almost always "it's easier, and it works in this case". The problem is that it breaks in other cases, often for arcane reasons, and gets more and more complicated as you add in support for the various uses you see. 2) That's because this comes up so often that I can't be bothered to rehash the arguments. An obvious one is that this regex won't match ``, because it requires the `src` to be next to the `img`. – Katriel Jul 22 '11 at 11:42
@eyquem: 3) Of course BeautifulSoup is slower. It's doing more! The speed penalty is the price you pay for _getting the right answer_. – Katriel Jul 22 '11 at 11:43
1a) No, it isn't parsing, AFAIU. _"In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens (eg words), to determine its grammatical structure with respect to a given (more or less) formal grammar."_ (en.wikipedia.org/wiki/Parsing) _"extracting the src attribute of an image tag in an HTML document"_ is exactly what I described as _"searching a string at a particular place in a html file"_, and that is not analyzing the document in order to determine its HTML structure. So the OP isn't parsing the document. – eyquem Jul 22 '11 at 13:23
Yes, regexes are inadequate for parsing HTML/XML documents; I agree, having roughly understood why after reading a rough explanation of it. But the fact that regexes are not suited to parsing XML/HTML is not a reason to warn someone off performing a different kind of processing on an HTML document. – eyquem Jul 22 '11 at 13:23
1b) _"The argument for using regex is almost always "it's easier, and it works in this case". The problem is that it breaks in other cases,"_ And what is the problem if it breaks in cases other than the one at hand? Do you worry that you can't divide the items of a list by a number if the list contains strings, when you are sure that the actual list you need to map a division over contains ONLY numbers? Your argument is a strange one. – eyquem Jul 22 '11 at 13:24
2) _"An obvious one is that this regex won't match , because it requires the src to be next to the img."_ The problem is the same when someone searches for a string at a particular place after a certain word in a non-HTML/XML document. In that case, the writer of the regex pattern is mistaken about the nature of the text, believing that what he is searching for always comes right after the keyword: he is too naive about what the document may contain. But as soon as he considers the possibility of extraneous words between the keyword and the searched string,..... – eyquem Jul 22 '11 at 13:25
the regex tool is fully effective. It is a matter of the extent of a limited "region" and of the document's complexity. The problem is then solvable, depending on the skill of the pattern writer: the bigger the region and the greater its structural complexity, the more difficult it is to find an efficient regex pattern. The same holds for an HTML/XML document: as long as the goal is to match only a delimited region of limited complexity, the problem has nothing to do with the inadequacy of regexes for parsing the full, complex structure of an HTML/XML doc. – eyquem Jul 22 '11 at 13:26
But when the region and its complexity grow to the complete span and maximum complexity of an entire HTML/XML document, regexes surely stop being usable. BUT IMO the general impossibility of parsing an entire HTML/XML doc with regexes doesn't justify declaring that it is similarly impossible to use regexes on a delimited region of limited complexity in such a doc. – eyquem Jul 22 '11 at 13:27
3) If BeautifulSoup is doing more, and that additional work isn't necessary to obtain the desired result, the additional slowness buys nothing. 'More' doesn't mean 'appropriate'. – eyquem Jul 22 '11 at 13:28

score 1 · Answer 2 · answered Jul 18 '11 at 12:37

Notice the comma before the closing parenthesis? This is a tuple (albeit one with just one element in it).

As the documentation for MatchObject.groups() says:

groups([default])

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.
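To make the role of the default argument concrete, here is a small Python 3 sketch. The version-number pattern and input string are made up for illustration. Note that in the question's picture.groups(1), the 1 is taken as this default value, not as a group index.

```python
import re

# Hypothetical pattern: the second group (the part after the dot) is optional.
m = re.search(r"(\d+)(?:\.(\d+))?", "version 24")

no_default = m.groups()       # group 2 did not participate, so it is None
with_default = m.groups("0")  # "0" stands in for the missing group
with_one = m.groups(1)        # 1 is used as the default, not as an index

print(no_default)    # ('24', None)
print(with_default)  # ('24', '0')
print(with_one)      # ('24', 1)
```

In every case the result is still a tuple; the argument only changes what fills the slots of groups that did not take part in the match.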

As noted by other posters, you want to use MatchObject.group() instead.

score 1 · Answer 3 · answered Jul 18 '11 at 12:38

You should be using

picture.group(1)

rather than the plural groups() if you only want one specific group. groups() always returns a tuple; group() is the one you're looking for.
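For reference, a short Python 3 sketch of how group() behaves with different arguments. The sample tag and the two-group pattern are assumptions chosen for illustration, not the asker's actual data.

```python
import re

# Hypothetical input and pattern with two capturing groups.
data = '<img src="http://sample.com/img/file.jpg" width="120" height="90">'
m = re.search(r'<img src="(.+?)" width="(\d+)"', data)

whole = m.group(0)    # the entire matched text; same as m.group()
src = m.group(1)      # first captured group, returned as a plain string
width = m.group(2)    # second captured group
both = m.group(1, 2)  # several indices at once give a tuple

print(src)    # http://sample.com/img/file.jpg
print(width)  # 120
print(both)   # ('http://sample.com/img/file.jpg', '120')
```

Keep in mind that re.search returns None when nothing matches, so real code would normally check the result before calling group().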

answered Jul 18 '11 at 12:38

Mad Scientist

score 0 · Answer 4 · answered Jul 18 '11 at 12:38

groups() returns a tuple of all the groups. You want picture.group(1), which returns the string that matched group 1.

answered Jul 18 '11 at 12:38

Ned Batchelder

score 0 · Answer 5 · answered Jul 18 '11 at 12:39

As the help for groups says, it returns "a tuple containing all the subgroups of the match". If you want a single group, use the group method.

answered Jul 18 '11 at 12:39

Douglas Leeder
