Confused on this regular expression pattern in Python

Question

I want to find a 6-digit number in my webpage:

<td style="width:40px;">705214</td>

My code is:

s = f.read()
m = re.search(r'\A>\d{6}\Z<', s)
l = m.group(0)

score 2 · Accepted Answer · answered Feb 24 '12 at 05:56

If you just want to find 6 digits in between a > and < symbol, use the following regex:

import re
s = '<td style="width:40px;">705214</td>'
m = re.search(r'>(\d{6})<', s)
l = m.groups()[0]

Note the use of parentheses ( and ) to denote a capturing group.
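For context on why the pattern in the question finds nothing: \A matches only at the very start of the string and \Z only at the very end, so \A>\d{6}\Z< asks for a < after the end of the string, which can never happen. re.search then returns None, and calling m.group(0) on None raises an AttributeError. A minimal sketch of the difference, reusing the sample string from this answer (the variable names broken and fixed are only for illustration):

```python
import re

s = '<td style="width:40px;">705214</td>'

# \A anchors to the start of the string and \Z to the end, so this
# pattern demands a '<' after the end of the string and never matches.
broken = re.search(r'\A>\d{6}\Z<', s)
print(broken)  # None

# Without the anchors, the digits between '>' and '<' are captured.
fixed = re.search(r'>(\d{6})<', s)
print(fixed.group(1))  # 705214
```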

answered Feb 24 '12 at 05:56

David Robinson

Thank you, but how can I do this for multiple strings like the one above? That is, if I have multiple 6-digit numbers, how can I pick them all up? – Mahdi Feb 24 '12 at 06:14
Technically a separate question, but try this: m = re.findall(r'>(\d{6})<', s + s + s) ... m is then a list of matches. – Paul Karlin Feb 24 '12 at 06:25
@PaulKarlin's answer is quite right, though note that the `s + s + s` is only for the test case (don't do that with your actual html) – David Robinson Feb 24 '12 at 06:29
Yes, sorry, it's a bit harder to demonstrate a working example in comments, which is why I subsequently edited my answer. :-) – Paul Karlin Feb 24 '12 at 06:34

Paul Karlin · Answer 2 · 2012-02-24T06:29:16.137

I think you want something like this:

m = re.search(r'>(\d{6})<', s)
l = m.group(1)

The parentheses around \d{6} create a capturing group; m.group(1) returns just the text that group matched.

If you want to find multiple instances of 6-digit substrings between > and < then try this:

s = '<tag1>111111</tag1> <tag2>222222</tag2>'
m = re.findall(r'>(\d{6})<', s)

In this case, m will be ['111111','222222'].
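If the position of each number is also needed, re.finditer may be a better fit, since it yields Match objects rather than plain strings. The sketch below reuses the sample string above; the names numbers and positions are only illustrative:

```python
import re

s = '<tag1>111111</tag1> <tag2>222222</tag2>'

# With one capturing group, findall returns a list of the captured strings.
numbers = re.findall(r'>(\d{6})<', s)
print(numbers)  # ['111111', '222222']

# finditer yields Match objects, so group() and start() are available.
positions = [(m.group(1), m.start(1)) for m in re.finditer(r'>(\d{6})<', s)]
print(positions)  # [('111111', 6), ('222222', 26)]
```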

edited Feb 24 '12 at 06:29

answered Feb 24 '12 at 05:56

Paul Karlin

Do you get an error, or no match? Is your page guaranteed to have a 6-digit string inside a tag like the example in your question? – Paul Karlin Feb 24 '12 at 06:08
No problem. I do recommend adding some basic error checking around the code, e.g. make sure m is not None before setting l, just in case the code runs against a page that doesn't contain the desired pattern. – Paul Karlin Feb 24 '12 at 06:21
Thank you Paul, but how can I do this for multiple strings like the one above? That is, if I have multiple 6-digit numbers, how can I pick them all up? – Mahdi Feb 24 '12 at 06:24
Edited my answer, hope that helps. – Paul Karlin Feb 24 '12 at 06:29
Thank you for your replies, Paul. Can I use `for NOM in m: print m.group(NOM)` to print all of the groups? It doesn't work for me. – Mahdi Feb 24 '12 at 06:38
No, findall() doesn't return a Match object, just a list that you can iterate directly, e.g. `for NOM in m: print NOM` – Paul Karlin Feb 24 '12 at 06:43
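Putting the two suggestions from these comments together (check for None after re.search, and iterate the findall list directly), one possible sketch follows. The function names find_six_digit_numbers and first_six_digit_number and the sample page string are made up for illustration and are not part of the original answer:

```python
import re

def find_six_digit_numbers(html):
    """Return every 6-digit number found directly between '>' and '<'."""
    # findall returns a plain list (empty if nothing matches), never None.
    return re.findall(r'>(\d{6})<', html)

def first_six_digit_number(html):
    """Return the first such number, or None if the page has none."""
    m = re.search(r'>(\d{6})<', html)
    # search returns None on failure, so check before calling group().
    return m.group(1) if m is not None else None

page = '<td>705214</td><td>123456</td>'
for number in find_six_digit_numbers(page):
    print(number)

print(first_six_digit_number('<td>no digits here</td>'))  # None
```

Because findall returns an empty list when nothing matches, the loop simply does nothing in that case; only the re.search path needs the explicit None check.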

score 1 · Answer 3 · answered Feb 24 '12 at 06:00

You can also use a look-ahead and a look-behind for the checking:

m = re.search(r'(?<=>)\d{6}(?=<)', s)
l = m.group(0)

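This regex matches exactly 6 digits that are immediately preceded by a > and immediately followed by a <. Because the look-behind and look-ahead consume no characters, group(0) contains only the digits.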
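As a small sketch of how this behaves, assuming the same sample string as in the question: since the pattern has no capturing group, re.findall returns the whole matches, which here are just the digits. The second string (multi) is invented to show that a 7-digit run is skipped:

```python
import re

s = '<td style="width:40px;">705214</td>'

# The lookarounds assert '>' before and '<' after without consuming them.
m = re.search(r'(?<=>)\d{6}(?=<)', s)
if m is not None:
    print(m.group(0))  # 705214

# With no capturing group, findall returns the full matches.
multi = '<a>111111</a><b>2222222</b><c>333333</c>'
found = re.findall(r'(?<=>)\d{6}(?=<)', multi)
print(found)  # ['111111', '333333']
```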

answered Feb 24 '12 at 06:00

Sufian Latif

score 1 · Answer 4 · edited May 23 '17 at 12:20

You may want to allow for whitespace (tabs, spaces, newlines) between the tags and the digits. \s* matches zero or more whitespace characters.

import re

s = '<td style="width:40px;">\n\n705214\t\n</td>'
# \s* on each side absorbs the newlines and tab around the digits
m = re.search(r'>\s*(\d{6})\s*<', s)
print(m.groups())  # ('705214',)

Parsing HTML is a blast. Usually you treat the file as one long line and strip the leading and trailing whitespace from the values inside the tags. If you need to parse several columns, an HTML table parsing module may help.

See the Stack Overflow answer that uses lxml etree. html.parser was also suggested. Food for thought. (I'm still learning what modules Python has to offer :) )
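For what it's worth, here is a rough sketch of the html.parser route using only the standard library (Python 3 module name). The CellCollector class and the sample table are invented for illustration, and the sketch assumes <td> elements are not nested:

```python
import re
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the raw text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')  # start a new cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # text may arrive in several chunks

parser = CellCollector()
parser.feed('<table><tr><td style="width:40px;">\n705214\t</td>'
            '<td>name</td></tr></table>')
parser.close()

# Strip surrounding whitespace, then keep cells that are exactly 6 digits.
cells = [c.strip() for c in parser.cells]
six_digit = [c for c in cells if re.fullmatch(r'\d{6}', c)]
print(six_digit)  # ['705214']
```

The regex is still used, but only on the already-extracted cell text, so attributes and tag layout no longer matter.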
