I want to find a 6-digit number in my webpage:
<td style="width:40px;">705214</td>
My code is:
s = f.read()
m = re.search(r'\A>\d{6}\Z<', s)
l = m.group(0)
If you just want to find 6 digits between a > and a < symbol, use the following regex:
import re
s = '<td style="width:40px;">705214</td>'
m = re.search(r'>(\d{6})<', s)
l = m.groups()[0]
Note the use of parentheses ( and ) to denote a capturing group.
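As a minimal sketch of why the pattern in the question finds nothing: \A and \Z anchor to the start and end of the whole string, so a pattern wrapped in them cannot match digits in the middle of a tag. The example below also guards against re.search returning None before calling group; the variable name digits is only for illustration.

```python
import re

s = '<td style="width:40px;">705214</td>'

# \A and \Z anchor to the very start and end of the string,
# so the original pattern cannot match anything in this input.
original = re.search(r'\A>\d{6}\Z<', s)
print(original)  # None

# re.search returns None when nothing matches, so check before using the match.
m = re.search(r'>(\d{6})<', s)
digits = m.group(1) if m else None
print(digits)  # 705214
```

Without the None check, calling m.group(1) on a page that has no 6-digit cell raises AttributeError.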
I think you want something like this:
m = re.search(r'>(\d{6})<', s)
l = m.group(1)
The ( ) around \d{6} mark a capturing group, which m.group(1) returns.
If you want to find multiple instances of 6-digit substrings between > and < then try this:
s = '<tag1>111111</tag1> <tag2>222222</tag2>'
m = re.findall(r'>(\d{6})<', s)
In this case, m will be ['111111', '222222'].
You can also use a look-ahead and a look-behind for the checking:
m = re.search(r'(?<=>)\d{6}(?=<)', s)
l = m.group(0)
This regex matches 6 digits that are preceded by a > and followed by a <. Because look-arounds consume no characters, m.group(0) contains only the digits.
You may want to allow for whitespace (tabs, spaces, newlines) between the tags and the digits. \s* matches zero or more whitespace characters.
s = '<td style="width:40px;">\n\n705214\t\n</td>'
m = re.search(r'>\s*(\d{6})\s*<', s)
m.groups()  # ('705214',)
Parsing HTML is a blast. Usually you treat the file as one long line and remove the leading and trailing whitespace around the values inside the tags. Looking into an HTML table parsing module may help, especially if you need to parse several columns.
One Stack Overflow answer uses lxml etree; html.parser was also suggested. Food for thought. (Still learning what modules Python has to offer :) )
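For what it's worth, here is a rough sketch of the html.parser route using only the standard library. The CellCollector class and the sample table are made up for illustration and are not from the question; the idea is to collect the text of each td cell, strip the whitespace, and keep only the cells that are exactly 6 digits.

```python
import re
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Hypothetical helper: gathers the text found inside every <td> cell."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_td = True
            self.cells.append('')  # start a new cell

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_td = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self.in_td:
            self.cells[-1] += data


# Sample input (an assumption for this sketch), with whitespace around the number.
s = '<table><tr><td style="width:40px;">\n705214\t</td><td>abc</td></tr></table>'

parser = CellCollector()
parser.feed(s)
parser.close()

# Keep only cells whose stripped text is exactly six digits.
six_digit = [c.strip() for c in parser.cells if re.fullmatch(r'\d{6}', c.strip())]
print(six_digit)  # ['705214']
```

This is more code than a single regex, but it does not depend on the digits sitting directly between > and <, and it extends naturally to several columns.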