How to use regex to parse a number from HTML?

Question

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:

Your number is <b>123</b>

Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

Relevant: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Endophage, Jun 23 '12 at 16:41
@Endophage: [meta-relevant](http://meta.stackexchange.com/questions/73133/regex-and-html-the-long-tail-annoys-me) — georg, Jun 23 '12 at 17:19
@thg435 Assuming most if not all problems on SO are small test examples for larger problems, very relevant. The op wants to parse html with regexes... Note I didn't link the rant, just the question. — Endophage, Jun 23 '12 at 17:23

score 66 · Accepted Answer · Yevgen Yampolskiy · 2012-06-23T16:56:45.217

import re
# Raw string so \d reaches the regex engine unchanged
m = re.search(r"Your number is <b>(\d+)</b>",
              "xxx Your number is <b>123</b>  fdjsk")
if m:
    print(m.group(1))  # first (and only) capturing group

edited Jun 23 '12 at 16:56 · answered Jun 23 '12 at 16:18 by Yevgen Yampolskiy

Sorry for not being clear enough. However, I used a slightly modified version that is working for me: re.search("Your number is (\[a-zA-Z_][a-zA-Z_0-9]*)",loginData) – Saqib Jun 24 '12 at 08:00

score 25 · Answer 2 · edited Jul 28 '22 at 14:22

Given s = "Your number is <b>123</b>" then:

import re 
m = re.search(r"\d+", s)

will work and give you

m.group()
'123'

The regular expression looks for 1 or more consecutive digits in your string.

Note that in this specific case we knew that there would be a numeric sequence. Otherwise you would have to test the return value of re.search() to make sure that m is a match object rather than None; calling m.group() on None raises an AttributeError.
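The check on the return value of re.search() can be sketched as follows. The helper name `extract_number` is hypothetical and not part of the original answer; it returns None instead of raising when the input has no digits.

```python
import re
from typing import Optional


def extract_number(s: str) -> Optional[str]:
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    # re.search returns None when nothing matches, so guard before .group()
    if m is None:
        return None
    return m.group()


print(extract_number("Your number is <b>123</b>"))  # 123
print(extract_number("no digits here"))             # None
```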

Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
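BeautifulSoup is a third-party package, so as a dependency-free illustration of the same idea (let a parser find the tag instead of a regex), here is a rough sketch using the standard library's html.parser module instead. The class `FirstBoldAfter` and the function `first_bold_after` are hypothetical names, and the sketch assumes the marker text arrives in a single text chunk.

```python
from html.parser import HTMLParser


class FirstBoldAfter(HTMLParser):
    """Collect the text of the first <b> element that follows a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False
        self.in_bold = False
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only the first <b> after the marker is of interest
        if tag == "b" and self.seen_marker and self.result is None:
            self.in_bold = True
            self.result = ""

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            self.result += data
        elif self.marker in data:
            self.seen_marker = True


def first_bold_after(html, marker="Your number is"):
    """Return the bold text after marker, or None if it is not found."""
    parser = FirstBoldAfter(marker)
    parser.feed(html)
    parser.close()
    return parser.result


print(first_bold_after("<b>5</b> Your number is <b>123</b> tail"))  # 123
```

Unlike a bare `\d+` search, this ignores digits and bold text that appear before the marker.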

edited Jul 28 '22 at 14:22 by Neuron · answered Jun 23 '12 at 16:15 by Levon

Why the downvote? This is functional and meets OP's requirements as far as I can tell. I am happy to correct any errors or improve my answer *if* given constructive feedback. However, downvotes ***without*** explanation don't help OP, SO or me. – Levon Jun 23 '12 at 16:32
Heh, we've all done it. As for the downvote, maybe someone wanted something more robust? Currently this would fail if there were any digits before the 123. – DSM Jun 23 '12 at 16:41
@DSM :-) .. yes, I agree, this is a narrow solution which is really pretty much just aimed at the specific problem posted. In this case testing the return value of `re.search()` wasn't necessary either, but that should also happen. – Levon Jun 23 '12 at 16:43
I don't think the OP wants numbers. Their requirements are quite clear: `contents of first bold text after string "Your number is"` – georg Jun 23 '12 at 17:27
@thg435 .. it says "`how can I extract 123,`" .. and "`..extracts a number from HTML"` .. that's what I did. Am I missing something? – Levon Jun 23 '12 at 17:29
@thg435 Sounds good. I just don't see the ambiguity or the other interpretation, isn't the first bold text after the string "Your number is" the number 123? We must be reading this very differently (I think all the other solutions also focused on getting 123). Yes, OP will let us know hopefully. – Levon Jun 23 '12 at 17:35

score 12 · Answer 3 · answered Jun 23 '12 at 16:20

import re
x = 'Your number is <b>123</b>'
# group(0) is the whole match, '<b>123</b>'; group(1) is just the digits
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(0)

This searches for the number that follows the 'Your number is' string.
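To make the difference between the whole match and the captured group concrete, here is a small sketch. The second pattern is a variation that is not in the original answer: it moves the opening tag into the fixed-width lookbehind and adds a lookahead for the closing tag, so the whole match is the digits alone.

```python
import re

x = 'Your number is <b>123</b>'

# With the tag outside the lookbehind, the whole match includes the markup.
m = re.search(r'(?<=Your number is )<b>(\d+)</b>', x)
whole, digits = m.group(0), m.group(1)

# With the tag inside the (fixed-width) lookbehind and a lookahead for the
# closing tag, the whole match is the digits alone.
m2 = re.search(r'(?<=Your number is <b>)\d+(?=</b>)', x)
only_digits = m2.group(0)

print(whole)        # <b>123</b>
print(digits)       # 123
print(only_digits)  # 123
```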

answered Jun 23 '12 at 16:20 by muffel

If you only want the 123, don't you want `.group(1)`? – DSM Jun 23 '12 at 16:43

score 5 · Answer 4 · edited Apr 15 '14 at 20:56

import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))

edited Apr 15 '14 at 20:56 by the Tin Man · answered Feb 17 '14 at 19:20 by Jacob Abraham

score 4 · Answer 5 · answered Jun 22 '16 at 10:45

The simplest way is to just extract the digits (the number):

re.search(r"\d+",text)

answered Jun 22 '16 at 10:45 by Avinash Kumar

score 2 · Answer 6 · edited Jul 07 '15 at 12:16

import re

val = "Your number is <b>123</b>"

Option 1:

m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)

m.group(2)  # the digits between the two tags

Option 2:

re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)  # replace the whole match with group 3

edited Jul 07 '15 at 12:16 by Nikolay Kostov · answered Jul 07 '15 at 11:55

score 2 · Answer 7 · edited Jul 11 '18 at 20:37

import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")

if found:
    print(found.group(1))

Here (\d+) is the capturing group. Since there is only one group, group(1) returns it (group(0) is the whole match). When there are several groups, pass the index of the group you want.
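As a sketch of how group indexing works on a match object: group(0) is the whole match, group(1) onward are the capturing groups in order, and groups() returns all captured groups as a tuple. The two-group pattern below is made up for illustration and is not from the original answer.

```python
import re

text = "something.... Your number is <b>123</b> something..."
m = re.search(r"Your (\w+) is <b>(\d+)</b>", text)

if m:
    print(m.group(0))  # Your number is <b>123</b>
    print(m.group(1))  # number
    print(m.group(2))  # 123
    print(m.groups())  # ('number', '123')
```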

edited Jul 11 '18 at 20:37 by Stypox · answered Jun 14 '18 at 12:24 by Sykam Sreekar Reddy

score 1 · Answer 8 · answered Nov 25 '19 at 12:31

To extract all matches as a Python list, you can use findall:

>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern, string)
['123']
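A short sketch of how findall behaves when the input holds more than one number. The sample string is invented for illustration. Without a capturing group, findall returns every full match; with exactly one capturing group, it returns only what that group captured.

```python
import re

html = 'Item 7: your number is <b>123</b>, your code is <b>456</b>'

# No capturing group: findall returns every full match as a string.
all_digits = re.findall(r'\d+', html)

# One capturing group: findall returns only what the group captured.
bold_digits = re.findall(r'<b>(\d+)</b>', html)

print(all_digits)   # ['7', '123', '456']
print(bold_digits)  # ['123', '456']
```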

answered Nov 25 '19 at 12:31 by Arun

score 0 · Answer 9 · edited Oct 04 '18 at 01:58

You can use the following example to solve your problem:

import re

text = "Your number is <b>123</b>"  # sample input
match = re.search(r"\d+", text)  # a match object, or None if there are no digits

if match:
    print("Matched number:", match.group(0))
    print("Starting index of digit:", match.start())
    print("Ending index of digit:", match.end())

edited Oct 04 '18 at 01:58 by Grant Miller · answered Oct 03 '18 at 21:03 by sadiq shah

score 0 · Answer 10 · edited May 17 '21 at 13:38

import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)

edited May 17 '21 at 13:38 by dbc · answered May 17 '21 at 13:20 by Anand K

Welcome to StackOverflow. Although this may answer the question, it would be useful to explain your code a bit. – Dominik May 17 '21 at 13:54
This is a correction to [@muffel’s answer](https://stackoverflow.com/a/11171094/3025856), and should acknowledge that source. – Jeremy Caney May 17 '21 at 16:15
