52

I want to write a simple regular expression in Python that extracts a number from HTML. The HTML sample is as follows:

Your number is <b>123</b>

Now, how can I extract "123", i.e. the contents of the first bold text after the string "Your number is"?

Gino Mempin
Saqib
  • Is the text "Your number is" actually inside any tags? – Jon Clements Jun 23 '12 at 16:41
  • 4
    Relevant: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Endophage Jun 23 '12 at 16:41
  • 1
    @Endophage: [meta-relevant](http://meta.stackexchange.com/questions/73133/regex-and-html-the-long-tail-annoys-me) – georg Jun 23 '12 at 17:19
  • @thg435 Assuming most if not all problems on SO are small test examples for larger problems, very relevant. The op wants to parse html with regexes... Note I didn't link the rant, just the question. – Endophage Jun 23 '12 at 17:23
  • 2
    I suggest to use lxml to parse HTML – 18bytes Jun 25 '12 at 12:18

10 Answers

66
import re
m = re.search(r"Your number is <b>(\d+)</b>",
      "xxx Your number is <b>123</b>  fdjsk")
if m:
    print(m.group(1))  # first captured group: '123'
Yevgen Yampolskiy
  • 2
    Sorry for not being clear enough, However I used a slightly modified version that is working for me. re.search("Your number is (\[a-zA-Z_][a-zA-Z_0-9]*)",loginData) – Saqib Jun 24 '12 at 08:00
25

Given s = "Your number is <b>123</b>" then:

import re 
m = re.search(r"\d+", s)

will work and give you

m.group()
'123'

The regular expression looks for 1 or more consecutive digits in your string.

Note that in this specific case we know the string contains a sequence of digits. In general you should test the return value of re.search(): if nothing matches it returns None, and calling m.group() on it raises an AttributeError.
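
A minimal sketch of that check. The helper name `first_number` is hypothetical and only used for illustration:

```python
import re

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = re.search(r"\d+", s)
    # re.search returns None when nothing matches, so test before calling group()
    return m.group() if m else None

print(first_number("Your number is <b>123</b>"))  # 123
print(first_number("no digits here"))             # None
```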

Of course if you are going to process a lot of HTML you want to take a serious look at BeautifulSoup - it's meant for that and much more. The whole idea with BeautifulSoup is to avoid "manual" parsing using string ops or regular expressions.
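
As a rough illustration of the parser-based approach, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup, so that it runs without extra installs. The class `FirstBoldAfter` and the function `first_bold_after` are hypothetical names, and the sketch assumes the marker text sits in a single text node rather than being split across tags:

```python
from html.parser import HTMLParser

class FirstBoldAfter(HTMLParser):
    """Collect the text of the first <b> element that follows a marker phrase."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.seen_marker = False
        self.in_bold = False
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting once the marker text has been seen
        if tag == "b" and self.seen_marker and self.result is None:
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == "b" and self.in_bold:
            self.in_bold = False
            self.result = "".join(self.parts)

    def handle_data(self, data):
        if self.in_bold:
            self.parts.append(data)
        elif self.marker in data:
            self.seen_marker = True

def first_bold_after(html, marker):
    parser = FirstBoldAfter(marker)
    parser.feed(html)
    parser.close()
    return parser.result

html = "<p>Ignore <b>999</b>. Your number is <b>123</b></p>"
print(first_bold_after(html, "Your number is"))  # 123
```

Unlike a bare `\d+` search, this skips the earlier `<b>999</b>` because it appears before the marker text.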

Neuron
Levon
  • 2
    Why the downvote? This is functional and meets OP's requirements as far as I can tell. I am happy to correct any errors or improve my answer *if* given constructive feedback. However, downvotes ***without*** explanation don't help OP, SO or me. – Levon Jun 23 '12 at 16:32
  • 1
    Heh, we've all done it. As for the downvote, maybe someone wanted something more robust? Currently this would fail if there were any digits before the 123. – DSM Jun 23 '12 at 16:41
  • @DSM :-) .. yes, I agree, this is a narrow solution which is really pretty much just aimed at the specific problem posted. In this case testing the return value of `re.search()` wasn't necessary either, but that should also happen. – Levon Jun 23 '12 at 16:43
  • 1
    I don't think the OP wants numbers. Their requirements are quite clear: `contents of first bold text after string "Your number is"` – georg Jun 23 '12 at 17:27
  • 1
    @thg435 .. it says "`how can I extract 123,`" .. and "`..extracts a number from HTML"` .. that's what I did. Am I missing something? – Levon Jun 23 '12 at 17:29
  • @thg435 Sounds good. I just don't see the ambiguity or the other interpretation, isn't the first bold text after the string "Your number is" the number 123? We must be reading this very differently (I think all the other solutions also focused on getting 123). Yes, OP will let us know hopefully. – Levon Jun 23 '12 at 17:35
12
import re
x = 'Your number is <b>123</b>'
re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)

This searches for the number that follows the string 'Your number is'. group(1) returns just the captured digits; group(0) would return the whole match, '<b>123</b>'.
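
To see what the lookbehind contributes and which group holds the digits, here is a small sketch. The second pattern is a variation for illustration, not part of the answer above:

```python
import re

x = 'Your number is <b>123</b>'

# The lookbehind (?<=...) must match just before the current position,
# but the text it matches is not included in the result.
m = re.search(r'(?<=Your number is )<b>(\d+)</b>', x)
print(m.group(0))  # <b>123</b>
print(m.group(1))  # 123

# Variation: put the opening tag in the lookbehind and the closing tag
# in a lookahead, so the whole match is just the digits.
m2 = re.search(r'(?<=Your number is <b>)\d+(?=</b>)', x)
print(m2.group(0))  # 123
```

Python's `re` module requires lookbehind patterns to have a fixed width, which holds for both literal prefixes used here.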

muffel
5
import re
print(re.search(r'(\d+)', 'Your number is <b>123</b>').group(0))
the Tin Man
Jacob Abraham
4

The simplest way is to just extract the digits (the number):

re.search(r"\d+", text).group()
2
import re
val = "Your number is <b>123</b>"

Option 1:

m = re.search(r'(<.*?>)(\d+)(<.*?>)', val)

m.group(2)

Option 2:

re.sub(r'([\s\S]+)(<.*?>)(\d+)(<.*?>)', r'\3', val)
Nikolay Kostov
2
import re
found = re.search(r"Your number is <b>(\d+)</b>", "something.... Your number is <b>123</b> something...")

if found:
    print(found.group(1))

Here (\d+) is a capturing group. Since there is only one group, group(1) returns it; group(0) (or group() with no argument) returns the whole match. When there are several groups, pass the index of the group you want.
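
A short sketch of how group indexes behave when a pattern has more than one capturing group. The extra `(\w+)` group is added only for illustration:

```python
import re

text = "something.... Your number is <b>123</b> something..."
found = re.search(r"Your (\w+) is <b>(\d+)</b>", text)
if found:
    print(found.group(0))  # Your number is <b>123</b>
    print(found.group(1))  # number
    print(found.group(2))  # 123
    print(found.groups())  # ('number', '123')
```

Group 0 is always the whole match; capturing groups are numbered from 1 in the order of their opening parentheses.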

Stypox
1

To get all matches as a Python list, you can use findall:

>>> import re
>>> string = 'Your number is <b>123</b>'
>>> pattern = r'\d+'
>>> re.findall(pattern,string)
['123']
>>>
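
A small sketch of how `findall` behaves when the text contains more than one number. The sample string `html` is made up for illustration:

```python
import re

html = 'Order 42: Your number is <b>123</b>, backup <b>456</b>'

# Every run of digits, wherever it appears
all_numbers = re.findall(r'\d+', html)
print(all_numbers)   # ['42', '123', '456']

# With one capturing group, findall returns only the captured text,
# so this keeps just the numbers wrapped in <b>...</b>
bold_numbers = re.findall(r'<b>(\d+)</b>', html)
print(bold_numbers)  # ['123', '456']
```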
Arun
0

You can use the following example to solve your problem:

import re

text = "Your number is <b>123</b>"
match = re.search(r"\d+", text)  # match object for the first run of digits

print("Matched number:", match.group(0))

print("Starting index of digit:", match.start())

print("Ending index of digit:", match.end())
Grant Miller
sadiq shah
0
import re
x = 'Your number is <b>123</b>'
output = re.search(r'(?<=Your number is )<b>(\d+)</b>', x).group(1)
print(output)
dbc
Anand K
  • 1
    Welcome to StackOverflow. Although this may answer the question, it would be useful to explain your code a bit. – Dominik May 17 '21 at 13:54
  • This is a correction to [@muffel’s answer](https://stackoverflow.com/a/11171094/3025856), and should acknowledge that source. – Jeremy Caney May 17 '21 at 16:15