-5

The HTML is:

<div style="background-color:#A7A7A7;text-align:center;">
<span style="color:#FFFFFF;">{{rk_user.name}}のステータス</span>
</div>

My regular expression is:

a = r'''
<div style="background-color:#([a-z0-9]+);text-align:center;">
\s*<span style="color:#(.+?);">(.+)</span>
</div>
'''

But this regular expression does not match the HTML.

What is wrong?

Thanks

zjm1126

5 Answers

2

Do not use regular expressions to parse HTML. Please!

Use an HTML parser!
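As a minimal sketch of what that could look like, here is one way to pull the same three values out with the standard library's `html.parser`. The class name `StyleCollector` and its attributes are hypothetical, chosen only for illustration; a third-party parser would work just as well.

```python
from html.parser import HTMLParser

HTML = '''<div style="background-color:#A7A7A7;text-align:center;">
<span style="color:#FFFFFF;">{{rk_user.name}}のステータス</span>
</div>'''


class StyleCollector(HTMLParser):
    """Hypothetical helper: records style attributes and the text inside <span>."""

    def __init__(self):
        super().__init__()
        self.styles = []       # (tag, style) pairs in document order
        self.span_text = []    # text chunks seen while inside a <span>
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style")
        if style is not None:
            self.styles.append((tag, style))
        if tag == "span":
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self.span_text.append(data)


collector = StyleCollector()
collector.feed(HTML)
collector.close()

print(collector.styles)
print("".join(collector.span_text))
```

The colors can then be read out of the `style` strings with ordinary string handling, and the parser does not care about attribute case or whitespace between tags.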


Why not use regex, you ask?

Matt Ball
  • I would like to point out that BeautifulSoup is basically a bunch of regular expressions. This answer is still correct, but it adds an interesting perspective. – Henry May 17 '11 at 05:22
  • That's an extreme simplification. Many parsers for many languages can make use of regular expressions, but that's very different from a parser that consists of a big pile of regex. It is impossible to implement an HTML (or XML, for that matter) parser strictly using regular expressions, because HTML and XML are context-free, not regular, languages. Have you seen [this question](http://stackoverflow.com/questions/2400623/if-youre-not-supposed-to-use-regular-expressions-to-parse-html-then-how-are-htm)? – Matt Ball May 17 '11 at 13:16
  • Fair enough, 'basically a bunch' was dramatic. There is of course a great deal more to BeautifulSoup, and your answer to the question on this page shows in depth how regular expressions on their own are not suitable. – Henry May 17 '11 at 14:25
1

You should make the regex case insensitive because the color is #A7A7A7 and you're matching #a7a7a7.

You can test it on sites such as http://regexpal.com/
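A quick sketch of the difference, using the pattern from the question. One assumption here: the leading and trailing newlines of the original triple-quoted string are dropped, since they would also have to be present around the HTML for the pattern to match.

```python
import re

html = '''<div style="background-color:#A7A7A7;text-align:center;">
<span style="color:#FFFFFF;">{{rk_user.name}}のステータス</span>
</div>'''

# Same pattern as in the question, minus the surrounding blank lines.
pattern = r'''<div style="background-color:#([a-z0-9]+);text-align:center;">
\s*<span style="color:#(.+?);">(.+)</span>
</div>'''

strict = re.search(pattern, html)                   # [a-z0-9] cannot match "A7A7A7"
relaxed = re.search(pattern, html, re.IGNORECASE)   # case-insensitive match

print(strict)
print(relaxed.groups())
```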

JBernardo
0

At the very least, you have a case-sensitivity problem in the color. Plus, you might want to meditate on BoltClock's comment.

Michael Lorton
0

Like @BoltClock mentions, it is not recommended to use regex like this. If not now, sometime down the line you will regret it. There are lots of corner cases which will make the regex complex and also plain useless at times.

Anyway, at a cursory glance: for background-color you have used [a-z0-9], which only matches lowercase letters, but the sample has uppercase. You may want to allow uppercase as well: [a-zA-Z0-9]. Why not use the same class for the other color too, instead of (.+?)?
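Roughly, that suggestion could look like the sketch below. The name `COLOR` and the use of `\s*` between tags (instead of literal newlines) are my own choices for illustration, not part of the original pattern.

```python
import re

# Explicit upper- and lowercase class, reused for both colors.
COLOR = r'#([a-zA-Z0-9]+)'

pattern = re.compile(
    r'<div style="background-color:' + COLOR + r';text-align:center;">\s*'
    r'<span style="color:' + COLOR + r';">(.+?)</span>\s*'
    r'</div>'
)

html = '''<div style="background-color:#A7A7A7;text-align:center;">
<span style="color:#FFFFFF;">{{rk_user.name}}のステータス</span>
</div>'''

match = pattern.search(html)
print(match.groups())
```

Because the class lists both cases explicitly, the same pattern also matches markup that writes the colors in lowercase.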

manojlds
0

In addition to what many other people are saying, you may want to use the re.UNICODE flag, since it looks like you have some Japanese characters in there.
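A small sketch of what the flag changes, assuming Python 3, where `str` patterns are already Unicode-aware and `re.UNICODE` is effectively the default. The flag affects classes such as `\w`; the `.+` groups in the question's pattern match Japanese characters either way.

```python
import re

text = "{{rk_user.name}}のステータス"

# \w is Unicode-aware under re.UNICODE (the Python 3 default for str patterns).
unicode_words = re.findall(r"\w+", text, re.UNICODE)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Japanese run is skipped.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)
print(ascii_words)
```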

icktoofay