Python 2: get the number between Chinese characters from a urllib2 response

Question

I want to extract the number 805 from inside an HTML tag on a web page that I fetch with urllib2.

<span class="count">(共805张)</span>

Here is the python code I wrote to get the number:

import re
import urllib2

url = "https://movie.douban.com/celebrity/1044996/photos/"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; the original headers are not shown
request = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(request)
content = response.read().decode('utf-8')
pattern1 = re.compile(r'<span\sclass="count">(.*?)</', re.S)
result1 = re.search(pattern1, content)
total_num = result1.group(1)

But when I print total_num, the console shows:

u'(\u5171805\u5f20)'

How can I get just the number 805 using a regular expression?

Chiheb Nexus · Answer 1 · 2017-06-09T07:23:09.550

If your HTML tag always has this form:

<span class="count">(共805张)</span>

which means the number sits between two non-Latin characters, all wrapped in '(' and ')', then you can use this pattern:

# -*- coding: utf-8 -*-
import re
a = '<span class="count">(共805张)</span>'
# This will work if a is unicode,
# or a string in an encoding where ASCII
# occupies values 0 to 0x7F (latin-1, UTF-8, etc.)
final = re.findall(r'\([^\x00-\x7F]+(\d+)[^\x00-\x7F]+\)', a)

print final

Output:

['805']

PS: Credit to this answer with some modifications.
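As a minimal sketch of the same idea applied to the decoded (unicode) text shown in the question, the pattern can be compiled once and the match converted to an integer. The helper name extract_count is hypothetical, and the sample string is simply the value the question printed; this is one way to wrap it, not the only one.

```python
import re

# '(' + run of non-ASCII characters + digits + run of non-ASCII characters + ')'
COUNT_RE = re.compile(r'\([^\x00-\x7F]+(\d+)[^\x00-\x7F]+\)')

def extract_count(text):
    """Return the captured number as an int, or None if the pattern is absent."""
    match = COUNT_RE.search(text)
    return int(match.group(1)) if match else None

# The value printed in the question, written with escapes so the source stays ASCII.
sample = u'(\u5171805\u5f20)'
print(extract_count(sample))  # 805
```

Returning None instead of raising keeps the caller in control when the page layout changes and the span is missing.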

score 0 · Answer 2 · answered Jun 09 '17 at 07:12

Try replacing your regex with this one:

pattern1 = re.compile(r'<span\sclass="count">[^<\d]*(\d+)[^<\d]*</', re.S)

This way, the group will only match the number, and not the other characters around it.
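For illustration, here is a small self-contained check of that pattern. The content string is a stand-in for the decoded page (in the question it comes from response.read().decode('utf-8')), and the int conversion is an assumption about what the asker wants to do with the result.

```python
import re

pattern1 = re.compile(r'<span\sclass="count">[^<\d]*(\d+)[^<\d]*</', re.S)

# Stand-in for the decoded page content; escapes keep the source ASCII-only.
content = u'<div><span class="count">(\u5171805\u5f20)</span></div>'

result1 = pattern1.search(content)
# Guard against a missing span before reading the group.
total_num = int(result1.group(1)) if result1 else None
print(total_num)  # 805
```

The two [^<\d]* parts skip everything around the digits without crossing into the next tag, so only the number ends up in group 1.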

julienc
