
Given a tweet from Sina Weibo:

  tweet = "//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"

Note that there is a space between // and @诺什.

I want to get a list of retweeters, like this:

  result = ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']

I have been thinking about using the following script:

import re

RTpattern = r'''//?@(\w+)'''
rt = re.findall(RTpattern, tweet)

However, this fails to capture the Chinese name '魏武'.

Frank Wang

1 Answer


Use the re.UNICODE flag:

re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character 
properties database.

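# -*- coding: utf-8 -*-
import re
# The u prefix makes this a unicode string, so \w can match Chinese characters.
tweet = u"//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博"
RTpattern = r'''//?@(\w+)'''
for word in re.findall(RTpattern, tweet, re.UNICODE):
    print(word)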

# lilei
# Bob
# Girl
# 魏武
# MarkGreene
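
For readers on Python 3, here is a minimal sketch of the same idea, assuming the tweet is already a `str`. In Python 3, `\w` in a `str` pattern matches Unicode word characters by default, so `re.UNICODE` is redundant there, and `re.ASCII` reproduces the ASCII-only behaviour. The names `RT_PATTERN`, `names_default` and `names_ascii` are illustrative and not from the original post.

```python
import re

tweet = ("//@lilei: dd //@Bob: cc//@Girl: dd//@魏武: 利益所致 自然念念不忘"
         "// @诺什: 吸引优质  客户,摆脱屌丝男!!!//@MarkGreene: 转发微博")

RT_PATTERN = r"//?@(\w+)"

# Python 3 str patterns are Unicode-aware by default.
names_default = re.findall(RT_PATTERN, tweet)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the Chinese name is skipped.
names_ascii = re.findall(RT_PATTERN, tweet, re.ASCII)

print(names_default)  # ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene']
print(names_ascii)    # ['lilei', 'Bob', 'Girl', 'MarkGreene']
```

As in the question, '诺什' is not captured because of the space between `//` and `@`.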
root
  • Thank you. I get ['lilei', 'Bob', 'Girl', '\xe9', 'MarkGreene'], rather than ['lilei', 'Bob', 'Girl', '魏武', 'MarkGreene'] – Frank Wang Mar 31 '13 at 08:03
  • You have to make the tweet a `unicode` string (note the `u` prefix on the literal). If you already have a byte string, convert it first with `tweet = tweet.decode('utf-8')` – root Mar 31 '13 at 08:05
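
The decode step from the comment above can be sketched as follows. This is an illustrative Python 3 version, assuming the tweet arrives as UTF-8 encoded bytes (for example from a file or socket); the shortened sample text and the names `raw`, `text`, `names_from_bytes` and `names_from_text` are assumptions made for the example.

```python
import re

# Assumed input: UTF-8 encoded bytes rather than text.
raw = "//@Girl: dd//@魏武: 利益所致//@MarkGreene: 转发微博".encode("utf-8")

# A bytes pattern only treats ASCII letters, digits and _ as \w,
# so the multi-byte Chinese name is not matched.
names_from_bytes = re.findall(rb"//?@(\w+)", raw)

# Decoding first yields text, where \w is Unicode-aware.
text = raw.decode("utf-8")
names_from_text = re.findall(r"//?@(\w+)", text)

print(names_from_bytes)  # [b'Girl', b'MarkGreene']
print(names_from_text)   # ['Girl', '魏武', 'MarkGreene']
```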