3

I have a list of strings containing Twitter text data, like the sample below (the real data set is much larger). I want to extract every user name after @ and every URL in the tweet text, for example galaxy5univ and the URL links.

   tweet_text = ['@galaxy5univ I like you',
    "RT @BestOfGalaxies: Let's sit under the stars ...",
    '@jonghyun__bot .........((thanks)',
    'RT @yosizo: thanks.ddddd <https://yahoo.com>',
    'RT @LDH_3_yui: #fam, ccccc https://msn.news.com']

my code:

import re
pu = re.compile(r'http\S+')   # URL: "http" followed by any non-whitespace
pn = re.compile(r'@(\S+)')    # name: any non-whitespace after @
for row in tweet_text:
   text = pu.findall(row)
   name = pn.findall(row)
   print("url: ", text)
   print("name: ", name)

After testing this code on a large amount of Twitter data, I found that both of my patterns, for URLs and for names, are wrong (although they work on a few tweets). Does anyone have documentation or links about extracting names and URLs from tweet text when the data set is large?

If you have any advice about extracting names and URLs from Twitter data, please tell me. Thanks!

tktktk0711
  • 1
    `pn = re.compile(r'@([a-zA-Z0-9_]+)')` – mic4ael Jun 14 '16 at 08:58
  • Thanks for your comment. There are a large number of names in the Twitter data, and sometimes a name includes special characters such as # % ^, not just a-zA-Z0-9_. How can I handle that case? – tktktk0711 Jun 14 '16 at 08:59
  • 1
    just add them to the list of characters inside the square brackets, but remember that some of the characters need to be properly escaped – mic4ael Jun 14 '16 at 09:00
  • Thanks for your comments, but then I would have to add every possible character inside the square brackets. What if I do not know which characters follow @? I hope there is an effective way to handle this (dropping the ":" at the end of the name). – tktktk0711 Jun 14 '16 at 09:09
  • You mean get all non-whitespace chars after `@` but not `:`? You can use `r'@([^\s:]+)'` – Wiktor Stribiżew Jun 14 '16 at 09:13
  • Yes, that is what I meant. I will try your suggestion. Thanks! – tktktk0711 Jun 14 '16 at 09:22
  • Please update the question body with actual requirements and test cases. Without that, it is impossible to help you. There are Twitter-related resources, like [Twitter mentions regex](https://github.com/regexhq/mentions-regex/blob/master/index.js) on the Web. However, your feedback proves you need something more flexible, thus, we need exact specifications to follow. – Wiktor Stribiżew Jun 14 '16 at 10:43
  • thanks for your advice. I will update my question. – tktktk0711 Jun 14 '16 at 10:48

2 Answers

5

Note that your pn = re.compile(r'@(\S+)') regex captures one or more non-whitespace characters after @, so a trailing : becomes part of the name.

To exclude :, convert the shorthand \S class to its negated character class equivalent, [^\s], and add : to it:

pn = re.compile(r'@([^\s:]+)')

Now, it will stop capturing non-whitespace symbols before the first :. See the regex demo.

If you need to capture until the last :, you can just add : after the capturing group: pn = re.compile(r'@(\S+):').
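
The difference between these patterns is easiest to see side by side. The following is a minimal sketch, not a definitive implementation; the variable names and the sample strings are assumptions made for illustration, and `@t:d-8:` is a made-up name containing a colon.

```python
import re

# Three candidate mention patterns discussed in this thread.
greedy = re.compile(r'@(\S+)')             # original: keeps a trailing ':'
stop_at_colon = re.compile(r'@([^\s:]+)')  # stops before the first ':'
up_to_last_colon = re.compile(r'@(\S+):')  # requires a ':' and backtracks to the last one

# Illustrative samples (assumed, not taken from real data).
samples = ["RT @yosizo: thanks", "@t:d-8: hello", "@galaxy5univ I like you"]

# Map each sample to the matches of the three patterns, in the order above.
results = {
    s: (greedy.findall(s), stop_at_colon.findall(s), up_to_last_colon.findall(s))
    for s in samples
}
for s, found in results.items():
    print(repr(s), found)
```

Note that the `@(\S+):` variant finds nothing in a mention that is not followed by a colon, so it may only suit data where every mention has the `RT @name:` form.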

As for a URL matching regex, there are many on the Web, just choose the one that works best for you.

Here is some example code:

import re

# Mentions: capture non-whitespace, non-colon characters after @
p = re.compile(r'@([^\s:]+)')
test_str = "@galaxy5univ I like you\nRT @BestOfGalaxies: Let's sit under the stars ...\n@jonghyun__bot .........((thanks)\nRT @yosizo: thanks.ddddd <https://yahoo.com>\nRT @LDH_3_yui: #fam, ccccc https://msn.news.com"
print(p.findall(test_str))
# => ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui']

# URLs: scheme, dotted host name, then an optional path/query part
p2 = re.compile(r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?')
print(p2.findall(test_str))
# => ['https://yahoo.com', 'https://msn.news.com']
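
To apply the same two patterns row by row, as the loop in the question does, something like the sketch below may work. The helper name `extract_mentions_and_urls` and the constant names are hypothetical, introduced only for this illustration; the regexes are the ones from this answer and may still need tuning for real data.

```python
import re

# Patterns copied from the answer above.
MENTION_RE = re.compile(r'@([^\s:]+)')
URL_RE = re.compile(
    r'(?:http|ftp|https)://(?:[\w_-]+(?:(?:\.[\w_-]+)+))'
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'
)

def extract_mentions_and_urls(tweets):
    """Hypothetical helper: return one (mentions, urls) pair per tweet."""
    return [(MENTION_RE.findall(t), URL_RE.findall(t)) for t in tweets]

# Sample data from the question.
tweet_text = ['@galaxy5univ I like you',
              "RT @BestOfGalaxies: Let's sit under the stars ...",
              '@jonghyun__bot .........((thanks)',
              'RT @yosizo: thanks.ddddd <https://yahoo.com>',
              'RT @LDH_3_yui: #fam, ccccc https://msn.news.com']

pairs = extract_mentions_and_urls(tweet_text)
for mentions, urls in pairs:
    print("name:", mentions, "url:", urls)
```

Compiling the patterns once at module level and reusing them keeps the per-tweet work to two `findall` calls, which matters when the list is large.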
Wiktor Stribiżew
  • I have now found that both of my patterns, for URL and for name, are wrong. Does anyone have documentation or links about extracting names and URLs from tweet text? – tktktk0711 Jun 14 '16 at 10:23
  • What is wrong about `@([^\s:]+)`? A regex for URL can be found anywhere. [Here](http://www.regexguru.com/2008/11/detecting-urls-in-a-block-of-text/) is a good resource. And here is an SO thread on [matching URLs in a larger text](http://stackoverflow.com/questions/6038061/regular-expression-to-find-urls-within-a-string). **See [this IDEONE demo](https://ideone.com/rgAy2K)**. – Wiktor Stribiżew Jun 14 '16 at 10:25
  • Thanks for your enthusiasm. For example, some names look like this: @t:* d-8:. Names in Twitter come in many different forms. – tktktk0711 Jun 14 '16 at 10:35
  • 1
    Excuse me, I have never seen user names with spaces. That means you need `@(.*):`, right? If not, please explain the *pattern* these user names fall into. If there is no pattern, it is not possible to match them. Also, here is a [link](https://github.com/regexhq/mentions-regex/blob/master/index.js) to a mentions regex used in a Twitter JS library (the pattern is compatible with Python). – Wiktor Stribiżew Jun 14 '16 at 10:36
  • Many thanks, @Wiktor Stribiżew, for your help. I will read the document you mentioned. You are a kind guy. – tktktk0711 Jun 14 '16 at 10:42
  • Thanks, @Wiktor Stribiżew. I will use your code to test a large amount of Twitter data and report back later. – tktktk0711 Jun 14 '16 at 10:47
  • Hi @Wiktor Stribiżew, I have tested your code. The names are extracted, but the URL result is empty: ['galaxy5univ', 'BestOfGalaxies', 'jonghyun__bot', 'yosizo', 'LDH_3_yui'] url: [] – tktktk0711 Jun 14 '16 at 15:18
  • Please use the URL regexps from the links I provided. Matching a URL is a long-solved issue. Or post the contents you think must be matched but that the URL regex in my answer does not fetch. – Wiktor Stribiżew Jun 14 '16 at 15:29
1

If the usernames don't contain special characters, you can use:

@([\w]+)

See Live demo
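
One caveat with a bare `@(\w+)` is that it also fires on the `@` inside e-mail addresses, and in Python 3 `\w` matches non-ASCII letters as well. The sketch below shows one possible tightening, assuming handles consist only of ASCII letters, digits and underscores; the names `plain` and `strict` and the sample string are illustrative assumptions.

```python
import re

# Pattern from this answer (redundant brackets dropped).
plain = re.compile(r'@(\w+)')

# Assumed tightening: the @ must not follow a word character, and \w is
# restricted to ASCII letters, digits and underscore via re.ASCII.
strict = re.compile(r'(?<!\w)@(\w+)', re.ASCII)

sample = "contact a@b.com or @yosizo"

plain_hits = plain.findall(sample)
strict_hits = strict.findall(sample)
print(plain_hits)   # includes the e-mail fragment
print(strict_hits)  # only the mention
```

Whether the stricter form is appropriate depends on the names that actually occur in the data.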

Thomas Ayoub
  • Thanks for your comment. I found that both of my patterns, for extracting the name after @ and the URL in tweet text, are wrong. Names and URLs come in many forms. If you have any documentation or links about this, please tell me! – tktktk0711 Jun 14 '16 at 10:40