Using Regex for extracting Usernames from Twitter Data

Question

I am trying to extract usernames from Twitter text using regex. However, even though the pattern looks right to me, the value returned is None for every row, which should not be the case. I cannot see what is wrong with my code. I am using JupyterLab.

The sample text is a pd.Series named full_text:

0    RT @SeamusHughes: The Taliban Stamp of approva...
1    RT @WFaqiri: Taliban and Afghan groups find co...
2    RT @DavidCornDC: Imagine what Fox News would h...
3    RT @DavidCornDC: Imagine what Fox News would h...
4    RT @billroggio: Even if you are inclined to tr...
5    RT @billroggio: I am sure we will hear the arg...
6    RT @KFILE: This did happen and it went exactly...
Name: full_text, dtype: object

My function is defined as follows:

import re

def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m

And I apply the function as follows:

full_text.apply(extract_user)

But the values that I get in return are as follows:

0        None
1        None
2        None
3        None
4        None
         ... 
21299    None
21300    None
21301    None
21302    None
21303    None
Name: full_text, Length: 21304, dtype: object

almost same syntax with `pandas`: `full_text.str.match("RT\s@\w+:")` — Quang Hoang, Mar 05 '20 at 04:25

moys · Answer 1 · 2020-03-05T04:35:10.273

1

You can do this much more simply with the code below:

df.A.str.extract(r"(@\w+)")  # A is the column name

Output

    0
0   @SeamusHughes
1   @WFaqiri
2   @DavidCornDC
3   @DavidCornDC
4   @billroggio
5   @billroggio
6   @KFILE

If you want only the names and not the @ symbol, move the @ outside the capturing group: df.A.str.extract(r"@(\w+)")

Output

    0
0   SeamusHughes
1   WFaqiri
2   DavidCornDC
3   DavidCornDC
4   billroggio
5   billroggio
6   KFILE
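
As a side note on the two pandas string methods mentioned in this thread, here is a minimal sketch of how they differ. The three-row full_text Series is made-up sample data (the last row is not from the question), and the names users and is_retweet are just for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the question's full_text Series.
full_text = pd.Series(
    [
        "RT @SeamusHughes: The Taliban Stamp of approva...",
        "RT @KFILE: This did happen and it went exactly...",
        "A tweet that is not a retweet",
    ],
    name="full_text",
)

# With a single capturing group, expand=False gives back a Series
# instead of a one-column DataFrame. Rows without a match become NaN.
users = full_text.str.extract(r"RT\s@(\w+):", expand=False)

# str.match only reports whether the pattern matches at the start of
# each string; it does not return the matched text.
is_retweet = full_text.str.match(r"RT\s@\w+:")

print(users)
print(is_retweet)
```

So str.match is handy for filtering retweets, while str.extract is the one to use when you need the username itself.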

edited Mar 05 '20 at 04:35

answered Mar 05 '20 at 04:25

moys


thanks! all your codes are terrific, but I would be highly obliged if someone can point out the mistake in my code. – ambrish dhaka Mar 05 '20 at 04:27
In your code `re.search(r"RT\s@\w+:", text)` returns the search object, not the value. Print `m` & print `m.group()` to see the difference. – moys Mar 05 '20 at 04:32
@ambrishdhaka (1) using `fulltext.str` instead of `fulltext` and then (2) using a capturing group to grab the actual text. – David542 Mar 05 '20 at 04:32
@moys I got to learn this from you `m = re.search(r"RT\s@\w+:", text) print(m.group()[4:-1])` thanks. – ambrish dhaka Mar 05 '20 at 04:48
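
To make the difference discussed in these comments concrete, here is a small sketch using only the re module. The variable names (whole, sliced, captured) are made up for illustration; the sample string is the first row from the question.

```python
import re

text = "RT @SeamusHughes: The Taliban Stamp of approva..."

m = re.search(r"RT\s@\w+:", text)  # a Match object (or None if nothing matches)
whole = m.group()                  # the full matched text
sliced = whole[4:-1]               # drop the leading "RT @" and the trailing ":"

# A capturing group pulls out the same name without counting characters.
captured = re.search(r"RT\s@(\w+):", text).group(1)

print(m)         # the Match object itself
print(whole)
print(sliced)
print(captured)
```

The slice works here because the "RT @" prefix is always four characters long, but the capturing group keeps working if the pattern around the name changes.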

score 1 · Accepted Answer · answered Mar 05 '20 at 04:25

How about using a lambda function with a capturing group, like this:

>>> df[0].apply(lambda text: re.search(r'RT\s@([^:]+)',text).group(1))
0    SeamusHughes
1         WFaqiri
2     DavidCornDC
3     DavidCornDC
4      billroggio
5      billroggio
6           KFILE

Putting it all together for completeness:

import re
import pandas as pd

data = [
    ['RT @SeamusHughes: The Taliban Stamp of approva...'],
    ['RT @WFaqiri: Taliban and Afghan groups find co...'],
    ['RT @DavidCornDC: Imagine what Fox News would h...'],
    ['RT @DavidCornDC: Imagine what Fox News would h...'],
    ['RT @billroggio: Even if you are inclined to tr...'],
    ['RT @billroggio: I am sure we will hear the arg...'],
    ['RT @KFILE: This did happen and it went exactly...'],
]
df = pd.DataFrame(data)
# Capture everything between "RT @" and the first colon
df[0].apply(lambda text: re.search(r'RT\s@([^:]+)', text).group(1))

# 0    SeamusHughes
# 1         WFaqiri
# 2     DavidCornDC
# 3     DavidCornDC
# 4      billroggio
# 5      billroggio
# 6           KFILE
# Name: 0, dtype: object
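
One caveat with the lambda above: re.search returns None when a tweet has no retweet prefix, and calling .group(1) on None raises AttributeError. Below is a minimal sketch of a None-safe variant. The names RT_USER and extract_user_safe are made up for illustration, the last sample string is not from the question, and the pattern uses \w+ from the question rather than [^:]+.

```python
import re
from typing import Optional

# Hypothetical name for the compiled pattern; the group captures the username.
RT_USER = re.compile(r"RT\s@(\w+):")


def extract_user_safe(text: str) -> Optional[str]:
    """Return the retweeted username, or None when there is no retweet prefix."""
    m = RT_USER.search(text)
    return m.group(1) if m else None


samples = [
    "RT @billroggio: Even if you are inclined to tr...",
    "RT @KFILE: This did happen and it went exactly...",
    "Just a regular tweet with no retweet prefix",
]
results = [extract_user_safe(t) for t in samples]
print(results)
```

The same function can be passed straight to pandas, for example df[0].apply(extract_user_safe), so rows that are not retweets end up as missing values instead of stopping the whole apply.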

thanks! anonymous functions are great, but the regex string you have used should also work in my code. I tried it by replacing with yours, but no success. — ambrish dhaka, Mar 05 '20 at 04:33

Ukrainian-serge · Answer 3 · 2020-03-05T04:53:50.137

This happens because your function (extract_user) returns match objects rather than strings:

0    <re.Match object; span=(5, 22), match='RT @Sea...
1    <re.Match object; span=(5, 17), match='RT @WFa...
2    <re.Match object; span=(5, 21), match='RT @Dav...
3    ...

Now, I'm no expert, so take this with a grain of salt, but my guess is that pandas doesn't have a dtype to handle the <re.Match> objects your function returns, and so it handles them with None. Check out this great answer if you want to dive deeper into handled dtypes.

So, assuming you want to keep your approach the same with minimal changes, here is your function modified to return the matched text (m[0], which is equivalent to m.group(0)) instead of the <re.Match> object itself.

import re

def extract_user(text):
    m = re.search(r"RT\s@\w+:", text)
    return m[0] if m else None         # <-- here: matched text, or None if no match

stuff = df.iloc[:, 0].apply(extract_user)

print(stuff)

0    RT @SeamusHughes:
1         RT @WFaqiri:
2     RT @DavidCornDC:
3     RT @DavidCornDC:
4      RT @billroggio:
5      RT @billroggio:
6           RT @KFILE:

Hope that clarifies things.
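
The guess about dtypes earlier in this answer is easy to check directly. Here is a small sketch, assuming a recent pandas version; the two-row Series s is made-up sample data. As far as I can tell, pandas should keep the Match objects in an object-dtype Series rather than replacing them, which would suggest the None values in the question come from rows where the pattern did not match.

```python
import re
import pandas as pd

# Hypothetical two-row sample: one retweet, one plain tweet.
s = pd.Series(
    [
        "RT @KFILE: This did happen and it went exactly...",
        "no retweet prefix here",
    ]
)

# Same shape as the question's function: return the Match object itself.
matches = s.apply(lambda text: re.search(r"RT\s@\w+:", text))

print(matches.dtype)          # dtype of the resulting Series
print(type(matches.iloc[0]))  # type of the element for the matching row
print(matches.iloc[1])        # element for the row with no match
```

If the first element prints as a re.Match type, the Series is storing the objects as-is, and the fix is still the same: return the matched text instead of the Match object.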

thanks! I figured out that too, then I used slicing as `return m.group()[4:-1]`. — ambrish dhaka, Mar 05 '20 at 04:53
