I am working with a large data frame in R which includes a column containing the text of a number of tweets. Each value starts with "RT @(account which is retweeted): ", for example "RT @RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...". I need to change each value in this column so that it only contains the account name ("@RosannaXia"). How can I do this? I understand that I may be able to do this with gsub and regular expressions (a lookbehind and a lookahead), but when I tried the following lookahead code, it did not appear to do anything (and did not show an error):

Unnested_rts$rt_user <- gsub("[a-z](?=:)", "", Unnested_rts$rt_user, perl=TRUE)

Is there a better way to do this? I am not sure what went wrong, but I am still a very inexperienced coder. Any help would be greatly appreciated!

r2evans
  • Hi Josh, welcome to Stack Overflow! Your question should include a small dummy dataset (e.g. 5-10 rows of your data frame) and a sample of the desired output so that users can better understand your question and help you. If you don't know how to provide dummy data, consider searching for it; it is easy to find (sorry, I can't provide a link right now). – Pablo Herreros Cantis Jul 10 '21 at 01:51
  • Links: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info – r2evans Jul 10 '21 at 11:14

2 Answers

You can extract everything from the @ up to the first colon (:).

x <- "RT @RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet..."
sub('RT (@.*?):.*', '\\1', x)

#[1] "@RosannaXia"

For your case, it would be:

Unnested_rts$rt_user <- sub('RT (@.*?):.*', '\\1', Unnested_rts$rt_user)
Ronak Shah

A few things:

  • According to Twitter, a handle can include alphanumeric characters ([A-Za-z0-9]) and underscores, so your pattern needs to allow for all of these.
  • Your pattern needs to capture the handle, preserve it, and discard everything else. Since we don't always know how to match everything else, we'll stick with matching what we know and use .* on either side.
gsub(".*(@[A-Za-z0-9_]+)(?=:).*", "\\1", "RT @RosannaXia: Here’s some deep ocean wonder in case you want to explore a different corner of our planet...", perl=TRUE)
# [1] "@RosannaXia"
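As a small illustration of why the original attempt seemed to do nothing: the pattern in the question matches a single lowercase letter that is immediately followed by a colon, so gsub removes only that one letter and leaves the rest of the tweet in place. The sketch below uses a shortened version of the example tweet (with a plain apostrophe), and the variable name original_attempt is just for illustration.

```r
x <- "RT @RosannaXia: Here's some deep ocean wonder"

# The question's pattern: one lowercase letter, followed by (but not including) ":".
# Only that single letter is replaced with "", so the change is easy to miss.
original_attempt <- gsub("[a-z](?=:)", "", x, perl = TRUE)
original_attempt
# [1] "RT @RosannaXi: Here's some deep ocean wonder"
```

Nothing is captured and nothing else is matched, which is why the pattern needs both a capture group and something that consumes the rest of the string.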

Since you want this for the entire column, you can probably just do the following (assigning the result back to the column if you want to keep it):

gsub(".*(@[A-Za-z0-9_]+)(?=:).*", "\\1", Unnested_rts$rt_user, perl=TRUE)

The only catch is that if a match fails (the pattern is not found), the entire string is returned unchanged, which may not be what you want. If you want to extract only what was matched, there are several techniques that use gregexpr and regmatches, or perhaps stringr::str_extract.
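A minimal sketch of the extraction approach, assuming you want NA wherever no handle is found. It uses regexpr (first match only) rather than gregexpr, since each retweet should start with a single "RT @handle:" prefix. The sample vector tweets and the names pattern, m, and handles are made up for illustration.

```r
tweets <- c(
  "RT @RosannaXia: Here's some deep ocean wonder",
  "An original tweet with no retweet prefix"
)

# Match only the handle itself; the lookahead requires a following ":"
# without including it in the match.
pattern <- "@[A-Za-z0-9_]+(?=:)"

# regexpr() returns the start position of the first match, or -1 if none.
m <- regexpr(pattern, tweets, perl = TRUE)

# regmatches() returns only the matched substrings, so start from a vector
# of NA and fill in the positions that actually matched.
handles <- rep(NA_character_, length(tweets))
handles[m != -1] <- regmatches(tweets, m)
handles
# [1] "@RosannaXia" NA
```

For the column in the question, the same steps would apply with Unnested_rts$rt_user in place of tweets. If stringr is installed, stringr::str_extract(tweets, pattern) should give an equivalent result in one call.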

r2evans