I have a data frame in which one column holds the Twitter source, but right now it looks messy. To clean it, I would like to extract only the client name: Twitter for iPhone, Twitter for Android, etc.
So I want to extract all the text between ">" and "<".
Thank you
You can use sub with a backreference:
Data:
df <- data.frame(source = '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>')
Solution:
sub('.*nofollow"">(Twitter for \\w+\\b).*', '\\1', df$source)
Alternatively, you can use str_extract with a positive lookbehind and lookahead:
library(stringr)
str_extract(df$source, '(?<=nofollow"">)[\\w\\s]+(?=\\s</a>)')
Result:
[1] "Twitter for iPhone"
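Both patterns above rely on the literal text nofollow"" and, in the sub version, on the label starting with "Twitter for". If some rows carry other labels, a more general pattern may help. The following is only a sketch, assuming each value contains exactly one tag pair around the label; the second test string and the vector name x are made up for illustration and are not from the question.

```r
# Hypothetical inputs: the first is the string from the answer,
# the second is an invented example with a different label.
x <- c(
  '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>',
  '<a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>'
)

# Capture the run of characters that sits between a ">" and the next "<",
# then strip any leading or trailing whitespace.
labels <- trimws(sub('.*>([^<>]+)<.*', '\\1', x))
labels
```

Note that sub returns the input unchanged when the pattern does not match, so it is worth checking for values that still contain "<" afterwards.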
You can also use the strsplit function and pass it the "<" symbol as the split pattern.
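A minimal sketch of the strsplit approach, using the sample data from the other answer. It splits on both "<" and ">" (the pattern "[<>]") rather than on "<" alone, so that the label ends up in a piece of its own. It assumes every value has the same shape, one opening tag followed by the label and a closing tag, which makes the label the third piece. The column name client is just an illustrative choice.

```r
df <- data.frame(
  source = '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>',
  stringsAsFactors = FALSE  # keep the column as character so strsplit accepts it
)

# Split every string at each "<" or ">".
# Pieces: "" (before the first "<"), the tag contents, the label, "/a".
parts <- strsplit(df$source, "[<>]")

# Take the third piece of each row and trim surrounding whitespace.
df$client <- trimws(vapply(parts, function(p) p[3], character(1)))
df$client
```

If some rows contain more than one tag, the position of the label will differ, and a regular expression is probably the safer choice.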