Get substring from twitter source in R

Question

I have a data frame in which one column holds the Twitter source. Right now it looks messy, so to clean it I would like to extract only the client name: Twitter for iPhone, Twitter for Android, etc.

So I want to extract all the text between ">" and "<".

Thank you

We’d love to help you. To improve your chances of getting an answer, please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, [here](https://stackoverflow.com/help/how-to-ask) are some tips on how to ask a good question. – Eric, May 31 '20 at 18:35
You should have a look at `html_text` from the `rvest` package to extract the text from HTML. – denis, Jun 01 '20 at 20:28

score 1 · Answer 1 · answered May 31 '20 at 21:10

You can use `sub` with a backreference:

Data:

df <- data.frame(source = '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>')

Solution:

sub('.*nofollow"">(Twitter for \\w+\\b).*', '\\1', df$source)

Alternatively, you can use `str_extract` with a positive lookbehind and lookahead:

library(stringr)
str_extract(df$source, '(?<=nofollow"">)[\\w\\s]+(?=\\s</a>)')

Result:

[1] "Twitter for iPhone"
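
Both patterns above depend on the literal `nofollow` attribute, and the first one also on the "Twitter for" prefix. If other clients or attributes can appear, a more general base R sketch is to capture whatever sits between the first ">" and the next "<". The vector `src` below is hypothetical sample data (the first element is the string from this answer), and this assumes each value contains a single tag pair:

```r
# Hypothetical sample data; the first element is the string used in this answer.
src <- c(
  '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>',
  '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'
)

# ^[^>]*>   skip everything up to and including the first ">"
# ([^<]*)   capture the text up to the next "<"
# <.*$      drop the closing tag
# trimws() removes the stray space before "</a>" when there is one.
client <- trimws(sub('^[^>]*>([^<]*)<.*$', '\\1', src))
client
```

If a value contains no ">" at all, `sub` leaves it unchanged, so such rows may need a separate check.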

score 0 · Answer 2 · answered May 31 '20 at 18:19

You can use the `strsplit` function and pass it the "<" symbol.

Agata

First off, this should be `str_split` (with the underscore) and the result is not the desired output: `str_split(df$source, "<") [[1]] [1] "" [2] "ref=\"\"http://twitter.com/download/iphone\"rel=\"\"nofollow\"\">Twitter for iPhone " [3] "/a>"` – Chris Ruehlemann May 31 '20 at 21:17
It works by splitting a string on the separator that you pass it. E.g. if you have "example-test" and you call strsplit(string,"-", fixed=T), it returns two values, "example" and "test". You can also apply it repeatedly, that is, call it on the output of a previous strsplit. – Agata Jun 01 '20 at 08:41
Agata, please provide a concrete example relevant to the OP's use case, just as @ChrisRuehlemann has done. This will make your answer more helpful to the OP. – Limey Jun 01 '20 at 09:14
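
For reference, `strsplit` is a base R function and `str_split` is its counterpart in the stringr package, so either name can be used. A rough sketch of the two-step split described in the comments, applied to the sample string from Answer 1, might look like the following. The helper name `extract_client` is made up for illustration, and the code assumes every value has the shape `<...>text</a>`:

```r
# Sample string taken from Answer 1.
src <- '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>'

# Hypothetical helper: split on "<", then split the second piece on ">".
extract_client <- function(x) {
  # First split: "", "<tag attributes>text ", "/a>"
  pieces <- strsplit(x, "<", fixed = TRUE)[[1]]
  # Second split on the piece that holds the opening tag and the text.
  inner <- strsplit(pieces[2], ">", fixed = TRUE)[[1]]
  # The text follows the ">" that closes the opening tag.
  trimws(inner[2])
}

# vapply applies the helper to every element of a character vector.
client <- vapply(src, extract_client, character(1), USE.NAMES = FALSE)
client
```

This is more fragile than the regex solutions, since it relies on the position of each piece after splitting.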
