
I have a data frame in which one column holds the Twitter source, but right now it looks messy. To clean it, I would like to extract only the client name: Twitter for iPhone, Twitter for Android, etc.

So I want to extract all the text between ">" and "<".

Thank you

Moniek
  • We’d love to help you. To improve your chances of getting an answer, please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Also, [here](https://stackoverflow.com/help/how-to-ask) are some tips on how to ask a good question. – Eric May 31 '20 at 18:35
  • You should have a look at `html_text` from the `rvest` package to extract the text from HTML. – denis Jun 01 '20 at 20:28

2 Answers


You can use `sub` with a backreference:

Data:

df <- data.frame(source = '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>')

Solution:

sub('.*nofollow"">(Twitter for \\w+\\b).*', '\\1', df$source)

Alternatively, you can use `str_extract` with a positive lookbehind and a positive lookahead:

library(stringr)
str_extract(df$source, '(?<=nofollow"">)[\\w\\s]+(?=\\s</a>)')

Result:

[1] "Twitter for iPhone"
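
Both patterns above are tied to the `nofollow` attribute and the `Twitter for ...` wording. For sources that look different, a more general base R sketch is to capture whatever sits between a ">" and the next "<". This assumes each value holds a single tag pair, as in the sample data; `extract_label` is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: return the text between ">" and the following "<".
# Assumes one tag pair per string, as in the sample data.
extract_label <- function(x) {
  # Greedy ".*>" backtracks to the last ">" that still has a "<" after it
  label <- sub(".*>([^<]*)<.*", "\\1", x)
  trimws(label)  # drop the trailing space before "</a>"
}

df <- data.frame(source = '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>')
df$source_clean <- extract_label(df$source)
df$source_clean
```

Note that `sub` returns the input unchanged when the pattern does not match, so values without a tag pair pass through as they are.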
Chris Ruehlemann

You can use the `strsplit` function and pass it the "<" symbol.
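
As a rough illustration of the splitting idea (a sketch, not necessarily what the answer had in mind), one option is to split on both "<" and ">" at once and pick the piece that holds the label. The position `3` is an assumption that only holds for strings shaped like the sample data.

```r
x <- '<ref=""http://twitter.com/download/iphone"rel=""nofollow"">Twitter for iPhone </a>'

# Split on either "<" or ">"; strsplit returns a list with one vector per input
parts <- strsplit(x, "[<>]")

# For this input the pieces are: "", the tag attributes, the label, "/a"
# Assumption: the label is always the third piece
label <- trimws(vapply(parts, function(p) p[3], character(1)))
label
```

Because the result depends on a fixed position, this is more fragile than a regular expression that captures the text directly.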

Agata
  • First off, this should be `str_split` (with the underscore), and the result is not the desired output: `str_split(df$source, "<") [[1]] [1] "" [2] "ref=\"\"http://twitter.com/download/iphone\"rel=\"\"nofollow\"\">Twitter for iPhone " [3] "/a>"` – Chris Ruehlemann May 31 '20 at 21:17
  • It works by splitting a string on the separator that you pass it. E.g. if you have "example-test" and you call `strsplit(string, "-", fixed=T)`, it returns two values, "example" and "test". You can also apply it repeatedly, calling it on the output of a previous `strsplit`. – Agata Jun 01 '20 at 08:41
  • Agata, please provide a concrete example relevant to the OP's use case, just as @ChrisRuehlemann has done. This will make your answer more helpful to the OP. – Limey Jun 01 '20 at 09:14