
I want to extract specific words from a variable (in fact, URLs) and create a new variable that contains only the extracted words. Examining the patterns, I found that I want the words between the characters \\"> and ", as follows:

> dados$source[1:20]
 [1] "<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>"
 [2] "<a href=\\\"http://twitter.com/download/android\\\" rel=\\\"nofollow\\\">Twitter for Android</a>"
 [3] "<a href=\\\"http://twitter.com\\\" rel=\\\"nofollow\\\">Twitter Web Client</a>"

How can I do it?

alistaire

2 Answers


If you've got HTML, use an HTML parser like rvest to parse to strings. Once you've got non-HTML strings, you can use regex.

library(purrr)    # use lapply and sapply if you prefer
library(rvest)

# representative data
links <- c("<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>", 
    "<a href=\\\"http://twitter.com/download/android\\\" rel=\\\"nofollow\\\">Twitter for Android</a>", 
    "<a href=\\\"http://twitter.com\\\" rel=\\\"nofollow\\\">Twitter Web Client</a>")

links %>% map(read_html) %>% 
    map_chr(html_text) %>% 
    sub('Twitter (for )?', '', .)

## [1] "iPhone"     "Android"    "Web Client"
alistaire

I am not sure I understand exactly which patterns you want to extract, but a regex would be the way to go. Here is an example from the question "Removing html tags from a string in R":

cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}
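
As a rough sketch of how that function might be applied to the data in the question, the snippet below strips the tags and then removes the leading "Twitter" or "Twitter for" prefix with a second regex. The data frame here is only a hypothetical stand-in for the asker's dados, and the column names client_label and client are made up for illustration.

```r
# Hypothetical stand-in for the asker's data frame `dados`
dados <- data.frame(
  source = c(
    "<a href=\\\"http://twitter.com/download/iphone\\\" rel=\\\"nofollow\\\">Twitter for iPhone</a>",
    "<a href=\\\"http://twitter.com/download/android\\\" rel=\\\"nofollow\\\">Twitter for Android</a>",
    "<a href=\\\"http://twitter.com\\\" rel=\\\"nofollow\\\">Twitter Web Client</a>"
  ),
  stringsAsFactors = FALSE
)

# Remove anything that looks like an HTML tag (non-greedy match)
cleanFun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}

# New variable with the tag-free label, e.g. "Twitter for iPhone"
# (`client_label` and `client` are illustrative names, not from the question)
dados$client_label <- cleanFun(dados$source)

# Optionally drop the "Twitter " / "Twitter for " prefix as well
dados$client <- sub("^Twitter (for )?", "", dados$client_label)

print(dados[, c("client_label", "client")])
```

This works for simple, well-formed anchors like the ones shown. For messier HTML, a regex of this kind can misfire, which is why a parser may be the safer choice.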
Community