
Here is a vector of site URLs and some text, where each URL and its text are separated by a space:

v <- c("url www.site1.com this is the text of the site" , "url www.site2.com this is the text of the other site" )

I'm attempting to convert it to tidy format:

url            text
www.site1.com  this is the text of the site
www.site2.com  this is the text of the other site

using:

df <- data.frame(v)

df %>% separate(v , into=c("url" , "text") , sep = " ")

but this returns:

  url          text
1 url www.site1.com
2 url www.site2.com

Do I need to use a different regex to get the required tibble format?

blue-sky
  • How about `df %>% separate(v , into=c("literally_just_url", "url" , "text") , sep = " ")`. You can then drop the useless literally just url column. – Gregor Thomas Dec 01 '17 at 15:31
  • Also, remember to save the result: `df <- df %>% separate(...)` – Nathan Werth Dec 01 '17 at 15:32
  • @Gregor I receive the warning message "Too many values at 2 locations: 1, 2". Why is this displayed? https://stackoverflow.com/questions/41837430/warning-too-many-few-values-for-using-tidyr-packages-in-r suggests it's related to the use of regex? – blue-sky Dec 01 '17 at 15:37
  • @Gregor using `df %>% separate(v , into=c("literally_just_url", "url" , "text") , sep = " ")` places just the first word into the text column, not the entire text. – blue-sky Dec 01 '17 at 15:41
  • It's displayed because there are more than 2 spaces, so there are more than 3 components when the string is split on a space. Use the `extra` argument to fix it. – Gregor Thomas Dec 01 '17 at 16:07

3 Answers

v <- c("url www.site1.com this is the text of the site" , "url www.site2.com this is the text of the other site" )
df = data.frame(v)
tidyr::separate(df, v, into = c("literally_just_url", "url", "text"),
                sep = " ", extra = "merge")
#   literally_just_url           url                               text
# 1                url www.site1.com       this is the text of the site
# 2                url www.site2.com this is the text of the other site
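If the placeholder column is not needed at all, it can be skipped instead of created and dropped. The following is a minimal sketch, assuming a reasonably recent version of tidyr in which `into` accepts `NA` to omit a piece from the output; the name `tidy_df` is only illustrative.

```r
library(tidyr)

v <- c("url www.site1.com this is the text of the site",
       "url www.site2.com this is the text of the other site")
df <- data.frame(v, stringsAsFactors = FALSE)

# NA in `into` discards the first piece (the literal "url");
# extra = "merge" keeps everything after the second space in `text`.
tidy_df <- separate(df, v, into = c(NA, "url", "text"),
                    sep = " ", extra = "merge")
tidy_df
```

On an older tidyr where `NA` is not accepted in `into`, naming the first column and removing it afterwards achieves the same result.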
Gregor Thomas

What about something like:

library(tidyverse)

tibble(v = v) %>% 
  mutate_at("v", str_replace, pattern = "^url ", replacement = "") %>% 
  separate(v, c("url", "text"), sep = " ", extra = "merge")
denrou

How about this:

df %>%
  extract(v, into = c('url', 'text'), regex = "url\\s+(\\S+)\\s+([A-Za-z ]+)")

Explanation of the regex: `url\\s+` matches the literal url followed by one or more whitespace characters. `(\\S+)` captures one or more non-whitespace characters, which is the URL you want. `\\s+` matches the whitespace after the URL. Finally, `([A-Za-z ]+)` captures the remainder of the text, made up of letters and spaces.
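The character class `[A-Za-z ]` only covers letters and spaces, so the captured text would likely stop at the first digit or punctuation mark. A hedged variant, assuming the text is simply everything after the URL, uses `(.*)` for the last group and anchors the pattern. The second sample string below is made up for illustration and is not from the question.

```r
library(tidyr)

v <- c("url www.site1.com this is the text of the site",
       "url www.site2.com text with digits 123 and punctuation, too!")
df <- data.frame(v, stringsAsFactors = FALSE)

# ^url\\s+  literal "url" at the start, then whitespace
# (\\S+)    the URL: a run of non-whitespace characters
# \\s+(.*)$ whitespace, then the rest of the line as the text
res <- extract(df, v, into = c("url", "text"),
               regex = "^url\\s+(\\S+)\\s+(.*)$")
res
```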

Taran