Converting vector of strings to tidy format

Question

Here is a vector of site URLs and some text, where each URL and its text are separated by a space:

v <- c("url www.site1.com this is the text of the site" , "url www.site2.com this is the text of the other site" )

I'm attempting to convert it to tidy format:

url            text
www.site1.com  this is the text of the site
www.site2.com  this is the text of the other site

using :

library(tidyr)

df <- data.frame(v)

df %>% separate(v , into=c("url" , "text") , sep = " ")

but this returns :

  url          text
1 url www.site1.com
2 url www.site2.com

Do I need to use a different regex to achieve the required tibble format?

How about `df %>% separate(v , into=c("literally_just_url", "url" , "text") , sep = " ")`. You can then drop the useless literally just url column. — Gregor Thomas, Dec 01 '17 at 15:31
Also, remember to save the result: `df <- df %>% separate(...)` — Nathan Werth, Dec 01 '17 at 15:32
@Gregor I receive the warning message "Too many values at 2 locations: 1, 2". Why is this displayed? https://stackoverflow.com/questions/41837430/warning-too-many-few-values-for-using-tidyr-packages-in-r suggests it's related to the use of regex? — blue-sky, Dec 01 '17 at 15:37
@Gregor using `df %>% separate(v , into=c("literally_just_url", "url" , "text") , sep = " ")` places just the first word into the text column, not the entire text. — blue-sky, Dec 01 '17 at 15:41
It's displayed because there are more than 2 spaces, so there are more than 3 components when split on a space. Use the `extra` argument to fix. — Gregor Thomas, Dec 01 '17 at 16:07

score 2 · Accepted Answer · answered Dec 01 '17 at 16:06

v <- c("url www.site1.com this is the text of the site" , "url www.site2.com this is the text of the other site" )
df = data.frame(v)
tidyr::separate(df, v, into = c("literally_just_url", "url", "text"),
                sep = " ", extra = "merge")
#   literally_just_url           url                               text
# 1                url www.site1.com       this is the text of the site
# 2                url www.site2.com this is the text of the other site
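As a possible follow-up to the code above, here is a minimal sketch of dropping the placeholder column in the same call. It assumes a tidyr version in which `into` accepts `NA` to omit a piece (0.8.0 or later, as far as I know). The name `res` and the `stringsAsFactors = FALSE` argument are additions for illustration and are not part of the original answer.

```r
library(tidyr)

v <- c("url www.site1.com this is the text of the site",
       "url www.site2.com this is the text of the other site")
df <- data.frame(v, stringsAsFactors = FALSE)

# NA in `into` omits that piece from the output (assumption: tidyr >= 0.8.0).
# extra = "merge" splits at most length(into) times, so everything after
# the second space stays together in `text`.
res <- separate(df, v, into = c(NA, "url", "text"),
                sep = " ", extra = "merge")
res
```

On an older tidyr, keeping the `literally_just_url` column and removing it afterwards with `dplyr::select()` should give the same two columns.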

score 2 · Answer 2 · answered Dec 01 '17 at 16:08

What about something like:

library(tidyverse)

tibble(v = v) %>% 
  mutate_at("v", str_replace, pattern = "^url ", replacement = "") %>% 
  separate(v, c("url", "text"), sep = " ", extra = "merge")

denrou

Taran · Answer 3 · 2017-12-01T16:50:26.397

1

How about this,

df %>% 
  extract(v, into = c("url", "text"), regex = "url\\s+(\\S+)\\s+([A-Za-z ]+)")

Explanation of the regex: `url\\s+` matches the literal url followed by one or more whitespace characters. `(\\S+)` captures the URL as one or more non-whitespace characters. `\\s+` matches the whitespace that separates the URL from the text. Finally, `([A-Za-z ]+)` captures the remainder of the text, as long as it consists only of letters and spaces.
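One caveat with the last group: because `[A-Za-z ]` only matches letters and spaces, text containing digits or punctuation would be cut off at the first such character. The sketch below is a possible variant, not the original answer's regex: it anchors the pattern and captures the rest of the string with `(.*)`. The data frame `df2` and its second row are hypothetical input made up for illustration.

```r
library(tidyr)

# Hypothetical input: the second text contains a digit and a comma.
df2 <- data.frame(
  v = c("url www.site1.com this is the text of the site",
        "url www.site2.com text with 2 numbers, and a comma"),
  stringsAsFactors = FALSE
)

# ^url\\s+  literal "url" at the start, then whitespace
# (\\S+)    the URL: one or more non-whitespace characters
# \\s+      whitespace separating the URL from the text
# (.*)$     everything that remains, whatever the characters
res2 <- extract(df2, v, into = c("url", "text"),
                regex = "^url\\s+(\\S+)\\s+(.*)$")
res2
```

The trade-off is that `(.*)` accepts any trailing content, so it no longer validates that the text is purely alphabetic.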

edited Dec 01 '17 at 16:50

answered Dec 01 '17 at 16:35

