0

I tried several times and it does not work. How can I split sentences contained in a cell into different rows maintaining the rest of the values?

Example: Dataframe df has 20 columns. Row j, Column i contains some comments which are separated by " | " I want to have a new dataframe df2 which increases the amount of rows depending the number of sentences. This means, if cell j,i has Sentence A | Sentence B

Row j, Column i has Sentence A Row j+1, Column i has Sentence B Columns 1 to i-1 and i+1 to 20 have the same value in rows j and j+1.

I do not know if this has an easy solution.

Thank you very much.

icatalan
  • 101
  • 2
  • 10
  • 1
    Welcome to SO, icatalan! Please make this question *reproducible*. This includes sample code you've attempted (including listing non-base R packages, and any errors/warnings received), sample *unambiguous* data (e.g., `dput(head(x))` or `data.frame(x=...,y=...)`), and intended output given that input. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Oct 01 '20 at 20:06

2 Answers2

1

Here is a solution using 3 tidyverse packages that accounts for an unknown maximum number of comments

library(dplyr)
library(tidyr)
library(stringr)

# Create function to calculate the max number comments per observation within 
# df$col3 and create a string of unique "names"
cols <- function(x) {
    cmts <- str_count(x, "([|])")
    max_cmts <- max(cmts, na.rm = TRUE) + 1
    features <- c(sprintf("V%02d", seq(1, max_cmts)))
}

# Create the data
df1 <- data.frame(col1 = c("a", "b", "c", "d"),
                  col2 = c(1, 2, 3, 4),
                  col3 = c("fitz|buzz", NA, 
                           "hello world|today is Thursday | its 2:00|another comment|and yet another comment", "fitz"),
                  stringsAsFactors = FALSE)

# Generate the desired output
df2 <- separate(df1, col3, into = cols(x = df1$col3), 
                sep = "([|])", extra = "merge", fill = "right") %>% 
    pivot_longer(cols = cols(x = df1$col3), values_to = "comments", 
                 values_drop_na = TRUE) %>% 
    select(-name)

Which results in

df2
# A tibble: 8 x 3
  col1   col2 comments                 
  <chr> <dbl> <chr>                    
1 a         1 "fitz"                   
2 a         1 "buzz"                   
3 c         3 "hello world"            
4 c         3 "today is Thursday "     
5 c         3 " its 2:00"              
6 c         3 "another comment"        
7 c         3 "and yet another comment"
8 d         4 "fitz" 
  • Thanks Christopher! What about c("V4", "V5", "V6") if I do not know how many comments there are in the cell? – icatalan Oct 04 '20 at 10:33
  • icatalan, I modified my initial answer to account for an unknown number of comments. while it works, akrun's answer is far more parsimonious. – Christopher Michaud Oct 04 '20 at 20:20
1

We could use cSplit from splitstackshape

library(splitstackshape)
cSplit(df, 'col3', sep="\\|", "long", fixed = FALSE)
#   col1 col2                col3
#1:    a    1                fitz
#2:    a    1                buzz
#3:    b    2                 foo
#4:    b    2                 bar
#5:    c    3         hello world
#6:    c    3   today is Thursday
#7:    c    3           its 2:00
#8:    d    4                fitz

data

df <- structure(list(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, 3, 
4), col3 = c("fitz|buzz", "foo|bar", "hello world|today is Thursday | its 2:00", 
"fitz")), class = "data.frame", row.names = c(NA, -4L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hi @akrun, I tried this solution but this only separates into two columns and then only the numbers saved into one of this column but the text does not appear. What do you think what's happening? – icatalan Oct 19 '20 at 17:29
  • @icatalan Can you show your expected output? I used the data in my post. Please check if that also gives the issue you mentioned. If you can show the `dput` of your example, I can test it. thanks – akrun Oct 19 '20 at 17:55