Add a list column to a dataframe

Question

I have a dataframe with 100 rows. One of its columns consists of text. I would like to split that text column into sentences, so that each entry becomes a list of sentences. I am splitting with the stringi package function stri_split_lines.

Example:

rowID       text
1         There is something wrong. It is bad. We made it better
2          The sky is blue. The sea is green.

Desired output

rowID       text
1           [1] There is something wrong.
            [2] It is bad.
            [3] We made it better
2           [1] The sky is blue.
            [2] The sea is green.

I have tried

dataframe<-do.call(rbind.data.frame, stri_split_lines(dataframe$text, omit_empty = TRUE))

Please share data with `dput()`. – s_baldur Jan 09 '19 at 16:07

Jared Wilber · Accepted Answer · 2019-01-09T18:21:43.570

Here ya go, a solution from the tidyverse (and no longer using stringi):

Assume your dataframe is called df.

Solution

  library(dplyr)

  # Split on whitespace that follows punctuation and precedes a capital letter
  df %>%
    mutate(text = strsplit(text, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

Explanation: The strsplit in the mutate call returns a list, so your data frame now has a true list-column. (The string-split regex was found here)
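To check that the column really is a list-column, here is a minimal sketch using the two example rows from the question. The data frame `df` is rebuilt here purely for illustration, and the sketch assumes `text` is a character column rather than a factor.

```r
library(dplyr)

# Example data from the question (assumed to be character, not factor)
df <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# Split on whitespace that follows punctuation and precedes a capital letter
out <- df %>%
  mutate(text = strsplit(text, "(?<=[[:punct:]])\\s(?=[A-Z])", perl = TRUE))

is.list(out$text)   # TRUE: one character vector per row
lengths(out$text)   # number of sentences per row: 3 and 2
out$text[[1]]       # the sentences of the first row
```

Note that this pattern only splits where the next sentence starts with a capital letter A-Z, so a sentence beginning with a digit or a lowercase letter stays attached to the previous one.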

What if I want to split the list column into multiple rows?

To split the members of that list into their own rows, you have two options:

1. Simply call tidyr::unnest on the list-column (that is, on the result of the mutate call above):
```
df %>% tidyr::unnest(text)
```
2. Use tidyr::separate_rows on the original dataframe (before creating the list-column):
```
df %>% tidyr::separate_rows(text, sep= "(?<=[[:punct:]])\\s(?=[A-Z])")
```
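As a rough comparison of the two options, the sketch below runs both on the example rows from the question. The names `pattern`, `long1` and `long2` are only illustrative, and the exact regex engine behind `separate_rows` may differ between tidyr versions, so treat this as a check to run rather than a guarantee.

```r
library(dplyr)
library(tidyr)

# Example data from the question (assumed to be character, not factor)
df <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

pattern <- "(?<=[[:punct:]])\\s(?=[A-Z])"

# Option 1: build the list-column first, then unnest it into rows
long1 <- df %>%
  mutate(text = strsplit(text, pattern, perl = TRUE)) %>%
  unnest(text)

# Option 2: split the original character column straight into rows
long2 <- df %>%
  separate_rows(text, sep = pattern)

nrow(long1)   # one row per sentence
nrow(long2)   # should match option 1
long1$rowID   # rowID is repeated for each sentence of the same row
```

Both routes should end with one sentence per row; option 2 skips the intermediate list-column, which is convenient if the long format is all you need.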

score 0 · Answer 2 · answered Jan 09 '19 at 16:13

Example:

dataframe[["text"]] <- strsplit(dataframe[["text"]], split = "\\.")
str(dataframe)

'data.frame':   2 obs. of  2 variables:
 $ rowID: int  1 2
 $ text :List of 2
  ..$ : chr  "There is something wrong" " It is bad" " We made it better"
  ..$ : chr  "The sky is blue" " The sea is green"

Data

dataframe <- data.frame(
  rowID = 1:2, 
  text = 
    c(
      "There is something wrong. It is bad. We made it better",
      "The sky is blue. The sea is green."
    ),
  stringsAsFactors = FALSE
)
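Splitting on `"\\."` alone drops the full stops and leaves a leading space on every sentence after the first, as the `str()` output shows. One possible variation, sketched here under the assumption that sentences are separated by a full stop followed by optional whitespace, is to let the split pattern consume that whitespace as well.

```r
# Example data, same as in the Data section
dataframe <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)

# Split on a full stop plus any whitespace that follows it,
# so the pieces carry no leading spaces
dataframe[["text"]] <- strsplit(dataframe[["text"]], split = "\\.\\s*")

str(dataframe)
dataframe[["text"]][[1]]
```

The full stops are still removed by this approach; if they should be kept, a lookbehind pattern such as the one in the accepted answer is the better fit.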

answered Jan 09 '19 at 16:13

s_baldur


Thanks @snoram. There are not always full stops at the end of lines, which is why I wanted to use stri_split_lines. I'm not sure why the output can't be sent straight to the dataframe though – Sebastian Zeki Jan 09 '19 at 16:19
`stri_split_lines(dataframe$text, omit_empty = TRUE)` does not split the strings. I think that might be the issue. – s_baldur Jan 09 '19 at 16:25
I'm confused. It should split the strings, and when I run it without trying to assign it back to itself, it outputs the list as above – Sebastian Zeki Jan 09 '19 at 16:33
It splits strings based on new line characters not "." so if you add new lines to the strings it will actually split them. Or you could use `stri_split_regex` if you can figure out the regex you would need – see24 Jan 09 '19 at 16:43
Agreed. `stri_split_lines()` splits based on the line the text is on, not based on sentences. – Adam Sampson Jan 09 '19 at 16:49
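
The point made in these comments can be sketched as follows. The sentence pattern `"(?<=[.!?])\\s+"` is an assumption chosen for illustration, not something taken from the thread, and the example data frame is rebuilt from the question.

```r
library(stringi)

x <- "The sky is blue. The sea is green."

# No newline characters in x, so the string comes back as a single piece
stri_split_lines(x, omit_empty = TRUE)

# With an explicit newline, the same function does split
stri_split_lines("The sky is blue.\nThe sea is green.", omit_empty = TRUE)

# Assumed sentence rule: split on whitespace preceded by ., ! or ?
stri_split_regex(x, "(?<=[.!?])\\s+")

# The list returned by stri_split_regex can be assigned straight to the column
dataframe <- data.frame(
  rowID = 1:2,
  text = c(
    "There is something wrong. It is bad. We made it better",
    "The sky is blue. The sea is green."
  ),
  stringsAsFactors = FALSE
)
dataframe[["text"]] <- stri_split_regex(dataframe[["text"]], "(?<=[.!?])\\s+")
lengths(dataframe[["text"]])   # sentences per row
```

Assigning the list directly avoids the `do.call(rbind.data.frame, ...)` step from the question, which tries to bind vectors of different lengths into rows instead of storing them as one list element per row.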
