Break up each dataframe row text into five even chunks of text

Question

I was hoping for some assistance with this thorny string problem.

Current dataframe

ID  Text
1   This is a very long piece of string. This contains many lines.

I would like to transform it to:

ID   Text1            Text2            Text3           Text4         Text5
1    This is a        very long piece  of string.      This contains  many lines.

The split should produce chunks with an even number of words. In the example above I have tried to show the text split evenly into 5 parts, so each column should contain about 20% of the words.

The objective is to arrange these words so that they can be treated as time series data, since a conversation has simply been split into consecutive parts.

Vincent Bonhomme · Accepted Answer · 2017-09-30T13:08:26.737

There is probably a better way to do it, but this works with no additional packages:

First thing, we create a reproducible example:

df <- data.frame(ID=1:2,
                 Text=c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Lorem ipsum dolor sit amet, consectetur adipiscing elit"),
                 stringsAsFactors = FALSE)

Then comes the tricky part: chunkize, a wrapper around split + cut. It takes a character string, splits it on spaces, groups the words into n chunks, and returns a data.frame with n columns. (We remove the names so that the later rbind works cleanly.)

chunkize <- function(chr, n=5){
  x <- strsplit(chr, " ")[[1]]
  df <- as.data.frame(
    lapply(
      split(x, 
            cut(seq_along(x), 
                breaks=n)), 
      paste, collapse=" "), 
    stringsAsFactors = FALSE, col.names=NULL)
  names(df) <- NULL
  df
}
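To see what the split + cut combination does before it is wrapped in a data.frame, here is a minimal sketch on the 12-word sentence from the question. The variable names are illustrative only, and labels = FALSE is an assumption made here so that cut returns plain bin numbers instead of interval labels.

```r
# Split the sentence into words (12 of them)
words <- strsplit("This is a very long piece of string. This contains many lines.",
                  " ", fixed = TRUE)[[1]]

# cut() divides the positions 1..12 into 5 equal-width intervals
bins <- cut(seq_along(words), breaks = 5, labels = FALSE)
table(bins)
# bins
# 1 2 3 4 5
# 3 2 2 2 3

# split() groups the words by bin, paste() glues each group back together
chunks <- vapply(split(words, bins), paste, character(1), collapse = " ")
unname(chunks)
# "This is a" "very long" "piece of" "string. This" "contains many lines."
```

Because the intervals have equal width rather than equal counts, the chunks can differ by one word when the word count is not a multiple of n.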

Then we simply apply chunkize to every row. We also add the ID column:

df_chunked <- do.call("rbind", 
                      apply(df, 1, 
                         function(x) cbind(x[1], chunkize(x[-1], 5))))

Finally, we rename columns:

colnames(df_chunked) <- c("ID", paste0("Text", 1:5))

The same thing wrapped in a handy function:

chunkize_this <- function(df, n=5){
  chunkize <- function(chr, n){
    x <- strsplit(chr, " ")[[1]]
    df <- as.data.frame(
      lapply(
        split(x, 
              cut(seq_along(x), 
                  breaks=n)), 
        paste, collapse=" "), 
      stringsAsFactors = FALSE, col.names=NULL)
    names(df) <- NULL
    df
  }

  df_chunked <- do.call("rbind", 
                        apply(df, 1, function(x) cbind(x[1], chunkize(x[-1], n))))
  colnames(df_chunked) <- c(colnames(df)[1], paste0("Text", 1:n))
  rownames(df_chunked) <- NULL
  df_chunked
}

You can try it with:

View(chunkize_this(df, 3))
View(chunkize_this(df, 5))

Another example:

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text=
  'ID   Text
  1    "This is a very long piece of string. This contains many lines."
  2    "This is a very long piece of string. It contains one or two more word."
  3    "Short"'
)

> chunkize_this(df, 5)
ID     Text1           Text2         Text3           Text4                Text5
1  1 This is a       very long      piece of    string. This contains many lines.
2  2 This is a very long piece of string. It contains one or       two more word.
3  3                                   Short
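One thing to be aware of is that apply() converts each row to a character vector, so the ID column comes back as text. A possible variation, sketched below under the same split + cut idea, maps over the Text column directly and keeps ID untouched. The names chunk_text and df_demo are hypothetical and not part of the answer above; empty chunks are returned as "".

```r
# Hypothetical helper: split one string into exactly n chunks (some may be "")
chunk_text <- function(txt, n = 5) {
  words <- strsplit(txt, " ", fixed = TRUE)[[1]]
  # Fix the factor levels to 1..n so that empty bins are kept
  bins <- factor(cut(seq_along(words), breaks = n, labels = FALSE),
                 levels = seq_len(n))
  vapply(split(words, bins), paste, character(1), collapse = " ")
}

# Small demo data, made up for illustration
df_demo <- data.frame(ID = 1:2,
                      Text = c("a b c d e f g h i j", "x y"),
                      stringsAsFactors = FALSE)

# One row of chunks per input row; ID keeps its integer type
chunks <- t(vapply(df_demo$Text, chunk_text, character(5), n = 5,
                   USE.NAMES = FALSE))
colnames(chunks) <- paste0("Text", 1:5)
out <- data.frame(ID = df_demo$ID, chunks, stringsAsFactors = FALSE)
out
```

For the two-word row, cut places the first word in the first bin and the second word in the last bin, so the middle columns are empty strings.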

Wow! Thank you. This worked amazingly well. Rows are splitting up equally as I was hoping. For further context, I have been attempting to apply Topic #5 in this paper where this type of action is required. https://arxiv.org/pdf/1605.04462.pdf — treeof, Oct 01 '17 at 12:52

Jaap · Answer 2 · 2017-10-01T07:36:55.477

Alternative approaches with implementations in data.table, base R and the tidyverse. The number of parts can be hard-coded or set in advance:

# pre-allocating number of parts
np <- 5

The different alternatives:

1) with 'data.table':

library(data.table)

# method 1
setDT(DF)[, strsplit(Text, "\\s"), by = ID
          ][, grp := rleid(cut(1:.N, np)), by = ID
            ][, paste(V1, collapse = " "), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = "V1")]

# method 2
setDT(DF)[, strsplit(Text, ' '), by = ID
          ][, grp := {s <- ceiling(.N/np); rleid(s:(.N+s-1) %/% (.N/np))}, by = ID
            ][, paste(V1, collapse = ' '), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = 'V1')]

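which give the following (the output shown corresponds to method 2; method 1, being based on cut, splits the 12 words of row 1 as 3, 2, 2, 2, 3 rather than 2, 3, 2, 2, 3):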

   ID     Text1           Text2         Text3           Text4                Text5
1:  1   This is     a very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text            NA              NA                   NA

2) base R:

# method 1
equal_parts <- function(x, np = 5) {
  n <- cut(seq_along(x), np)
  n <- as.integer(n)
  cumsum(c(1, diff(n) > 0))
}

# method 2
equal_parts <- function(x, np = 5) {
  n <- length(x)
  s <- ceiling(n/np)
  rl <- rle(s:(n+s-1) %/% (n/np))$lengths
  rep(seq_along(rl), rl)
}

DF.long <- stack(setNames(strsplit(DF$Text, ' '), DF$ID))

DF.long$grp <- with(DF.long, ave(values, ind, FUN =  equal_parts))
DF.agg <- aggregate(values ~ ind + grp, DF.long, paste0, collapse = ' ')

reshape(DF.agg, idvar = 'ind', timevar = 'grp', direction = 'wide')

which gives:

  ind  values.1        values.2      values.3        values.4             values.5
1   1   This is     a very long      piece of    string. This contains many lines.
2   2 This is a very long piece of string. It contains one or      two more words.
3   3     Short            text          <NA>            <NA>                 <NA>
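To make the grouping step easier to follow, the sketch below repeats method 1 (the cut-based version) under a hypothetical name, equal_parts_demo, and applies it to a 12-element and a 2-element vector. It only illustrates what the helper returns; it is not a replacement for the functions above.

```r
# Same logic as 'equal_parts' method 1, renamed so nothing above is overwritten
equal_parts_demo <- function(x, np = 5) {
  n <- as.integer(cut(seq_along(x), np))  # bin number for each position
  cumsum(c(1, diff(n) > 0))               # renumber bins consecutively from 1
}

equal_parts_demo(letters[1:12])
# 1 1 1 2 2 3 3 4 4 5 5 5

equal_parts_demo(c("Short", "text"))
# 1 2
```

Because the groups are renumbered consecutively, a row with only two words fills the first two columns, and the remaining columns become NA after reshaping.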

3) 'tidyverse':

library(dplyr)
library(tidyr)
separate_rows(DF, Text) %>% 
  group_by(ID) %>% 
  mutate(grp = equal_parts(Text)) %>%     # using the 'equal_parts'-function from the base R solution
  group_by(grp, .add = TRUE) %>%          # dplyr >= 1.0.0; older versions used add = TRUE
  summarise(Text = paste0(Text, collapse = ' ')) %>% 
  spread(grp, Text)

which gives:

# A tibble: 3 x 6
# Groups:   ID [3]
     ID       `1`             `2`           `3`             `4`                  `5`
* <int>     <chr>           <chr>         <chr>           <chr>                <chr>
1     1   This is     a very long      piece of    string. This contains many lines.
2     2 This is a very long piece of string. It contains one or      two more words.
3     3     Short            text          <NA>            <NA>                 <NA>

Used data:

DF <- structure(list(ID = 1:3, Text = c("This is a very long piece of string. This contains many lines.", 
                                        "This is a very long piece of string. It contains one or two more words.", 
                                        "Short text")),
                .Names = c("ID", "Text"), row.names = c(NA, -3L), class = "data.frame")

This doesn't give the correct outcome imo, as the number of words per column varies by row ;-) — Uwe, Oct 01 '17 at 07:32
@Uwe It does. OP wants to split the text in 5 parts (which also becomes clear when looking at the desired output). — Jaap, Oct 01 '17 at 07:39
Perhaps, you are right but the sample data with just one row gives room for interpretation / speculation. BTW, my gut feeling that it might be an X-Y problem. — Uwe, Oct 01 '17 at 07:46
Many thanks for this solution. I am most comfortable using tidyverse, so have used that to implement the solution. It has worked exactly as I have hoped. Thank you. — treeof, Oct 01 '17 at 13:16

Uwe · Answer 3 · 2017-10-01T07:35:56.400

The OP has supplied a data frame with only one row. Therefore, it is unclear what the expected result is for multiple rows with varying numbers of words in the text. Is it required that

1. the resulting columns contain the same number of words (if sufficient words are available), or
2. each row is split up separately?

Solution for case 1

If the requirement is that each column should contain the same number of words across all rows (if sufficient words are available), the row with the most words determines the distribution. Rows with fewer words fill their columns from the left (left aligned).

library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
    , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

   ID      Text1           Text2           Text3                Text4           Text5
1:  1  This is a very long piece of string. This contains many lines.                
2:  2  This is a very long piece   of string. It      contains one or two more words.
3:  3 Short text                                                                     
4:  4    Shorter

Columns Text1 to Text4 contain the same number of words (3 each) for rows 1 and 2. Rows with fewer words than columns are filled up from the left.
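The idea behind case 1 can also be sketched in base R: the bins are computed once from the longest row and then reused for every row. This is only an illustration of the principle, not the data.table code above; the names texts, shared_bins and chunks are made up for the example.

```r
texts <- c("This is a very long piece of string. This contains many lines.",
           "This is a very long piece of string. It contains one or two more words.",
           "Short text")
words  <- strsplit(texts, " ", fixed = TRUE)
n_brks <- 5L

# Bin number for every word position 1..max_len, shared across all rows
max_len     <- max(lengths(words))
shared_bins <- cut(seq_len(max_len), n_brks, labels = FALSE)

# Each row uses only as many positions (and therefore bins) as it has words
chunks <- lapply(words, function(w) {
  unname(vapply(split(w, shared_bins[seq_along(w)]),
                paste, character(1), collapse = " "))
})

chunks[[1]]
# "This is a" "very long piece" "of string. This" "contains many lines."
chunks[[3]]
# "Short text"
```

The longest row has 15 words, so every bin covers 3 positions; the 12-word row therefore fills only four chunks and the 2-word row only one.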

Data

library(data.table)

DT <- fread(
  'ID   Text
   1    "This is a very long piece of string. This contains many lines."
   2    "This is a very long piece of string. It contains one or two more words."
   3    "Short text"
   4     "Shorter"')

Explanation

After coercion to data.table, the text in each row is split up at word boundaries and returned in long format (which might be seen as equivalent to a time series):

n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID]

    ID       V1
 1:  1     This
 2:  1       is
 3:  1        a
 4:  1     very
 5:  1     long
 6:  1    piece
 7:  1       of
 8:  1  string.
 9:  1     This
10:  1 contains
11:  1     many
12:  1   lines.
13:  2     This
14:  2       is
15:  2        a
16:  2     very
17:  2     long
18:  2    piece
19:  2       of
20:  2  string.
21:  2       It
22:  2 contains
23:  2      one
24:  2       or
25:  2      two
26:  2     more
27:  2   words.
28:  3    Short
29:  3     text
30:  4  Shorter
    ID       V1

Then the words are concatenated again using a computed grouping variable, which applies the cut() function to the rowid() numbering to create n_brks chunks:

setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]

    ID         cut                   V1
 1:  1 (0.986,3.8]            This is a
 2:  1   (3.8,6.6]      very long piece
 3:  1   (6.6,9.4]      of string. This
 4:  1  (9.4,12.2] contains many lines.
 5:  2 (0.986,3.8]            This is a
 6:  2   (3.8,6.6]      very long piece
 7:  2   (6.6,9.4]        of string. It
 8:  2  (9.4,12.2]      contains one or
 9:  2   (12.2,15]      two more words.
10:  3 (0.986,3.8]           Short text
11:  4 (0.986,3.8]              Shorter

Finally, this result is reshaped again from long into wide format to create the expected result. The column headers are created by the rowid() function and missing values are replaced by "":

setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
    , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

Solution for case 2

If the requirement is that each row individually should be split up and the words distributed evenly, the number of words in each column will vary from column to column. Rows with fewer words than columns will have at most one word per column.

The solution for this case is a modification of Jaap's suggestion:

library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , ri := cut(seq_len(.N), n_brks), by = ID][
    , paste(V1, collapse = " "), by = .(ID, ri)][
      , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

   ID     Text1           Text2         Text3           Text4                Text5
1:  1 This is a       very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text                                                   
4:  4   Shorter

Now, the number of words in each column varies by row. E.g., columns Text2 to Text4 have 2 words each in row 1 and 3 words each in row 2. The 2 words of row 3 are placed in separate columns.

This doesn't give the correct outcome imo, see the first row which doesn't have five text groups. — Jaap, Sep 30 '17 at 20:58
A possible alternative: `setDT(DT)[, strsplit(Text, "\\s"), by = ID][, ri := rowid(ID)][, ri := cut(ri, 5), by = ID][, paste(V1, collapse = " "), by = .(ID, ri)][, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]` — Jaap, Sep 30 '17 at 21:05
Apologies about any confusion caused. It was case 1 for the question I had. Also your solution works wonderfully. Thank you. — treeof, Oct 01 '17 at 12:59
Appreciate your feedback and confirmation that you were seeking a solution for case 1. But, why did you accept a solution for case 2, then? (Please, don't get me wrong. I'm just curious) — Uwe, Oct 01 '17 at 13:08
Oh wait. I realise what you mean. Case 2 is what I am looking for. I needed to re-read that part carefully. — treeof, Oct 01 '17 at 13:22
