Repeat dataframe rows based on cumsum index

Question

I have a dataframe as follows:

data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"))

  title     bk        ch
1 Title Book 1 Chapter 1
2 Title Book 1 Chapter 2
3 Title Book 3 Chapter 1

How do I repeat each observation based on the cumsum index below:

id=c(1,1,1,2,2,3,3,3,3)

The goal is to expand the dataframe so that it lines up, row by row, with the source vector that generated the cumsum index:

  title     bk        ch   source_vector
1 Title Book 1 Chapter 1   ...
1 Title Book 1 Chapter 1   
1 Title Book 1 Chapter 1   
2 Title Book 1 Chapter 2   
2 Title Book 1 Chapter 2   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1   
3 Title Book 3 Chapter 1

How do you want to use `id` ? Or do you just want to separate each word in `content` to separate row ? — Ronak Shah, Jul 22 '19 at 14:00
The original data is Chinese text, from which I removed the punctuation with `str_split`. — Sati, Jul 22 '19 at 14:02
@akrun Looks the same to me (words to separate == length of group), but since I'm not sure, I reopened — Sotos, Jul 22 '19 at 14:09
@Sotos I think this is different from the one you tagged. There is nothing I need to know from the answer over there. — Sati, Jul 22 '19 at 14:11
I reopened but I still fail to see what you want to accomplish — Sotos, Jul 22 '19 at 14:19
I edited the question to simplify the issue so that a more generic answer can be given. — Sati, Jul 22 '19 at 16:12

akrun · Answer 1 · 2019-07-22T14:43:56.570

1

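One option is `separate_rows` from tidyr, which puts each word of `content` on its own row (`df1` here is the question's data with a `content` column, as in the original version of the question):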

library(tidyverse)
df1 %>%
    separate_rows(content)
#  title     bk        ch content
#1 Title Book 1 Chapter 1    This
#2 Title Book 1 Chapter 1      is
#3 Title Book 1 Chapter 1     the
#4 Title Book 1 Chapter 2 content
#5 Title Book 1 Chapter 2      of
#6 Title Book 3 Chapter 1    each
#7 Title Book 3 Chapter 1 chapter
#8 Title Book 3 Chapter 1      in
#9 Title Book 3 Chapter 1   books

If the original rows need to be replicated instead, `uncount` repeats each row once per word:

df1 %>% 
    uncount(str_count(content, "\\w+")) %>%
    as_tibble
# A tibble: 9 x 4
#  title bk     ch        content              
#  <fct> <fct>  <fct>     <fct>                
#1 Title Book 1 Chapter 1 This is the          
#2 Title Book 1 Chapter 1 This is the          
#3 Title Book 1 Chapter 1 This is the          
#4 Title Book 1 Chapter 2 content of           
#5 Title Book 1 Chapter 2 content of           
#6 Title Book 3 Chapter 1 each chapter in books
#7 Title Book 3 Chapter 1 each chapter in books
#8 Title Book 3 Chapter 1 each chapter in books
#9 Title Book 3 Chapter 1 each chapter in books

edited Jul 22 '19 at 14:43

answered Jul 22 '19 at 13:56

akrun

So how do you handle the `per id` part here? Because if this is the solution, then we agree that it is a dupe – Sotos Jul 22 '19 at 14:10
@Sotos I would say that if the OP comes up with a giant `for` loop and wants to fix something, would it be fair to show an easier solution without a for loop? My comment on your tagging was based on the intention of the OP's post, but the output he/she gets is the same – akrun Jul 22 '19 at 14:11
Sure. But I don't get your point. The example works because they are the same length as each group. Maybe I don't understand the question – Sotos Jul 22 '19 at 14:13
@Sotos Here, the OP comes up with an `strsplit`, created some 'id's and then wants to get the expected output in a roundabout way – akrun Jul 22 '19 at 14:13
@Sotos If you look at the OP's code, he is splitting by space in 'content' column – akrun Jul 22 '19 at 14:14
ahhh, ok. I see what you mean now. Then Yes, you should have shown the best way, as you did. But in the same sense it should also be duped with the simpler one :) – Sotos Jul 22 '19 at 14:14
*But, that doesn't happen while others are posting*...I see you are steering away from friendly discussion so I will take my leave. Have a good one Arun! – Sotos Jul 22 '19 at 14:18

score 1 · Answer 2 · answered Jul 22 '19 at 14:04

In base R you can use `do.call` with `rbind`, after applying `strsplit` and `cbind` to each row:

x <- data.frame(title="Title", bk=c("Book 1", "Book 1", "Book 3"), ch=c("Chapter 1", "Chapter 2", "Chapter 1"), content=c("This is the", "content of", "each chapter in books"))
do.call("rbind", by(x, 1:nrow(x), function(x) {cbind(x[-ncol(x)], str_split_content=strsplit(as.character(x$content[1]), " ")[[1]])}))
#    title     bk        ch str_split_content
#1.1 Title Book 1 Chapter 1              This
#1.2 Title Book 1 Chapter 1                is
#1.3 Title Book 1 Chapter 1               the
#2.1 Title Book 1 Chapter 2           content
#2.2 Title Book 1 Chapter 2                of
#3.1 Title Book 3 Chapter 1              each
#3.2 Title Book 3 Chapter 1           chapter
#3.3 Title Book 3 Chapter 1                in
#3.4 Title Book 3 Chapter 1             books
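
The same result can also be reached without a per-row `by` call. The following is only a sketch, assuming the same sample data as above and a split on single spaces: it splits every row once, builds the repeated row index (which is exactly the `id` vector from the question), and indexes the data frame with it. The names `parts`, `id` and `out` are chosen here for illustration.

```r
# Sample data, as in the answer above.
x <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1"),
  content = c("This is the", "content of", "each chapter in books"),
  stringsAsFactors = FALSE
)

# Split each row's content on single spaces; gives a list of character vectors.
parts <- strsplit(x$content, " ", fixed = TRUE)

# One row index per piece: 1 1 1 2 2 3 3 3 3
id <- rep(seq_len(nrow(x)), times = lengths(parts))

# Repeat the non-content columns by index and attach the pieces.
out <- cbind(x[id, c("title", "bk", "ch")], str_split_content = unlist(parts))
rownames(out) <- NULL  # drop the 1, 1.1, 1.2, ... row names
out
```

Because the split happens once for the whole column, the intermediate `id` vector is available afterwards if the expanded rows need to be matched back to their source row.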

score 1 · Answer 3 · answered Jul 22 '19 at 14:34

If you simply want to expand the rows based on the number of words in `content`, here is one way to do it:

library(splitstackshape)
expandRows(ddf, lengths(gregexpr("\\W+", ddf$content)) + 1, count.is.col = FALSE)

#    title     bk        ch               content
#1   Title Book 1 Chapter 1           This is the
#1.1 Title Book 1 Chapter 1           This is the
#1.2 Title Book 1 Chapter 1           This is the
#2   Title Book 1 Chapter 2            content of
#2.1 Title Book 1 Chapter 2            content of
#3   Title Book 3 Chapter 1 each chapter in books
#3.1 Title Book 3 Chapter 1 each chapter in books
#3.2 Title Book 3 Chapter 1 each chapter in books
#3.3 Title Book 3 Chapter 1 each chapter in books
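
If the index vector from the question is already available, no extra package is needed. This is a minimal sketch, assuming the values in `id` are row positions of the data frame (as they are in the question's example); the name `expanded` is chosen here for illustration.

```r
# Data frame and index vector as given in the question.
df <- data.frame(
  title = "Title",
  bk = c("Book 1", "Book 1", "Book 3"),
  ch = c("Chapter 1", "Chapter 2", "Chapter 1")
)
id <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

# Indexing rows with a vector that repeats row numbers repeats those rows.
expanded <- df[id, ]
rownames(expanded) <- NULL  # drop the 1, 1.1, 1.2, ... row names
expanded

# Per-row repeat counts, if a count-based function is preferred instead.
counts <- tabulate(id)  # 3 2 4
```

The `counts` vector is the form that count-based helpers such as `expandRows` take, so either representation can be derived from the other.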

answered Jul 22 '19 at 14:34

Sotos

@akrun I know, but based on our discussion and the one with the OP, I thought that maybe all they needed to find out was how to expand....answering on assumptions until the OP clarifies, I guess – Sotos Jul 22 '19 at 14:37
What does that have to do with this answer? And Yes I know you don't downvote. I disagree... – Sotos Jul 22 '19 at 14:41
Yes, that plus the reopening/noise, etc...but I don't understand why we are discussing this... – Sotos Jul 22 '19 at 14:42

score 1 · Accepted Answer · answered Jul 22 '19 at 14:42

This is closer to what I was looking for:

df %>%
  mutate(str_split_content = str_split(content, " ")) %>%
  unnest(str_split_content)  # tidyr 1.0 and later expect the list column to be named

Someone posted this a while ago, then revised or removed it.

The original `str_split` was actually by punctuation, so it was not purely a split by words.
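
For a split by punctuation rather than by spaces, only the pattern changes. The sketch below is an assumption-laden illustration: the sample text is hypothetical (the original text is not shown in the question), and the pattern, runs of punctuation with optional surrounding whitespace, is one possible choice rather than the one actually used.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical sample; the original text is not shown in the question.
df <- tibble(
  title = "Title",
  bk = c("Book 1", "Book 1"),
  ch = c("Chapter 1", "Chapter 2"),
  content = c("first clause, second clause; third clause", "only one clause")
)

out <- df %>%
  # Split on runs of punctuation plus any surrounding whitespace.
  mutate(str_split_content = str_split(content, "\\s*[[:punct:]]+\\s*")) %>%
  # One output row per piece; the other columns are repeated.
  unnest(str_split_content)
out
```

Text that ends in punctuation would leave a trailing empty piece with this pattern, so such rows may need filtering afterwards.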

answered Jul 22 '19 at 14:42

Sati


`df %>% unnest(str_split_content = str_split(content, " "))` Just read the doc, and unnest allows for that :) – Pablo Rod Jul 22 '19 at 22:55
