How to merge rows for topic modelling with same doc_id

Question

I have an excel table of speeches with the columns doc_id, speaker, text where every paragraph of each speech is in a single row. How can I merge the rows with the same doc_id (each speech has its own doc_id) that the whole speech is in one row in the text column?

I have a Windows server running R. I tried the dplyr package to summarize the rows with the same content and I read about merging in other threads, but there is only merging rows while create new columns.

actual result

doc_id    speaker     text
7001         x        Hello again,
7001         x        I want to speak about
7002         y        Ladies and Gentlemen,
7002         y        My research is about

expected result

doc_id    speaker     text
7001         x        Hello again, I want to speak about
7002         y        Ladies and Gentlemen, My research is about

Arienrhod · Accepted Answer · 2019-08-06T13:35:06.943

2

This should do the trick:

library(tidyverse)

df <- tibble(doc_id = c(7001, 7001, 7002, 7002),
             speaker = c('x', 'x', 'y', 'y'),
             text = c('Hello again, ', 'I want to speak about',
                      'Ladies and gentlemen, ', 'My research is about...'))
# if you want to concatenate text in multiple columns
df_concat <- df %>%
  group_by(doc_id, speaker) %>% 
  summarise_all(list(~paste(., collapse='')))
# or just in text
df_concat <- df %>%
  group_by(doc_id, speaker) %>% 
  summarise(text = paste(text, collapse = ""))

edited Aug 06 '19 at 13:35

answered Aug 06 '19 at 11:49

Arienrhod

2,451
1
11
19

I have a stupid question : why `summarise(paste(text, collapse =""))` after the `group_by` is not concatenating the characters in the groups? – tom Aug 06 '19 at 12:02
Because `summarise_all` is not applied on the grouping columns, only on whatever's left. – Arienrhod Aug 06 '19 at 12:04
Sorry my question was not propoerty asked. Why not just use `summarise` on the `text` only? I know it fails but for what reason? – tom Aug 06 '19 at 12:07
oh, it does work - the `summarise_all` was just a more general solution. but i'll update my answer to include both. – Arienrhod Aug 06 '19 at 13:33
Thanks a lot. I used the summarise solution and it worked out very well. I just needed the dplyr package additionally and there it is. Many thanks! – Marco G. Aug 06 '19 at 15:13

score 1 · Answer 2 · answered Aug 06 '19 at 11:57

A data.table approach, where df is your data frame

library(data.table)

setDT(df)[, .(text = paste(text, collapse = ' ')),
          by = .(doc_id, speaker)]

doc_id speaker                                       text
1:   7001       x         Hello again, I want to speak about
2:   7002       y Ladies and Gentlemen, My research is about

How to merge rows for topic modelling with same doc_id

2 Answers2