The OP has supplied a data frame with only one row. Therefore, it is unclear what the expected result is for multiple rows with a varying number of words in the text. Is it required that
- the resulting columns contain the same number of words (if sufficient words are available), or,
- each row is split up separately?
Solution for case 1
If the requirement is that each column should contain the same number of words across all rows (if sufficient words are available), the row with the most words determines the distribution. Rows with fewer words fill their columns from the left (left aligned), and the remaining columns stay empty.
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID      Text1           Text2           Text3                Text4           Text5
1:  1  This is a very long piece of string. This contains many lines.
2:  2  This is a very long piece   of string. It      contains one or two more words.
3:  3 Short text
4:  4    Shorter
Columns Text1 to Text4 contain the same number of words (3 each) for rows 1 and 2. Rows with fewer words than columns are filled from the left.
Data
library(data.table)
DT <- fread(
'ID Text
1 "This is a very long piece of string. This contains many lines."
2 "This is a very long piece of string. It contains one or two more words."
3 "Short text"
4 "Shorter"')
Explanation
After coercion to data.table, the text in each row is split at whitespace and returned in long format, one word per row (which might be seen as equivalent to a time series):
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID]
ID V1
1: 1 This
2: 1 is
3: 1 a
4: 1 very
5: 1 long
6: 1 piece
7: 1 of
8: 1 string.
9: 1 This
10: 1 contains
11: 1 many
12: 1 lines.
13: 2 This
14: 2 is
15: 2 a
16: 2 very
17: 2 long
18: 2 piece
19: 2 of
20: 2 string.
21: 2 It
22: 2 contains
23: 2 one
24: 2 or
25: 2 two
26: 2 more
27: 2 words.
28: 3 Short
29: 3 text
30: 4 Shorter
ID V1
Then the words are concatenated again using a computed grouping variable, which applies the cut() function to the rowid() numbering to create n_brks chunks:
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]
ID cut V1
1: 1 (0.986,3.8] This is a
2: 1 (3.8,6.6] very long piece
3: 1 (6.6,9.4] of string. This
4: 1 (9.4,12.2] contains many lines.
5: 2 (0.986,3.8] This is a
6: 2 (3.8,6.6] very long piece
7: 2 (6.6,9.4] of string. It
8: 2 (9.4,12.2] contains one or
9: 2 (12.2,15] two more words.
10: 3 (0.986,3.8] Short text
11: 4 (0.986,3.8] Shorter
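To see why the shorter rows end up with fewer chunks, it may help to look at cut() on its own. The following is a minimal base R sketch, not part of the solution itself; the vector pos is a hypothetical stand-in for the rowid(ID) numbering of a 12-word row and a 15-word row. Because cut() is called once on the whole column, the interval boundaries are derived from the overall range (1 to 15), so both rows share the same breaks.

```r
# Hypothetical word positions for a 12-word row followed by a 15-word row,
# mimicking what rowid(ID) returns for the long-format data
pos <- c(seq_len(12), seq_len(15))

# One call to cut() on the pooled positions: 5 equal-width intervals
# computed from the overall range of pos
chunk <- cut(pos, 5L)
levels(chunk)

# Words per chunk for the 12-word row: 3, 3, 3, 3, 0 (the last chunk stays empty)
table(chunk[1:12])

# Words per chunk for the 15-word row: 3 in every chunk
table(chunk[13:27])
```

The interval labels match the cut column shown above, which suggests that the longest row alone determines where the breaks fall.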
Finally, this result is reshaped again from long into wide format to create the expected result. The column headers are created by the rowid() function and missing values are replaced by "":
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
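The left alignment comes from how rowid() numbers the chunks. As a small illustration (the id vector below is made up and only mimics the ID column after the words have been pasted together), rowid() counts the occurrences within each ID in order of appearance, independent of which cut() interval a chunk originally belonged to:

```r
library(data.table)

# Hypothetical ID column after pasting: ID 1 has four chunks,
# ID 3 and ID 4 have one chunk each
id <- c(1, 1, 1, 1, 3, 4)

# rowid() numbers the chunks within each ID; prefix turns the counter
# into the column names used by dcast()
hdr <- rowid(id, prefix = "Text")
hdr
# "Text1" "Text2" "Text3" "Text4" "Text1" "Text1"
```

Since the counter restarts at 1 for every ID, the first chunk of each row always lands in Text1, and dcast() fills any columns a row does not reach with "".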
Solution for case 2
If the requirement is that each row individually should be split up and the words distributed evenly, the number of words in each column will vary from row to row. Rows with fewer words than columns will have at most one word per column.
The solution for this case is a modification of Jaap's suggestion:
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, ri := cut(seq_len(.N), n_brks), by = ID][
, paste(V1, collapse = " "), by = .(ID, ri)][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID     Text1           Text2         Text3           Text4                Text5
1:  1 This is a       very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text
4:  4   Shorter
Now, the number of words in each column varies by row. E.g., columns Text2 to Text4 have 2 words each in row 1 and 3 words each in row 2. The 2 words of row 3 are placed in separate columns.
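The per-row distribution can be checked without data.table. This is a rough base R sketch under the assumption that the rows have 12, 15 and 2 words, as in the sample data; the name counts is only used for illustration. Here cut() is applied to each row's own word positions, so every row gets its own breaks:

```r
n_brks <- 5L

# Word counts of the first three sample rows (assumed: 12, 15 and 2 words)
n_words <- c(12L, 15L, 2L)

# For each row, cut its own positions 1..n into n_brks intervals and
# count the words per interval; the result has one column per row
counts <- sapply(n_words, function(n) as.integer(table(cut(seq_len(n), n_brks))))
counts
```

For the 12-word row the counts are 3, 2, 2, 2, 3, and for the 15-word row 3 in every chunk. The 2-word row puts one word in the first interval and one in the last; the empty intervals in between disappear in the grouping step, which is why the two words show up in Text1 and Text2.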