Fast data.table column split to multiple rows based on delimiter

Question

I have a data.table with 3 columns that I want to split the 3rd by a delimiter to multiple rows.

My current implementation is:

protein.ids <- c("PA0001","PA0001", "PA0002", "PA0002", "PA0002")
protein.names <- c("protein A", "protein A", "protein B", "protein B", "protein B")
peptides.ids <- c("1;3;2", "81;23;72", "7;6;8", "10;35;21", "5;2;7")

data <- data.frame(matrix(c(protein.ids, protein.names, peptides.ids), 
                          nrow = 5), 
                  stringsAsFactors = FALSE)

colnames(data) <- c("Protein IDs", "Protein Names", "Peptide IDs")

data <- data.table(data)

data[ ,list(`Peptide IDs` = unlist(strsplit(`Peptide IDs`, ";"))),
       by = list(`Protein IDs`, `Protein Names`)]

However my data.table is quite big (~1.2G) and till now it gets ~3 seconds to run, so is there a faster approach to achieve the same results or there isn't any juice worth to squeeze?

Do you want to split it to new columns or just a `list` column — akrun, Aug 01 '18 at 15:00
[This answer](https://stackoverflow.com/a/43431847/3817004) contains a benchmark comparison of various methods for [Split comma-separated strings in a column into separate rows](https://stackoverflow.com/q/13773770/3817004) (the long form). — Uwe, Aug 01 '18 at 15:54

score 4 · Accepted Answer · answered Aug 01 '18 at 14:58

4

We can use tstrsplit on the third column to split into multiple columns and assign (:=) the output to column names of interest

data[, paste0("V", 1:3) := tstrsplit(`Peptide IDs`, ";", type.convert = TRUE)]

If we need the 'long' format

library(splitstackshape)
cSplit(data, "Peptide IDs", ";", "long")

answered Aug 01 '18 at 14:58

akrun

874,273
37
540
662

1

The cSplit did the job almost at the slighty faster during the analysis, but surpass the data.table implementation during the benchmarking. Thanks! – Vangelis Theodorakis Aug 01 '18 at 15:52

Fast data.table column split to multiple rows based on delimiter

1 Answers1