1

I have a quite simple question regarding data.table dt, but as always, this package brings me to the brink of despair :D

I have a column name with these contents: e.g. "bin1_position1_ID1" and I want to split these infos into separate columns:

name                        bins   positions   IDs
--------------------       ------------------------
"bin1_position1_ID1"   ->  "bin1" "position1" "ID1"
"bin2_position2_ID2"       "bin2" "position2" "ID2"

I tried it with

dt <- dt[, bins := lapply(.SD, function(x) strsplit(x, "_")[[1]][1]), .SDcols="name"]
(and for the other new columns with [[1]][2] [[1]][3])

However, I end up having a new column bins (so far, so good), but this has the info from row 1 in every row and not the info from the same row then itself (i.e. bin1 in every row).

And I have some columns that have more infos, that I don't want to make to columns. e.g. one column has "bin5_position5_ID5_another5_more5"

Code for testing (see Maëls solution):

library(data.table)

name <- c("bin1_position1_ID1",
          "bin2_position2_ID2",
          "bin3_position3_ID3",
          "bin4_position4_ID4",
          "bin5_position5_ID5_another5_more5")

dt <- data.table(name)

dt[, c("bin", "position", "ID") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
gernophil
  • 177
  • 2
  • 6

1 Answers1

4

Use tstrsplit with keep = 1:3 to keep only the first three columns:

dt[, c("bins", "positions", "IDs") := tstrsplit(name, "_", fixed = TRUE, keep = 1:3)]
                                name  bin  position  ID
1:                bin1_position1_ID1 bin1 position1 ID1
2:                bin2_position2_ID2 bin2 position2 ID2
3:                bin3_position3_ID3 bin3 position3 ID3
4:                bin4_position4_ID4 bin4 position4 ID4
5: bin5_position5_ID5_another5_more5 bin5 position5 ID5
Maël
  • 45,206
  • 3
  • 29
  • 67
  • Ah, that would work, but I have some columns that have other infos, that I don't want to make to columns. e.g. one column has `"bin5_position5_ID5_another5_more5"`. This raises an error with tstrsplit(). – gernophil Aug 30 '22 at 08:58
  • 2
    You should edit your question to make it apparent – Maël Aug 30 '22 at 09:00
  • If I use `keep = 3`all my columns have the string from the 3rd position. In this case there is `"IDn"` in the columns `bins`, `position` and `ID`, – gernophil Aug 30 '22 at 09:02
  • My dataset has multiple GBs, but I made a short script and placed it in the original post that produces the same problem. – gernophil Aug 30 '22 at 09:12