Split CSV delimited columns inside Semicolon-Delimited Dataset into multiple separate columns in R

Question

I have a semicolon-delimited dataset that I'm reading into R, but several of its columns contain comma-separated values that denote groups of features for each listing.

This is what I have:

features;baths;beds;pets;cost
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455
Basketball Court, Cable or Satellite, Internet;2;1;Dogs;950
Basketball Court, Internet;2;1;null;650

And I'd like to turn that into:

features;baths;beds;pets;cost;AC;basketball;cable;clubhouse;internet;cats;dogs;smallAnimals;birds
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455;1;0;1;1;0;1;1;1;1
...

The good news is that the categorical values in the comma-separated data are spelled identically across all records. The trouble is how to actually extract the unique values, split them into columns, and place the appropriate indicator in each. I have an idea of what to do but no idea how.

score 2 · Accepted Answer · answered Dec 15 '20 at 06:53


You can use cSplit_e from splitstackshape to split the comma-separated values into a presence/absence matrix.

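library(magrittr)         # provides the %>% pipe
library(splitstackshape)  # provides cSplit_e

# Split on a comma plus any following whitespace; fill = 0 marks absent values.
cSplit_e(df, 'features', ',\\s*', type = 'character', fixed = FALSE, fill = 0) %>%
  cSplit_e('pets', ',\\s*', type = 'character', fixed = FALSE, fill = 0)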
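For readers who want to see the idea without the package, here is a minimal base-R sketch of the same split-into-indicators technique. It is not how cSplit_e is implemented; `split_indicators` is a hypothetical helper written for this illustration, the sample rows are copied from the question, and the `prefix_value` column naming is an assumption chosen for readability.

```r
# Sample rows from the question; "null" is read as NA via na.strings.
raw <- "features;baths;beds;pets;cost
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455
Basketball Court, Cable or Satellite, Internet;2;1;Dogs;950
Basketball Court, Internet;2;1;null;650"

df <- read.table(text = raw, sep = ";", header = TRUE,
                 na.strings = "null", stringsAsFactors = FALSE)

# Hypothetical helper (not from any package): one 0/1 column per unique value.
split_indicators <- function(x, prefix) {
  # Split each cell on commas and trim surrounding whitespace; NA cells stay NA.
  parts <- lapply(strsplit(x, ","), trimws)
  # Unique values across all rows, with NA dropped.
  vals <- sort(unique(na.omit(unlist(parts))))
  mat <- matrix(0L, nrow = length(parts), ncol = length(vals),
                dimnames = list(NULL, paste(prefix, vals, sep = "_")))
  # Mark which values are present in each row.
  for (i in seq_along(parts)) {
    mat[i, ] <- as.integer(vals %in% parts[[i]])
  }
  mat
}

result <- cbind(df,
                split_indicators(df$features, "features"),
                split_indicators(df$pets, "pets"))
print(names(result))
```

Because the helper drops NA before collecting unique values, a row with no pets simply gets 0 in every pets column rather than producing an extra indicator.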

answered Dec 15 '20 at 06:53

Ronak Shah


Perfect, thank you. – Propagating Dec 15 '20 at 07:01
Actually, this still treats null as a unique value, which isn't ideal but workable as it can be easily dropped after the values are split. – Propagating Dec 15 '20 at 07:08
Those 'null' entries must be string values, which R does not treat as missing. Convert them to `NA` and it should behave as expected: `df[df == 'null'] <- NA` – Ronak Shah Dec 15 '20 at 07:11
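A small sketch of the replacement suggested in the comment above, on a made-up two-column data frame (the values are illustrative, not the asker's full dataset). It only shows that the literal string "null" becomes a real missing value while other cells are untouched.

```r
# Illustrative data: the third row uses the literal string "null".
df <- data.frame(pets = c("Cats,Dogs", "Dogs", "null"),
                 cost = c(1455, 950, 650),
                 stringsAsFactors = FALSE)

# Replace every cell equal to the string "null" with NA.
df[df == "null"] <- NA

print(is.na(df$pets))
```

If the file is read with read.table or read.csv, passing `na.strings = "null"` at read time should achieve the same result without a separate replacement step.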
