I have a semicolon-delimited dataset that I'm reading into R, but several of its columns contain comma-separated values that denote groups of features for each listing.

This is what I have:

features;baths;beds;pets;cost
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455
Basketball Court, Cable or Satellite, Internet;2;1;Dogs;950
Basketball Court, Internet;2;1;null;650

And I'd like to turn that into:

features;baths;beds;pets;cost;AC;basketball;cable;clubhouse;internet;cats;dogs;smallAnimals;birds
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455;1;0;1;1;0;1;1;1;1
...

The good news is that the categorical values in the comma-separated fields are identical across all records. The trouble is how to extract the unique values, split them into columns, and fill in the appropriate indicator. I have an idea of what needs to happen but no idea how to implement it.

Propagating

1 Answer

You can use cSplit_e from splitstackshape to split the comma-separated values into a presence/absence matrix.

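library(magrittr)         # provides the %>% pipe
library(splitstackshape)  # provides cSplit_e

# df is the data frame read from the semicolon-delimited file.
# The separator is a regex (fixed = FALSE): a comma plus any trailing spaces.
# fill = 0 writes 0 where a value is absent.
cSplit_e(df, 'features', ',\\s*', type = 'character', fixed = FALSE, fill = 0) %>%
  cSplit_e('pets', ',\\s*', type = 'character', fixed = FALSE, fill = 0)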
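
For comparison, the same presence/absence encoding can be sketched in base R without extra packages. This is only a minimal sketch and not how cSplit_e is implemented: add_indicators is a hypothetical helper, the sample rows are the ones from the question, the "column_value" naming is my own choice, and treating "null" as missing (via na.strings) is an assumption about what that value means.

```r
# Sample rows from the question, read with ';' as the field separator.
# Assumption: "null" means "no value", so it is read as NA.
txt <- "features;baths;beds;pets;cost
AC,Cable or Satellite,Clubhouse;1;1;Cats,Dogs,Small Animals, Birds;1455
Basketball Court, Cable or Satellite, Internet;2;1;Dogs;950
Basketball Court, Internet;2;1;null;650"
df <- read.table(text = txt, sep = ";", header = TRUE,
                 stringsAsFactors = FALSE, na.strings = "null")

# Hypothetical helper: append one 0/1 column per unique value found in `col`.
add_indicators <- function(data, col, sep = ",\\s*") {
  # Split each cell on a comma followed by optional whitespace.
  parts <- strsplit(trimws(as.character(data[[col]])), sep)
  # Drop NA and empty entries so missing cells contribute no values.
  parts <- lapply(parts, function(p) p[!is.na(p) & nzchar(p)])
  # Unique values across all rows become the new columns.
  values <- sort(unique(unlist(parts)))
  ind <- matrix(0L, nrow = nrow(data), ncol = length(values),
                dimnames = list(NULL, paste(col, values, sep = "_")))
  # Mark 1 where the row contains the value, 0 otherwise.
  for (i in seq_along(parts)) {
    ind[i, ] <- as.integer(values %in% parts[[i]])
  }
  cbind(data, ind)
}

out <- add_indicators(add_indicators(df, "features"), "pets")
```

The original features and pets columns are kept, and the new columns are appended to the right, so out can be written back out with write.table(out, sep = ";") if the semicolon-delimited layout is needed.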
Ronak Shah