r split a string of data into multiple columns, sorted by individual variables

Question

I have a simple question about cleaning up messy data. I have a dataset that was emailed to me that contains multiple columns, each of which contains a comma separated string of numbers. Traditionally, each of these numbers should be its own variable, but this is not how these datasets are given to me. Here is an example of some data:

indication  treatment
     1,2     3
       2    2,1
      1,3   2,3

Please imagine these datasets containing close to 100 of these columns and thousands of rows, and a varying number of variables in each of these columns. My goal is to import a dataset like this, and then split each column such that each variable in the string is in its own column, but each column is split in a way that each unique variable is sorted into its own column. Like this:

indication_1    indication_2    indication_3    treatment_1 treatment_2 treatment_3
1   1   0   0   0   1
0   1   0   1   1   0
1   0   1   0   1   1

Notice that the column header has changed and the numeric value is listed as a binary 0 or 1, where 1 indicates the presence of the variable.

Ive had issues because the split functions I have been trying require me to know how many columns I need, and then don't sort the variables into their own columns after the split. Its become quite complicated, and requires me to write separate code for each individual column containing a string.

Id like a function that can take a column containing a string, split the data into separate sorted columns, make these columns a binary yes or no, and then alter the column name to indicate both the original column name and the variable in that column. Id like this to be applicable to any column of data so I dont have to rewrite or modify the function for individual columns (assuming all columns are numerical strings with a character title).

Thanks in advance.

The answers here may be helpful https://stackoverflow.com/questions/18474896/reshape-multiple-categorical-variables-to-binary-response-variables — KenHBS, Nov 27 '17 at 16:53

score 3 · Accepted Answer · answered Nov 27 '17 at 16:58

3

We can do an strsplit and then with mtabulate get the frequency

library(qdapTools)
do.call(cbind, lapply(df, function(x) mtabulate(strsplit(x, ","))))
#    indication.1 indication.2 indication.3 treatment.1 treatment.2 treatment.3
#1            1            1            0           0           0           1
#2            0            1            0           1           1           0
#3            1            0            1           0           1           1

answered Nov 27 '17 at 16:58

akrun

874,273
37
540
662

@Konrad `strsplit` needs a `character` class to split. You have `factor` class because by default `data.frame` have `stringsAsFactors= FALSE` Just use `mtabulate(strsplit(as.character(x), ","))` – akrun Nov 27 '17 at 17:11
Konrad, I simply changed my dataframe to a matrix, which makes all of the columns numeric and it worked perfectly. – Phil_T Nov 27 '17 at 18:06

r split a string of data into multiple columns, sorted by individual variables

1 Answers1