I have a simple question about cleaning up messy data. I have a dataset that was emailed to me that contains multiple columns, each of which contains a comma separated string of numbers. Traditionally, each of these numbers should be its own variable, but this is not how these datasets are given to me. Here is an example of some data:
indication treatment
1,2 3
2 2,1
1,3 2,3
Please imagine these datasets containing close to 100 of these columns and thousands of rows, and a varying number of variables in each of these columns. My goal is to import a dataset like this, and then split each column such that each variable in the string is in its own column, but each column is split in a way that each unique variable is sorted into its own column. Like this:
indication_1 indication_2 indication_3 treatment_1 treatment_2 treatment_3
1 1 0 0 0 1
0 1 0 1 1 0
1 0 1 0 1 1
Notice that the column header has changed and the numeric value is listed as a binary 0 or 1, where 1 indicates the presence of the variable.
Ive had issues because the split functions I have been trying require me to know how many columns I need, and then don't sort the variables into their own columns after the split. Its become quite complicated, and requires me to write separate code for each individual column containing a string.
Id like a function that can take a column containing a string, split the data into separate sorted columns, make these columns a binary yes or no, and then alter the column name to indicate both the original column name and the variable in that column. Id like this to be applicable to any column of data so I dont have to rewrite or modify the function for individual columns (assuming all columns are numerical strings with a character title).
Thanks in advance.