0

I'm very new to R and working on tidying a data set. I have a large number of columns, where some columns (in .CSV file) contain several comma separated names. For example, I need to split and duplicate the column and give the comma-separated-names individually to each column:

enter image description here

However, I may have more complicated situation, where there are several columns (with different numerical values) with the same repeated multiple names. these column should be split (each column for each name) and to the repeated names should be added suffixes ('.1' or even '.2' if they repeated more times), see here:

enter image description here

I am actively exploring how to do it, but still no luck. Any help would be highly appreciated.

Arnold Klein
  • 2,956
  • 10
  • 31
  • 60
  • Please post a minimal & representative example include sample data. If unsure how to do that, see [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). This sounds like a job for `strsplit` or `tidyr::separate_rows`. – Maurits Evers May 31 '18 at 23:23

1 Answers1

1

Here's one way:

First lets create some dummy example data using data.table::fread

library(data.table)    
dt = fread(
"a b c,d e f,g,h
1 2 3 4 5
1 2 3 4 5", sep=' ')
#   a b c,d e f,g,h
#1: 1 2   3 4     5
#2: 1 2   3 4     5

cols = names(dt)

Now we use stringr to count occurences of commas in the names, and add columns accordingly. We use recycling in the matrix statement to fill new adjacent columns with the same values

library(stringr)
dt.new = dt[, lapply(cols, function(x) matrix(get(x), NROW(dt), str_count(x, ',')+1L))] 
names(dt.new) <- unlist(strsplit(cols, ','))
dt.new
#    a b c d e f g h
# 1: 1 2 3 3 4 5 5 5
# 2: 1 2 3 3 4 5 5 5

Similarly, in case you prefer to use a base data.frame rather than data.table we can instead do

dt.new = data.frame(lapply(cols, function(x) matrix(dt[[x]], NROW(dt), str_count(x,',')+1L)))
names(dt.new) <- unlist(strsplit(cols, ','))
dww
  • 30,425
  • 5
  • 68
  • 111
  • great answer, many thanks, however I still have a problem, while reading the .csv file. Each column name which contains two names separated by "," are converted to "." For example: 'P123, Q456' in csv file will be converted into 'P123.Q456'. Any ideas? – Arnold Klein Jun 01 '18 at 09:32