-1

We have brands data in multiple columns with column names like "ID","Gender","Race","Country","VAR01","VAR02","VAR04","VAR05","VAR06","VAR08","VAR09","VAR13","VAR14","VAR15","VAR18","VAR19". Our task is to concatenate those brands columns data to a single column separated by a delimiter (;) which we were able to do with the following code.

R Code:

dataset<-sapply(dataset,as.character)
# Replace "NA" with blank
dataset[is.na(dataset)] <- ""
dataset<-as.data.frame(dataset)

#Concatenate columns data with semicolon(;)
dataset$Var_All<-paste(dataset$Var01,";",dataset$Var02,";",dataset$Var04,";",dataset$Var05,";",dataset$Var06,";",dataset$Var08,";",dataset$Var09,";",dataset$Var13,";",dataset$Var14,";",dataset$Var15,";",dataset$Var18,";",dataset$Var19)

#Remove blank spaces befor and after semicolon
dataset$Var_All <- gsub(" ; ", ";", dataset$Var_All)
dataset$Var_All <- gsub("; ", ";", dataset$Var_All)
dataset$Var_All <- gsub(" ;", ";", dataset$Var_All)

# Replace multiple semicolons with one semicolon step by step
dataset$Var_All <- gsub(";;;;;;;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;;", ";", dataset$Var_All)
dataset$Var_All <- gsub(";;", ";", dataset$Var_All)


#Remove beginning and ending semicolons if any 
dataset$Var_All<-gsub("^;+|;+$", "", dataset$Var_All)

Data set:

ID  Gender Race     Country VAR01   VAR02   VAR04   VAR05   VAR06   VAR08   VAR09   VAR13   VAR14   
1   Male   Indian   India   Brand1  NA      Brand2  NA      Brand3  NA      NA      NA      Brand4  
2   Female Indian   India   NA      NA      NA      NA      NA      NA      NA      NA      NA     
3   Male   Indian   India   NA      Brand2  NA      Brand3  NA      NA      Brand5  NA      NA     
4   Male   Indian   India   Brand1  NA      NA      NA      Brand3  NA      NA      NA      Brand4
5   Female Indian   India   NA      Brand2  NA      Brand4  NA      Brand6  NA      Brand7  NA  

However my question is: Is there any way to pull the substring of the variables names i.e. VAR and concatenate those variables as one variable "VAR_All". Currently we are writing manually in paste() function and removing spaces, replacing multiple semicolons to get the desired output.

We would like to know that is there any possibility to write the code in a dynamic way so that it should calculate the number of variables which are starting with "VAR" and concatenate automatically into "VAR_All".

Desired output should be as follows:

ID  VAR_All
1   Brand1;Brand2;Brand3;Brand4
2   
3   Brand2;Brand3;Brand5
4   Brand1;Brand3;Brand4
5   Brand2;Brand4;Brand6;Brand7

Thanks for the help in advance.

  • Or use `gsub`, i.e. `gsub('\\s+' ,';' ,trimws(gsub('NA', '', do.call(paste, df[-c(1:4)]))))` – Sotos Feb 16 '18 at 13:01

1 Answers1

1

1) Use apply like this:

ix <- grep("^VAR", names(dataset))
Paste <- function(x) paste(na.omit(x), collapse = ";")
data.frame(ID = dataset$ID, VAR_All = apply(dataset[ix], 1, Paste))

giving:

  ID                     VAR_All
1  1 Brand1;Brand2;Brand3;Brand4
2  2                            
3  3        Brand2;Brand3;Brand5
4  4        Brand1;Brand3;Brand4
5  5 Brand2;Brand4;Brand6;Brand7

2) Using dplyr/tidyr

library(dplyr)
library(tidyr)

dataset %>%
  pivot_longer(starts_with("VAR")) %>%
  drop_na %>%
  group_by(ID) %>%
  summarize(VAR_All = paste(value, collapse = ";"), .groups = "drop") %>%
  filter(nzchar(VAR_All))
## # A tibble: 4 × 2
##      ID VAR_All                    
##   <int> <chr>                      
## 1     1 Brand1;Brand2;Brand3;Brand4
## 2     3 Brand2;Brand3;Brand5       
## 3     4 Brand1;Brand3;Brand4       
## 4     5 Brand2;Brand4;Brand6;Brand7

Note

The input data used, in reproducible form, is:

Lines <- '
ID  Gender Race     Country VAR01   VAR02   VAR04   VAR05   VAR06   VAR08   VAR09   VAR13   VAR14   
1   Male   Indian   India   Brand1  NA      Brand2  NA      Brand3  NA      NA      NA      Brand4  
2   Female Indian   India   NA      NA      NA      NA      NA      NA      NA      NA      NA     
3   Male   Indian   India   NA      Brand2  NA      Brand3  NA      NA      Brand5  NA      NA     
4   Male   Indian   India   Brand1  NA      NA      NA      Brand3  NA      NA      NA      Brand4
5   Female Indian   India   NA      Brand2  NA      Brand4  NA      Brand6  NA      Brand7  NA  '
dataset <- read.table(text = Lines, header = TRUE)
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • Hi Grothendieck, Thanks for the code. It is absolutely working. But our requirement is little different. Your code is taking all the columns starting from VAR01 to last. But in our original dataset, we are having some other variables also. e.g.: "ID","Gender","Race","Country","VAR01","VAR02","VAR04","VAR05","VAR06","VAR08","VAR09","VAR13","VAR14","PREF01","PREF02","PREF04","PREF05". Now we want to concatenate only the variables starting with substring "VAR". Would like to know is there any way. – K.Dilip Kumar Feb 19 '18 at 07:26
  • See revised version. – G. Grothendieck Feb 19 '18 at 12:42