1

Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.

An example of what I would like is below (although on a much larger scale).

df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
                 x = c(NA,NA,NA,NA,9,NA,NA),
                 y = c(NA,5,12,"hop",NA,2,NA),
                 y = c(2,NA,NA,NA,8,NA,4),
                 z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"), 
                             y = c(2,5,12,"hop",8,2,4),
                             z = c(9,5,NA,3,2,6,NA))

I have tried a number of things including suggestions such as: R: merging columns and the values if they have the same column name Combine column to remove NA's

However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).

I have managed to solve the issue manually by creating new columns that are combinations:

df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])

However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.

Thanks

  • Why have you a df with duplicated columns ? This isn't supposed to happen, if you use mainstream data importing packages – Arnaud Feldmann Mar 23 '22 at 11:35
  • I wouldn't say they are duplicated, they just have the same name (or have the name and .1 or .2 after them sequentially). But, the data in the columns is different. – Sean Wheeler Mar 23 '22 at 11:43
  • Is that the actual naming convention though? As in, field names will have a `.1` or `.2` appended? – Davide Lorino Mar 23 '22 at 11:46
  • Apologies, I should have clarified this in the question. A solution to either would be helpful. I.e. I can change them to the same using colnames(df) <- colnames(df) %>% gsub("\\.[0-9]$", "", .) If they are the same name I figured it would be easier to identify them using some sort of variation on 'duplicates()' but equally if they can be combined via the the .1 or .2 naming convention that would be perfect too. thanks – Sean Wheeler Mar 23 '22 at 11:53
  • Firstly, you have to understand that you can't mix numbers and characters. If you do, all the column is converted to character, and that's not a good choice. Here, you are obliged to convert everything to character if you want to merge things – Arnaud Feldmann Mar 23 '22 at 12:04
  • Ah that would be OK though? Once exported they would be again recognised as numbers e.g. in Excel ? thanks – Sean Wheeler Mar 23 '22 at 12:07
  • @SeanWheeler it depends on the package that you use, but most packages would expect a character column to export to a character in excel. Characters aren't made to store numbers – Arnaud Feldmann Mar 23 '22 at 12:10
  • Ah OK that is something that I will have to watch out for. Thanks for the advice. I appreciate it. – Sean Wheeler Mar 23 '22 at 12:21

1 Answers1

2

here is a base R approach

#split into a named list, nased on colnames befote the .-character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)

#      x   y  z
# 1    2   2  9
# 2    1   5  5
# 3    3  12 NA
# 4    5 hop  3
# 5    9   8  2
# 6   12   2  6
# 7 blah   4 NA

# since you are using mixed formats, the columsn are not of the same class!!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of  3 variables:
# $ x: chr  "2" "1" "3" "5" ...
# $ y: chr  " 2" "5" "12" "hop" ...
# $ z: num  9 5 NA 3 2 6 NA
Wimpel
  • 26,031
  • 1
  • 20
  • 37