Coalesce columns with capitalised and non capitalised versions of variables names without specifying the variable names

Question

I have a (large) data frame as follows:

library(data.table)
DT <- fread(
"ID country year A b B a
4   NLD   2002  NA   1   NA   0
5   NLD   2002  NA   0   NA   1
6   NLD   2006  NA   1   NA   1
7   NLD   2006  NA   0   NA   0
8   NLD   2006  0   NA   0   NA
9   GBR   2002  0   NA   0   NA
10  GBR   2002  0   NA   0   NA
11  GBR   2002  0   NA   0   NA
12  GBR   2006  1   NA   1   NA
13  GBR   2006  1   NA   0   NA",
header = TRUE)

I would simply like to merge variables A and a, and, B and b.

EDIT: The problem is that I have to do this for more than 1000 variables, so I would like to avoid specifying either the column names which need not to be checked or the ones that do.

I was hoping for a solution that first splits the columns into a group for which there is no non-capitalised alternative, and a group for which there is.

As far as I understand the solution here:

Coalesce columns based on pattern in R

It still requires to provide the variables names for which the case needs to be checked. If I misunderstand this solution, which is very much possible, please let me know. In any case, as explained, I need a solution without specifically specifying the variables.

I found a good start here.

That solution has however a slightly different approach than the one I need.

How do I make such a variable merge conditional on something like tolower(varname) == varname ?

Desired output:

DT <- fread(
"ID country year A B
4   NLD   2002  0  1
5   NLD   2002  1  0
6   NLD   2006  1  1
7   NLD   2006  0  0 
8   NLD   2006  0  0 
9   GBR   2002  0  0 
10  GBR   2002  0  0
11  GBR   2002  0  0
12  GBR   2006  1  1 
13  GBR   2006  1  0 ",
header = TRUE)

I thought I gave you a similar target for this. Did you try it? — Sotos, Mar 10 '20 at 14:37
But the answer you cite doesn't require you to specify each variable, it uses regex for that? How is that not the solution you need? — dario, Mar 10 '20 at 14:39
I looked at the target. Apparently I do not understand the solution well enough. I figured that: `cols = grep("*PID*|*PIN*|*PARCEL*",colnames(sample),ignore.case = TRUE, value = TRUE)` still requires you to specify the column names (`PID*|*PIN*|*PARCEL`), for which you then make sure the case does not matter, while I want to check this for all columns without specifying. I will take another look at it and try to figure it out with the target link. If I figure it out I will close this question. If you could hint me in the right direction, I will close the question afterwards as well. — Tom, Mar 10 '20 at 14:45
I think you got good alternative answers below. Feel free to edit your question if you're still not sure how to do this... It's not always as easy to understand an answer as it is to provide one ;) — dario, Mar 10 '20 at 15:00

John J. · Answer 1 · 2020-03-10T15:46:21.343

3

I can offer a solution using tidyverse functions. This is essentially the same as the solution offered by AntoniosK, but pivot_longer and pivot_wider are the preferred alternatives to spread and gather.

library(dplyr)
library(tidyr)

DT %>% 
    mutate(UNIQUEID = row_number()) %>% 
    mutate_all(as.character) %>% 
    pivot_longer(cols = -UNIQUEID) %>% 
    mutate(name = stringr::str_to_upper(name)) %>% 
    filter(!is.na(value)) %>% 
    pivot_wider(names_from = name, values_from = value) %>% 
    type.convert(as.is=TRUE) %>% select(-UNIQUEID)

h/t @dario for the great suggestions.

edited Mar 10 '20 at 15:46

answered Mar 10 '20 at 14:51

John J.

1,450
1
13
28

1

A minor comment: OP wished the result to use the upper case column names... but I'm just nitpicking now ;) – dario Mar 10 '20 at 15:01
Hi John, thank you very much for your answer. I just posted a comment under AntoniosK's solution. If I understand your solution correctly, it also has the problem that I still need to specify which columns to check right? – Tom Mar 10 '20 at 15:02
@dario Hahaha, that is the least of my problems! – Tom Mar 10 '20 at 15:02
1

Hi @Tom. Unless I'm mistaken (quite possible), this will still work for you. Make sure you have a column that uniquely identifies every row (you can make this yourself), and explicitly don't include that column `pivot_longer(cols = -uniqueID)`. Then, when you pivot_longer and pivot_wider the columns with no match will be unchanged, while the columns with names that vary only by case will be merged. – John J. Mar 10 '20 at 15:21
1

Aah okay, I understand now. I don't specify the columns not to be checked, but only the one that uniquely identify. Thank you for your patience! – Tom Mar 10 '20 at 15:24
What about the integer and character columns? Won't they interfere with pivot_longer? – dario Mar 10 '20 at 15:25
good point, @dario. I've added a line to convert everything to character before pivot_longer(). – John J. Mar 10 '20 at 15:29
1

If we add ` %>% type.convert(as.is=TRUE)` at the end we could get the numeric types back. Nitpicking, again... – dario Mar 10 '20 at 15:32
Excellent solution! – dario Mar 10 '20 at 15:33
1

Could it be that you didn't update the code above with all your improvements? Shouldnt it be `DT %>% mutate(UNIQUEID = row_number()) %>% mutate_all(as.character) %>% pivot_longer(cols = -UNIQUEID) %>% mutate(name = stringr::str_to_upper(name)) %>% filter(!is.na(value)) %>% pivot_wider(names_from = name, values_from = value) %>% type.convert(as.is=TRUE) %>% select(-UNIQUEID)` – dario Mar 10 '20 at 15:38
1

Thanks for all the useful suggestions, dario. I've added them above, giving credit to you. – John J. Mar 10 '20 at 15:47
I'd say: We all did an awesome job ;) – dario Mar 10 '20 at 15:52
Thank you for your help and patience! – Tom Mar 11 '20 at 20:29
In the end I accepted Uwe's answer because I kept running into some variable name issues when applying this solution to my actual data set. For some reason this is not an issue with the data.table solutions. But again, thank you very much for your help. – Tom Mar 11 '20 at 20:34

s_baldur · Answer 2 · 2020-03-10T16:07:48.430

2

A data.table-only solution - using a simple loop instead of reshaping data:

all_cols <- names(DT)
cols <- grep("[A-Z]", all_cols, value = TRUE)
for (col in cols) {
  snc <- all_cols[all_cols == tolower(col)]
  if (length(snc)) {
    DT[, (col) := fcoalesce(.SD), .SDcols = c(snc, col)]
    DT[, (setdiff(snc, col)) := NULL]
  }
}


> DT[]
    ID country year A B
 1:  4     NLD 2002 0 1
 2:  5     NLD 2002 1 0
 3:  6     NLD 2006 1 1
 4:  7     NLD 2006 0 0
 5:  8     NLD 2006 0 0
 6:  9     GBR 2002 0 0
 7: 10     GBR 2002 0 0
 8: 11     GBR 2002 0 0
 9: 12     GBR 2006 1 1
10: 13     GBR 2006 1 0

edited Mar 10 '20 at 16:07

answered Mar 10 '20 at 16:02

s_baldur

29,441
4
36
69

It's good you have posted a `data.table` solution without reshaping. However, there is room for improvement because it will not coalesce more than two columns, e.g., `Cc, cc, cC, CC`. – Uwe Mar 10 '20 at 18:35
@Uwe, right I tailored this to the problem OP presents. It wouldn't be too hard to adopt the code to accommodate your example. – s_baldur Mar 11 '20 at 16:23
@sindri_baldur Thank you very much, works very well on my actual data base. – Tom Mar 11 '20 at 20:30

AntoniosK · Answer 3 · 2020-03-10T15:36:04.543

Assuming that your example dataset represents your general case, this should work:

library(data.table)
library(tidyverse)

DT <- fread(
  "ID country year A b B a
4   NLD   2002  NA   1   NA   0
5   NLD   2002  NA   0   NA   1
6   NLD   2006  NA   1   NA   1
7   NLD   2006  NA   0   NA   0
8   NLD   2006  0   NA   0   NA
9   GBR   2002  0   NA   0   NA
10  GBR   2002  0   NA   0   NA
11  GBR   2002  0   NA   0   NA
12  GBR   2006  1   NA   1   NA
13  GBR   2006  1   NA   0   NA",
  header = TRUE)

# spot the column names to keep as they are
data.frame(x = names(DT), stringsAsFactors = F) %>%  # get actual column names of the dataset
  mutate(y = toupper(x)) %>%                         # get the upper values
  group_by(y) %>%                                    # for each upper value
  filter(n() == 1) %>%                               # count them and keep only the unique columns
  pull(x) -> fix_cols                                # store unique column names

DT %>%
  gather(col_name, value, -fix_cols) %>%             # reshape dataset
  mutate(col_name = toupper(col_name)) %>%           # change column names to upper case
  na.omit() %>%                                      # remove NA rows
  spread(col_name, value)                            # reshape again

#    ID country year A B
# 1   4     NLD 2002 0 1
# 2   5     NLD 2002 1 0
# 3   6     NLD 2006 1 1
# 4   7     NLD 2006 0 0
# 5   8     NLD 2006 0 0
# 6   9     GBR 2002 0 0
# 7  10     GBR 2002 0 0
# 8  11     GBR 2002 0 0
# 9  12     GBR 2006 1 1
# 10 13     GBR 2006 1 0

Hi @AntoniosK, thank you very much for your answer. Is there anyway that I do not have the specify the columns: `ID, country, year`? It poses exactly the same problem as specifying which columns to check for. I would need still need to manually check for which columns need to be put together and which not (for more than 1000 variables). I was trying to find a solution which for every column names checks whether there is a non capitalized version and applies the solution only to those columns. — Tom, Mar 10 '20 at 15:00

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

The OP is using data.table, so the question deserves a data.table answer.

The approach below is similar to sindri_baldur's answer in general but differs in important details. In particular,

it will also coalesce multiple columns like "CC", "cc", "cC" covering the different ways of writing variable names, e.g., upper case, lower case, as well as lower and upper camel case.
it will return a description of the columns which have been coalesced.

library(data.table)
library(magrittr) # piping is used to improve readability
names(DT) %>% 
  data.table(orig = ., lc = tolower(.)) %>% 
  .[, {
    if (.N > 1L) {
      new <- toupper(.BY)
      old <- setdiff(orig, new)
      DT[, (new) := fcoalesce(.SD), .SDcols = orig]
      DT[, (old) := NULL]
      sprintf("Coalesced %s onto %s", toString(old), new)
    }
  }, by = lc]
DT[]

   lc                 V1
1:  a Coalesced a onto A
2:  b Coalesced b onto B

DT[]

    ID country year A B
 1:  4     NLD 2002 0 1
 2:  5     NLD 2002 1 0
 3:  6     NLD 2006 1 1
 4:  7     NLD 2006 0 0
 5:  8     NLD 2006 0 0
 6:  9     GBR 2002 0 0
 7: 10     GBR 2002 0 0
 8: 11     GBR 2002 0 0
 9: 12     GBR 2006 1 1
10: 13     GBR 2006 1 0

For another use case

DT2 <- fread(
  "ID country year A b B a CC cc cC
4   NLD   2002  NA   1   NA   0   1  NA  NA    
5   NLD   2002  NA   0   NA   1  NA   2  NA
6   NLD   2006  NA   1   NA   1  NA  NA   3
7   NLD   2006  NA   0   NA   0  NA  NA  NA  
8   NLD   2006  0   NA   0   NA   1  NA  NA
9   GBR   2002  0   NA   0   NA  NA   2  NA
10  GBR   2002  0   NA   0   NA  NA  NA   3
11  GBR   2002  0   NA   0   NA   1  NA  NA
12  GBR   2006  1   NA   1   NA  NA   2  NA
13  GBR   2006  1   NA   0   NA  NA  NA   3",
  header = TRUE)
DT <- copy(DT2)

above code returns

   lc                       V1
1:  a       Coalesced a onto A
2:  b       Coalesced b onto B
3: cc Coalesced cc, cC onto CC

DT[]

    ID country year A B CC
 1:  4     NLD 2002 0 1  1
 2:  5     NLD 2002 1 0  2
 3:  6     NLD 2006 1 1  3
 4:  7     NLD 2006 0 0 NA
 5:  8     NLD 2006 0 0  1
 6:  9     GBR 2002 0 0  2
 7: 10     GBR 2002 0 0  3
 8: 11     GBR 2002 0 0  1
 9: 12     GBR 2006 1 1  2
10: 13     GBR 2006 1 0  3

Explanation

The column names are turned into a data.table with a additional column lc of the lower case versions of the column names.
Instead of a for loop we use grouping by = and data.table's feature to evaluate any expression, even with side effects. So, DT is updated by reference for each distinct value of lc within the scope of the data.table which was created on-the-fly in step 1 but only if there is more than one column in the group.

Future extensions

This approach can be extended to coalesce columns which use underscore, dots, or blanks "_", ".", " "in its column names, e.g., "var_1", "VAR.1", "Var 1".

Coalesce columns with capitalised and non capitalised versions of variables names without specifying the variable names

4 Answers4

Explanation

Future extensions

Linked