I need to create a variable that selects, among some specific columns in a dataset, the one that is closest to another specific column

Question

I have a dataset which is similar to this:

data= data.frame(a=c(33,44,55), b= c(99,77,NA,66), 
      var1=c(1,2,3,NA),var2=c(5,6,NA,7),var3=c(8,9,10,NA), x = c(6,5,4,3))

I need to create a column which yelds, for each row, the value among columns var1,var2 and var3 that is closest to column x, ignoring the NA's in var1:var3.

Something like:

closest_x
  5
  6
  3
  7

In my real problem, I have many more columns than this, so I'd like to use starts_with to select the columns to be compared with X (the columns represented as "var1", etc. above).

I've tried creating columns with the modular difference between the column X and the "var" columns, then I've tried something like:

data %>% mutate(pmin = pmin(starts_with("var")))

or

mutate(data, C = pmin(starts_with("var")))

and also

data %>% with(pmin(starts_with("var")))

It says the variable context is not set. Besides that, it would be better if I don't have to create many other variables with this modular difference, and go straight to the closest value to column X.

I've found some very close to what I need in this post: Closest value to a specific column in R

However, I don't know how to apply something similar to my problem due to this fact that I have many more columns and I want to select only those that start with a specific word.

EDIT: I need NA's in the variables to be compared to "x" to be ignored.

EDIT 2: The code with my real dataset worked fine in the past. Now I tried to run it again and it didn't work properly. I tried to find what has changed, or even whether any package has changed, but it doesn't seem to be the case.

Below there is a code to produce a small sample of my real data. Instead of var1, var2, etc. I have ideolparty_A:ideolparty_I and instead of x (the variable to compare with) I have ideol_self.

The solution with max.col was working until a few months ago, with this code:


temp_df <- -abs(cses_pr[cols] - cses_pr$ideol_self)
cses_pr$closest <- cses_pr[cols][cbind(1:nrow(cses_pr), 
                                       max.col(replace(temp_df, is.na(temp_df), -Inf)))]

But now it yields the following code: Error: Subscript `cbind(...)` is a matrix, it must be of type logical. before I can run the last line of code:

cses_pr <- cses_pr %>% mutate (cong_closest = abs(closest-ideol_self))

structure(list(election = c("PER_2000", "PER_2006", "PER_2006", 
"USA_2008", "MEX_2012", "ROU_1996", "MEX_2012", "TWN_2008", "USA_1996", 
"PER_2016", "ARG_2015", "FRA_2012", "MEX_2012", "SRB_2012", "USA_1996", 
"ROU_2014", "ROU_2004", "ROU_2009", "RUS_2000", "ROU_2014", "CHL_1999", 
"BRA_2006", "RUS_2004", "BRA_2002", "TWN_2012", "MEX_2012", "TWN_2008", 
"SRB_2012", "USA_2004", "BRA_2002", "PER_2000", "USA_2008", "ARG_2015", 
"FRA_2012", "PHL_2016", "TWN_2012", "LTU_1997", "URY_2009", "BRA_2006", 
"PER_2006", "MEX_2012", "CHL_1999", "BRA_2010", "PER_2016", "MEX_2000", 
"BRA_2002", "PER_2011", "ROU_2009", "FRA_2012", "TWN_2012", "FRA_2002", 
"PER_2000", "CHL_1999", "PER_2011", "MEX_2006", "ROU_2009", "ROU_1996", 
"BRA_2014", "ROU_1996", "ROU_2014", "ROU_2014", "FRA_2012", "PER_2016", 
"MEX_2006", "USA_2012", "ROU_2009", "ROU_2009", "BRA_2014", "KEN_2013", 
"PHL_2016", "BLR_2001", "BRA_2006", "PER_2016", "FRA_2012", "CHL_2005", 
"CHL_2009", "LTU_1997", "RUS_2000", "ROU_2014", "TWN_2012", "BRA_2006", 
"USA_2008", "USA_2004", "MEX_2012", "ROU_2004", "TWN_2012", "BRA_2014", 
"USA_2008", "TWN_2004", "PER_2000", "MEX_2006", "PHL_2004", "BRA_2002", 
"PER_2011", "CHL_2005", "PER_2006", "RUS_2000", "ARG_2015", "BRA_2010", 
"TWN_2012", "MEX_2006", "ARG_2015", "BRA_2014", "TWN_2004", "BRA_2006", 
"PER_2016", "PHL_2016", "URY_2009", "RUS_2000", "PER_2006", "FRA_2002", 
"BRA_2002", "KEN_2013", "RUS_2004", "PER_2006", "TWN_2012", "PER_2011", 
"PHL_2010", "PER_2006", "FRA_2012", "PHL_2016", "MEX_2000", "RUS_2000", 
"TWN_2004", "BRA_2002", "ARG_2015", "FRA_2012"), ideol_self = c(10, 
NA, 0, 6, 10, NA, 5, 5, 8, 2, 5, 5, 3, NA, 3, 5, 5, 10, 5, NA, 
10, 3, 6, 6, NA, NA, 5, 10, 5, 5, NA, NA, NA, 2, 5, NA, 10, 8, 
5, 6, 10, 5, 10, 0, 10, 3, NA, 9, 5, NA, 10, 6, 5, 7, NA, 6, 
NA, NA, NA, 9, NA, 2, 9, 10, 10, NA, 5, 7, NA, 8, NA, 8, NA, 
5, 6, 0, 6, 0, 7, NA, NA, 3, 2, NA, 7, NA, 4, 1, 4, NA, 6, 6, 
NA, 4, NA, 10, 5, 9, NA, NA, 1, 5, NA, 5, 3, 7, 3, 3, 0, 8, 4, 
0, 5, 6, 5, NA, 6, 10, NA, 7, 7, NA, 3, NA, NA, 4, 1), ideolparty_A = c(5, 
5, 0, 7, 10, NA, NA, 5, NA, 2, 3, 2, 9, 9, NA, 9, 0, 10, NA, 
NA, NA, 6, 7, 2, NA, 9, NA, 8, 7, 6, 5, NA, NA, 0, 8, NA, NA, 
2, NA, 5, 10, NA, 0, NA, 0, 4, NA, 8, 2, NA, 5, 3, NA, 3, 10, 
6, NA, NA, NA, 2, NA, 4, 10, 0, 10, NA, 10, NA, NA, 6, NA, 4, 
NA, 3, 10, 10, NA, NA, 1, NA, NA, 6, 10, NA, 3, NA, NA, 1, 2, 
NA, 8, 6, 3, 3, NA, 7, NA, 9, 6, NA, 10, 4, NA, 3, 7, 6, 5, 3, 
NA, 1, 7, 1, 10, 7, NA, NA, 0, 0, 2, 1, 9, NA, NA, NA, 8, 5, 
1), ideolparty_B = c(9, 5, 10, 5, 1, NA, NA, 5, NA, 7, 6.5, 8, 
1, 5, NA, 5, 10, 0, NA, NA, NA, 6, 2, 7, NA, 9, NA, 6, 5, 4, 
8, NA, NA, 10, 10, NA, NA, 9, NA, 4, 10, NA, 10, NA, 0, 6, NA, 
9, 5, NA, 10, 0, NA, 5, 6, 3, NA, NA, NA, 9, NA, 8, 6, 0, 0, 
NA, 0, NA, NA, 7, NA, 2, NA, 7, 8, 10, NA, NA, 10, NA, NA, 4, 
4, NA, 8, NA, NA, 10, 8, NA, 4, 7, NA, 5, NA, 8, NA, 2.5, 7, 
NA, 0, 8.5, NA, 5, 1, 8, 4, 10, NA, 10, 10, 6, 4, 0, NA, NA, 
4, 10, 0, 8, 1, NA, NA, NA, 10, 8.5, 8), ideolparty_C = c(7, 
7, 10, NA, 1, NA, NA, NA, NA, 2, 5, 3, 0, 0, NA, 8, 10, 0, NA, 
NA, NA, 6, 2, 0, NA, 2, NA, 2, NA, 4, 4, NA, NA, 7, NA, NA, 10, 
5, NA, 4, 0, NA, 7, 0, 10, 2, NA, 9, 10, NA, 3, NA, NA, 5, 10, 
7, NA, NA, NA, 3, NA, 10, 0, 10, NA, NA, 10, NA, NA, NA, NA, 
8, NA, 8, 6, 5, 8, NA, NA, NA, NA, NA, 9, NA, 9, NA, NA, NA, 
7, NA, 5, 6, NA, 7, NA, 0, NA, 4, 3, NA, 0, 4, NA, 6, 7, 0, NA, 
10, NA, 1, 5, NA, 8, 0, NA, NA, 7, 10, 8, 10, NA, NA, NA, NA, 
NA, 6, 10), ideolparty_D = c(7, 6, NA, NA, NA, NA, NA, NA, NA, 
5, NA, 3, 9, 6, NA, NA, 0, 0, NA, NA, NA, 6, 4, 8, NA, 9, NA, 
5, NA, 4, 3, NA, NA, 4, 3, NA, 4, NA, NA, 1, 10, NA, NA, NA, 
10, 7, NA, 3, 2, NA, 7, 0, NA, 6, 7, 0, NA, NA, NA, 2, NA, 2, 
9, 0, NA, NA, 5, NA, NA, 7, NA, 6, NA, 3, 10, 5, 6, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, 5, 5, NA, 7, NA, 0, NA, 
NA, NA, NA, 0, NA, NA, 4, 10, 8, 5, 10, NA, 1, 9, 2, 2, 5, NA, 
NA, 10, 10, NA, 1, 0, NA, NA, NA, NA, NA, 0), ideolparty_E = c(5, 
5, 0, NA, 1, NA, NA, NA, NA, NA, NA, 5, 0, NA, NA, 9, 10, 10, 
NA, NA, NA, 6, 4, NA, NA, 2, NA, 1, NA, NA, 4, NA, NA, 5, 3, 
NA, 8, NA, NA, 0, 0, NA, 10, NA, 0, NA, NA, 6, 5, NA, NA, 0, 
NA, 5, 5, NA, NA, NA, NA, 3, NA, NA, NA, 0, NA, NA, 5, NA, NA, 
7, NA, 4, NA, 4, 5, 2, 6, NA, 10, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 3, NA, 2, 4, NA, 7, NA, 8, NA, 5, NA, NA, 0, 7, NA, 3, 
5, NA, 4, 3, NA, 2, 1, NA, NA, 10, NA, NA, 5, 0, 0, 2, 9, NA, 
NA, NA, NA, 4, 8), ideolparty_F = c(7, 5, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 5, 0, 4, NA, 1, 10, NA, NA, NA, NA, 6, 4, NA, 
NA, 8, NA, 7, NA, NA, 6, NA, NA, 5, 4, NA, NA, NA, NA, NA, 10, 
NA, NA, NA, 0, NA, NA, NA, 5, NA, NA, 3, NA, 7, 8, NA, NA, NA, 
NA, 2, NA, 5, 6, 0, NA, NA, NA, NA, NA, 6, NA, 8, NA, 6, 1, NA, 
NA, NA, 6, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 5, 5, NA, 
10, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, 7, 3, 3, NA, NA, 1, 
7, NA, NA, 5, NA, NA, 2, 5, NA, 1, 2, NA, NA, NA, NA, NA, 2), 
    ideolparty_G = c(NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 7, 
    NA, 0, 7, NA, NA, NA, 0, NA, NA, NA, NA, NA, 7, NA, 2, NA, 
    0, NA, 4, NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, 0, NA, NA, 
    NA, NA, 6, NA, 8, NA, NA, 2, NA, NA, NA, 8, NA, NA, NA, NA, 
    NA, NA, NA, 4, 0, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, 0, NA, 0, NA, NA, 0, 10, NA, NA, 
    NA, 2, NA, NA, NA, 1, 3, 6, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, 10, 8, NA), ideolparty_H = c(NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, 
    NA, NA, NA, 0, NA, NA, NA, NA, NA, 0, NA, NA, 1, NA, NA, 
    NA, NA, 5, NA, NA, NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 
    0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, 5, 3, NA, 0, 7, NA, NA, NA, NA, NA, NA, NA, 
    NA, 8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 9, NA), ideolparty_I = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10, NA, 2, 
    NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, 7, NA, NA, NA, NA, NA, NA, 10, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    9, NA, NA, NA, 4, NA, NA, NA, NA, 5, NA, NA, NA, 1, NA, NA, 
    NA, NA, NA, NA, 4, NA, NA, 2, NA, NA, NA, NA, 6, NA)), row.names = c(NA, 
-127L), class = c("tbl_df", "tbl", "data.frame"))

Ronak Shah · Accepted Answer · 2019-07-02T02:14:21.170

4

Here is one vectorized way using max.col

cols <- grep("^var", names(data))
data$closest_x <- data[cols][cbind(1:nrow(data), 
                      max.col(-abs(data[cols] - data$x)))]

#   a  b var1 var2 var3  x closest_x
#1 33 99   24   15   45 11        15
#2 44 77   12   30   27 22        27
#3 55 66   76   20   15 33        20

Or using apply

data$closest_x <- apply(data, 1, function(p) 
                  p[cols][which.min(abs(p[cols] - p["x"]))])

If there are NA values in the data we can replace them with -Inf and then subset

temp_df <- -abs(data[cols] - data$x)
data$closest_x <- data[cols][cbind(1:nrow(data), 
                   max.col(replace(temp_df, is.na(temp_df), -Inf)))]

edited Jul 02 '19 at 02:14

answered Jul 01 '19 at 02:35

Ronak Shah

377,200
20
156
213

Thanks. First solution worked, but I forgot to mention that I have missing valuesI'd like to be ignored. Would that be hard to do, some sort of "na.rm" thing? I'm trying to combine that solution with the one I fond for [max.col with NA removal](https://stackoverflow.com/questions/39275212/max-col-with-na-removal?answertab=active#tab-top). But some sort of NA.rm would be better so I don't mess with the dataset. Second option didn't work ("non-numeric argument to binary operator"). – Guilherme Pires Arbache Jul 02 '19 at 02:04
@GuilhermePiresArbache yes, We can replace them with `-Inf` and then subset. can you check the updated answer? – Ronak Shah Jul 02 '19 at 02:15
I have tried to work with this code again, since I am resuming the project where I was using it, and unfortunately it is not working with my real dataset. Please see the edit above. Would there be any solution? Appreciate any help. – Guilherme Pires Arbache Feb 11 '21 at 16:52
@GuilhermePiresArbache It is difficult to remember context of answer 2 years down the line. Change `cses_pr` to dataframe and try the answer again. `cses_pr <- data.frame(cses_pr)` – Ronak Shah Feb 11 '21 at 23:35
Yes, that's why I tried to put everything exactly like it was in my new EDIT. Anyway, I can't believe that was the problem! For some reason some other transformation made it something different from data.frame. Thank you very much! – Guilherme Pires Arbache Feb 12 '21 at 00:54

Curt F. · Answer 2 · 2019-07-02T00:04:57.717

1

A "tidy" approach

A more "tidy" solution might be something along these lines.

data %>%

    # reshape data to long format w/ row numbers
    mutate(row = row_number()) %>%
    gather(col, val, starts_with('var')) %>%

    # compute the minimum difference row-by-row
    group_by(row) %>%
    summarize(closest_to_x = val[which.min(abs(val - x))]) %>%

    # the next two lines just take the new column and paste it back onto the original data
    select(closest_to_x) %>%
    bind_cols(data, .)

It is a bit verbose, but I find it fairly readable (YMMV of course). Not sure about performance. It doesn't use max.col() or pmin(), but relies on reformatting the data into a "tidy" format, where the values of all of the columns you care about are put into a single val column.

edited Jul 02 '19 at 00:04

answered Jul 01 '19 at 23:29

Curt F.

4,690
2
22
39

Thanks. Is there a way to make this work while ignoring NAs (please check my edit)? – Guilherme Pires Arbache Jul 02 '19 at 02:46
`which.min()` ignores NAs as a default. This code still works on your edited data. – Curt F. Jul 02 '19 at 15:57
Sorry for taking so long to answer. But this code doesn't work with my real data (see the edit above). It does work fine with the sample data. I'm having warnings like "Error: Can't recycle `..1` (size 4) to match `..2` (size 38390)." – Guilherme Pires Arbache Feb 10 '21 at 23:02

I need to create a variable that selects, among some specific columns in a dataset, the one that is closest to another specific column

2 Answers2

A "tidy" approach