1

I have a dataset which is similar to this:

data= data.frame(a=c(33,44,55), b= c(99,77,NA,66), 
      var1=c(1,2,3,NA),var2=c(5,6,NA,7),var3=c(8,9,10,NA), x = c(6,5,4,3))

I need to create a column which yelds, for each row, the value among columns var1,var2 and var3 that is closest to column x, ignoring the NA's in var1:var3.

Something like:

closest_x
  5
  6
  3
  7

In my real problem, I have many more columns than this, so I'd like to use starts_with to select the columns to be compared with X (the columns represented as "var1", etc. above).

I've tried creating columns with the modular difference between the column X and the "var" columns, then I've tried something like:

data %>% mutate(pmin = pmin(starts_with("var")))

or

mutate(data, C = pmin(starts_with("var")))

and also

data %>% with(pmin(starts_with("var")))

It says the variable context is not set. Besides that, it would be better if I don't have to create many other variables with this modular difference, and go straight to the closest value to column X.

I've found some very close to what I need in this post: Closest value to a specific column in R

However, I don't know how to apply something similar to my problem due to this fact that I have many more columns and I want to select only those that start with a specific word.

EDIT: I need NA's in the variables to be compared to "x" to be ignored.

EDIT 2: The code with my real dataset worked fine in the past. Now I tried to run it again and it didn't work properly. I tried to find what has changed, or even whether any package has changed, but it doesn't seem to be the case.

Below there is a code to produce a small sample of my real data. Instead of var1, var2, etc. I have ideolparty_A:ideolparty_I and instead of x (the variable to compare with) I have ideol_self.

The solution with max.col was working until a few months ago, with this code:


temp_df <- -abs(cses_pr[cols] - cses_pr$ideol_self)
cses_pr$closest <- cses_pr[cols][cbind(1:nrow(cses_pr), 
                                       max.col(replace(temp_df, is.na(temp_df), -Inf)))]

But now it yields the following code: Error: Subscript `cbind(...)` is a matrix, it must be of type logical. before I can run the last line of code:

cses_pr <- cses_pr %>% mutate (cong_closest = abs(closest-ideol_self))

structure(list(election = c("PER_2000", "PER_2006", "PER_2006", 
"USA_2008", "MEX_2012", "ROU_1996", "MEX_2012", "TWN_2008", "USA_1996", 
"PER_2016", "ARG_2015", "FRA_2012", "MEX_2012", "SRB_2012", "USA_1996", 
"ROU_2014", "ROU_2004", "ROU_2009", "RUS_2000", "ROU_2014", "CHL_1999", 
"BRA_2006", "RUS_2004", "BRA_2002", "TWN_2012", "MEX_2012", "TWN_2008", 
"SRB_2012", "USA_2004", "BRA_2002", "PER_2000", "USA_2008", "ARG_2015", 
"FRA_2012", "PHL_2016", "TWN_2012", "LTU_1997", "URY_2009", "BRA_2006", 
"PER_2006", "MEX_2012", "CHL_1999", "BRA_2010", "PER_2016", "MEX_2000", 
"BRA_2002", "PER_2011", "ROU_2009", "FRA_2012", "TWN_2012", "FRA_2002", 
"PER_2000", "CHL_1999", "PER_2011", "MEX_2006", "ROU_2009", "ROU_1996", 
"BRA_2014", "ROU_1996", "ROU_2014", "ROU_2014", "FRA_2012", "PER_2016", 
"MEX_2006", "USA_2012", "ROU_2009", "ROU_2009", "BRA_2014", "KEN_2013", 
"PHL_2016", "BLR_2001", "BRA_2006", "PER_2016", "FRA_2012", "CHL_2005", 
"CHL_2009", "LTU_1997", "RUS_2000", "ROU_2014", "TWN_2012", "BRA_2006", 
"USA_2008", "USA_2004", "MEX_2012", "ROU_2004", "TWN_2012", "BRA_2014", 
"USA_2008", "TWN_2004", "PER_2000", "MEX_2006", "PHL_2004", "BRA_2002", 
"PER_2011", "CHL_2005", "PER_2006", "RUS_2000", "ARG_2015", "BRA_2010", 
"TWN_2012", "MEX_2006", "ARG_2015", "BRA_2014", "TWN_2004", "BRA_2006", 
"PER_2016", "PHL_2016", "URY_2009", "RUS_2000", "PER_2006", "FRA_2002", 
"BRA_2002", "KEN_2013", "RUS_2004", "PER_2006", "TWN_2012", "PER_2011", 
"PHL_2010", "PER_2006", "FRA_2012", "PHL_2016", "MEX_2000", "RUS_2000", 
"TWN_2004", "BRA_2002", "ARG_2015", "FRA_2012"), ideol_self = c(10, 
NA, 0, 6, 10, NA, 5, 5, 8, 2, 5, 5, 3, NA, 3, 5, 5, 10, 5, NA, 
10, 3, 6, 6, NA, NA, 5, 10, 5, 5, NA, NA, NA, 2, 5, NA, 10, 8, 
5, 6, 10, 5, 10, 0, 10, 3, NA, 9, 5, NA, 10, 6, 5, 7, NA, 6, 
NA, NA, NA, 9, NA, 2, 9, 10, 10, NA, 5, 7, NA, 8, NA, 8, NA, 
5, 6, 0, 6, 0, 7, NA, NA, 3, 2, NA, 7, NA, 4, 1, 4, NA, 6, 6, 
NA, 4, NA, 10, 5, 9, NA, NA, 1, 5, NA, 5, 3, 7, 3, 3, 0, 8, 4, 
0, 5, 6, 5, NA, 6, 10, NA, 7, 7, NA, 3, NA, NA, 4, 1), ideolparty_A = c(5, 
5, 0, 7, 10, NA, NA, 5, NA, 2, 3, 2, 9, 9, NA, 9, 0, 10, NA, 
NA, NA, 6, 7, 2, NA, 9, NA, 8, 7, 6, 5, NA, NA, 0, 8, NA, NA, 
2, NA, 5, 10, NA, 0, NA, 0, 4, NA, 8, 2, NA, 5, 3, NA, 3, 10, 
6, NA, NA, NA, 2, NA, 4, 10, 0, 10, NA, 10, NA, NA, 6, NA, 4, 
NA, 3, 10, 10, NA, NA, 1, NA, NA, 6, 10, NA, 3, NA, NA, 1, 2, 
NA, 8, 6, 3, 3, NA, 7, NA, 9, 6, NA, 10, 4, NA, 3, 7, 6, 5, 3, 
NA, 1, 7, 1, 10, 7, NA, NA, 0, 0, 2, 1, 9, NA, NA, NA, 8, 5, 
1), ideolparty_B = c(9, 5, 10, 5, 1, NA, NA, 5, NA, 7, 6.5, 8, 
1, 5, NA, 5, 10, 0, NA, NA, NA, 6, 2, 7, NA, 9, NA, 6, 5, 4, 
8, NA, NA, 10, 10, NA, NA, 9, NA, 4, 10, NA, 10, NA, 0, 6, NA, 
9, 5, NA, 10, 0, NA, 5, 6, 3, NA, NA, NA, 9, NA, 8, 6, 0, 0, 
NA, 0, NA, NA, 7, NA, 2, NA, 7, 8, 10, NA, NA, 10, NA, NA, 4, 
4, NA, 8, NA, NA, 10, 8, NA, 4, 7, NA, 5, NA, 8, NA, 2.5, 7, 
NA, 0, 8.5, NA, 5, 1, 8, 4, 10, NA, 10, 10, 6, 4, 0, NA, NA, 
4, 10, 0, 8, 1, NA, NA, NA, 10, 8.5, 8), ideolparty_C = c(7, 
7, 10, NA, 1, NA, NA, NA, NA, 2, 5, 3, 0, 0, NA, 8, 10, 0, NA, 
NA, NA, 6, 2, 0, NA, 2, NA, 2, NA, 4, 4, NA, NA, 7, NA, NA, 10, 
5, NA, 4, 0, NA, 7, 0, 10, 2, NA, 9, 10, NA, 3, NA, NA, 5, 10, 
7, NA, NA, NA, 3, NA, 10, 0, 10, NA, NA, 10, NA, NA, NA, NA, 
8, NA, 8, 6, 5, 8, NA, NA, NA, NA, NA, 9, NA, 9, NA, NA, NA, 
7, NA, 5, 6, NA, 7, NA, 0, NA, 4, 3, NA, 0, 4, NA, 6, 7, 0, NA, 
10, NA, 1, 5, NA, 8, 0, NA, NA, 7, 10, 8, 10, NA, NA, NA, NA, 
NA, 6, 10), ideolparty_D = c(7, 6, NA, NA, NA, NA, NA, NA, NA, 
5, NA, 3, 9, 6, NA, NA, 0, 0, NA, NA, NA, 6, 4, 8, NA, 9, NA, 
5, NA, 4, 3, NA, NA, 4, 3, NA, 4, NA, NA, 1, 10, NA, NA, NA, 
10, 7, NA, 3, 2, NA, 7, 0, NA, 6, 7, 0, NA, NA, NA, 2, NA, 2, 
9, 0, NA, NA, 5, NA, NA, 7, NA, 6, NA, 3, 10, 5, 6, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, 5, 5, NA, 7, NA, 0, NA, 
NA, NA, NA, 0, NA, NA, 4, 10, 8, 5, 10, NA, 1, 9, 2, 2, 5, NA, 
NA, 10, 10, NA, 1, 0, NA, NA, NA, NA, NA, 0), ideolparty_E = c(5, 
5, 0, NA, 1, NA, NA, NA, NA, NA, NA, 5, 0, NA, NA, 9, 10, 10, 
NA, NA, NA, 6, 4, NA, NA, 2, NA, 1, NA, NA, 4, NA, NA, 5, 3, 
NA, 8, NA, NA, 0, 0, NA, 10, NA, 0, NA, NA, 6, 5, NA, NA, 0, 
NA, 5, 5, NA, NA, NA, NA, 3, NA, NA, NA, 0, NA, NA, 5, NA, NA, 
7, NA, 4, NA, 4, 5, 2, 6, NA, 10, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, 3, NA, 2, 4, NA, 7, NA, 8, NA, 5, NA, NA, 0, 7, NA, 3, 
5, NA, 4, 3, NA, 2, 1, NA, NA, 10, NA, NA, 5, 0, 0, 2, 9, NA, 
NA, NA, NA, 4, 8), ideolparty_F = c(7, 5, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 5, 0, 4, NA, 1, 10, NA, NA, NA, NA, 6, 4, NA, 
NA, 8, NA, 7, NA, NA, 6, NA, NA, 5, 4, NA, NA, NA, NA, NA, 10, 
NA, NA, NA, 0, NA, NA, NA, 5, NA, NA, 3, NA, 7, 8, NA, NA, NA, 
NA, 2, NA, 5, 6, 0, NA, NA, NA, NA, NA, 6, NA, 8, NA, 6, 1, NA, 
NA, NA, 6, NA, NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, 5, 5, NA, 
10, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, 7, 3, 3, NA, NA, 1, 
7, NA, NA, 5, NA, NA, 2, 5, NA, 1, 2, NA, NA, NA, NA, NA, 2), 
    ideolparty_G = c(NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 7, 
    NA, 0, 7, NA, NA, NA, 0, NA, NA, NA, NA, NA, 7, NA, 2, NA, 
    0, NA, 4, NA, NA, NA, NA, NA, NA, 4, NA, NA, NA, 0, NA, NA, 
    NA, NA, 6, NA, 8, NA, NA, 2, NA, NA, NA, 8, NA, NA, NA, NA, 
    NA, NA, NA, 4, 0, NA, NA, 5, NA, NA, NA, NA, NA, NA, NA, 
    NA, 1, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, 0, NA, 0, NA, NA, 0, 10, NA, NA, 
    NA, 2, NA, NA, NA, 1, 3, 6, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, 10, 8, NA), ideolparty_H = c(NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, 6, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, 3, NA, NA, NA, 
    NA, NA, NA, 0, NA, NA, NA, NA, NA, 0, NA, NA, 1, NA, NA, 
    NA, NA, 5, NA, NA, NA, 7, NA, NA, NA, NA, NA, NA, NA, NA, 
    0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, 5, 3, NA, 0, 7, NA, NA, NA, NA, NA, NA, NA, 
    NA, 8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, 9, NA), ideolparty_I = c(NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10, NA, 2, 
    NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 0, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    NA, NA, 7, NA, NA, NA, NA, NA, NA, 10, NA, NA, NA, NA, NA, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
    9, NA, NA, NA, 4, NA, NA, NA, NA, 5, NA, NA, NA, 1, NA, NA, 
    NA, NA, NA, NA, 4, NA, NA, 2, NA, NA, NA, NA, 6, NA)), row.names = c(NA, 
-127L), class = c("tbl_df", "tbl", "data.frame"))

2 Answers2

4

Here is one vectorized way using max.col

cols <- grep("^var", names(data))
data$closest_x <- data[cols][cbind(1:nrow(data), 
                      max.col(-abs(data[cols] - data$x)))]

#   a  b var1 var2 var3  x closest_x
#1 33 99   24   15   45 11        15
#2 44 77   12   30   27 22        27
#3 55 66   76   20   15 33        20

Or using apply

data$closest_x <- apply(data, 1, function(p) 
                  p[cols][which.min(abs(p[cols] - p["x"]))])

If there are NA values in the data we can replace them with -Inf and then subset

temp_df <- -abs(data[cols] - data$x)
data$closest_x <- data[cols][cbind(1:nrow(data), 
                   max.col(replace(temp_df, is.na(temp_df), -Inf)))]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks. First solution worked, but I forgot to mention that I have missing valuesI'd like to be ignored. Would that be hard to do, some sort of "na.rm" thing? I'm trying to combine that solution with the one I fond for [max.col with NA removal](https://stackoverflow.com/questions/39275212/max-col-with-na-removal?answertab=active#tab-top). But some sort of NA.rm would be better so I don't mess with the dataset. Second option didn't work ("non-numeric argument to binary operator"). – Guilherme Pires Arbache Jul 02 '19 at 02:04
  • @GuilhermePiresArbache yes, We can replace them with `-Inf` and then subset. can you check the updated answer? – Ronak Shah Jul 02 '19 at 02:15
  • I have tried to work with this code again, since I am resuming the project where I was using it, and unfortunately it is not working with my real dataset. Please see the edit above. Would there be any solution? Appreciate any help. – Guilherme Pires Arbache Feb 11 '21 at 16:52
  • @GuilhermePiresArbache It is difficult to remember context of answer 2 years down the line. Change `cses_pr` to dataframe and try the answer again. `cses_pr <- data.frame(cses_pr)` – Ronak Shah Feb 11 '21 at 23:35
  • Yes, that's why I tried to put everything exactly like it was in my new EDIT. Anyway, I can't believe that was the problem! For some reason some other transformation made it something different from data.frame. Thank you very much! – Guilherme Pires Arbache Feb 12 '21 at 00:54
1

A "tidy" approach

A more "tidy" solution might be something along these lines.

data %>%

    # reshape data to long format w/ row numbers
    mutate(row = row_number()) %>%
    gather(col, val, starts_with('var')) %>%

    # compute the minimum difference row-by-row
    group_by(row) %>%
    summarize(closest_to_x = val[which.min(abs(val - x))]) %>%

    # the next two lines just take the new column and paste it back onto the original data
    select(closest_to_x) %>%
    bind_cols(data, .)

It is a bit verbose, but I find it fairly readable (YMMV of course). Not sure about performance. It doesn't use max.col() or pmin(), but relies on reformatting the data into a "tidy" format, where the values of all of the columns you care about are put into a single val column.

Curt F.
  • 4,690
  • 2
  • 22
  • 39