R: How to locate a column in a large dataframe by using information from another dataframe with less rows

Question

I have a data frame (A) with a column containing some info. I have a larger data frame (B) that contains a column with similar information and I need to detect which column that contains the same data as the column in dataframeA. Because the dataframeB is large, it will be time-consuming to manually look through it though to identify the column. Is there a way that I can use the information from column 'some_info' in DataframeA to find the corresponding column in DataframeB where the information is contained?


dataframeA <- data.frame(some_info = c("a","b","c","d","e") )

dataframeB <- data.frame(id = 1:8, column_to_be_identified = c("a","f","b","c","g", "d","h", "e"), "column_almost_similar_but_not_quite" =c("a","f","b","c","g", "3","h", "e")  )

Basically: Is it possible to create a function or something similar that looks through dataframeB and detects the column(s) that contains exactly the information from the column in dataframeA?

Thanks a lot in advance!

mnist · Accepted Answer · 2021-09-17T10:11:05.530

If I understand correctly and you just want to receive the column name:

dataframeA <- data.frame(some_info = as.POSIXct(Sys.Date() - 1:5))
dataframeA
#>             some_info
#> 1 2021-09-16 02:00:00
#> 2 2021-09-15 02:00:00
#> 3 2021-09-14 02:00:00
#> 4 2021-09-13 02:00:00
#> 5 2021-09-12 02:00:00
class(dataframeA$some_info)
#> [1] "POSIXct" "POSIXt"
dataframeB <- data.frame(id = 1:10, 
                         column_to_be_identified = as.POSIXct(Sys.Date() - 1:10),
                         column_almost_similar_but_not_quite = as.POSIXct(Sys.Date() - 6:15) )
dataframeB
#>    id column_to_be_identified column_almost_similar_but_not_quite
#> 1   1     2021-09-16 02:00:00                 2021-09-11 02:00:00
#> 2   2     2021-09-15 02:00:00                 2021-09-10 02:00:00
#> 3   3     2021-09-14 02:00:00                 2021-09-09 02:00:00
#> 4   4     2021-09-13 02:00:00                 2021-09-08 02:00:00
#> 5   5     2021-09-12 02:00:00                 2021-09-07 02:00:00
#> 6   6     2021-09-11 02:00:00                 2021-09-06 02:00:00
#> 7   7     2021-09-10 02:00:00                 2021-09-05 02:00:00
#> 8   8     2021-09-09 02:00:00                 2021-09-04 02:00:00
#> 9   9     2021-09-08 02:00:00                 2021-09-03 02:00:00
#> 10 10     2021-09-07 02:00:00                 2021-09-02 02:00:00

relevant_column_name <- names(
  which(
    # iterate over all columns
    sapply(dataframeB, function(x) {
      # unique is more efficient for large vectors
      x <- unique(x)
      # are all values of the target vector in the column
      all(dataframeA$some_info %in% x)
    })))

relevant_column_name
#> [1] "column_to_be_identified"

Thank you for the suggestion! When I run the code I only get 'character(0)' out and no column name. Does it mean that there is an error, or is it because the information from the column is not to be found in the dataframe? The dataformat for some_info and the column_to_be_identified are both POSIXct, does that has an impact? I have asked the exact same question to the answer below - just to let you know so I am not keeping two people busy at the same time :) — Mette, Sep 17 '21 at 08:38
I updated the answer but the code still works. it probably means that there is either no exact match or an time-related issue like different time zones, fractional seconds etc. try to generate a [reproducible minimal example](https://stackoverflow.com/q/5963269/8107362). Especially, provide some sample data, e.g. with `dput()` and use the [reprex-package](https://reprex.tidyverse.org/). — mnist, Sep 17 '21 at 10:14
see also this https://stackoverflow.com/questions/41582367/compare-date-time-logical-should-the-date-time-must-be-formatted-by-as-posixct — mnist, Sep 17 '21 at 10:21

score 1 · Answer 2 · answered Sep 16 '21 at 17:55

1

With select from dplyr we can do this

library(dplyr)
dataframeB %>% 
   select(where(~ is.character(.) && 
           all(dataframeA$some_info %in% .))) %>%
   names
[1] "column_to_be_identified"

answered Sep 16 '21 at 17:55

akrun

874,273
37
540
662

Thank you for the suggestion! When I run the code I only get 'character(0)' out and no column name. Does it mean that there is an error, or is it because the information from the column is not to be found in the dataframe? The dataformat for some_info and the column_to_be_identified are both POSIXct, does that has an impact? I have asked the exact same question to the answer above- just to let you know so I am not keeping two people busy at the same time :) – Mette Sep 17 '21 at 08:39
1

@MetteFoss Probably because your POSIXct have some milliseconds difference in each of them and thus not matching entirely – akrun Sep 17 '21 at 16:10

R: How to locate a column in a large dataframe by using information from another dataframe with less rows

2 Answers2