I am comparing two data frames using `which` and `==`. The goal is to collect the indices of the rows whose numeric values appear in both data frames. Most of the matches are identified, but two values in particular, which are identical in both, are not recognised. Any ideas why this might be happening?

The outputs below are the two data frames I am comparing. For example, one of the issues is for the value 20005612212.

vulnIDs <- c(20005611101, 20005611102, 20005611103, 20005611104, 20005611105, 
             20005611106, 20005611107, 20005611108, 20005611109, 20005611110, 
             20005611111, 20005611112, 20005611113, 20005611114, 20005611115, 
             20005611116, 20005611117, 20005611118, 20005611119, 20005611120, 
             20005611121, 20005611122, 20005611123, 20005611124, 20005611125, 
             20005611126, 20005611127, 20005611128, 20005611129, 20005611130, 
             20005611131, 20005611132, 20005611133, 20005611134, 20005611135, 
             20005612201, 20005612202, 20005612203, 20005612204, 20005612205, 
             20005612206, 20005612207, 20005612208, 20005612209, 20005612210, 
             20005612211, 20005612212, 20005612213, 20005612214, 20005612215, 
             20005612216, 20005612217, 20005612218, 20005612219, 20005612220, 
             20005612221, 20005612222, 20005612223, 20005612224, 20005612225, 
             20005612226, 20005612227, 20005612228, 20005612229, 20005612230, 
             20005613301, 20005613302, 20005613303, 20005613304, 20005613305)

vulns_nmp <- c(20005612206, 20005612212, 20005612218, 20005612224, 20005612230, 
               20005613301, 20005613302, 20005613303, 20005613304, 2000561330)

If I run the following line,

test <- which(vulnIDs == vulns_nmp)

the output is,

> test
[1] 41 53 65 66 67 68 69

which does not include, for example, 47 (the position of the value 20005612212).

  • Possible floating point mismatch? https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal – Jon Spring Sep 16 '22 at 16:20
  • Unless you can share an example, we'll just be speculating. If you have data in data frame `df1` and `df2` and want to share, say, row 5 of one and row 10 of another, which you expect should match, you could run `dput(df1[5,])` and `dput(df2[10,])` and copy the output code into your question. Then we could create a copy of your data with the same data formats and all, and help diagnose the problem. – Jon Spring Sep 16 '22 at 16:25
  • @JonSpring Thanks. I have updated my question. I hope it is clearer now – Omar Velazquez Sep 16 '22 at 16:59
  • (1) Don't use `dput` inside your `which(..)`: while dput does return its data (invisibly), it also unnecessarily dumps data contents to the console. (2) This is ultimately a duplicate of https://stackoverflow.com/q/15358006/3358272, you should only need `which(vulnIDs$VulnID %in% vulns_nmp$VulnID)`. – r2evans Sep 30 '22 at 13:28

1 Answer

In this situation you should use `%in%` instead of `==`. `==` compares the two vectors element by element, recycling the shorter vector to match the length of the longer one, whereas `%in%` tests, for each element of the left-hand side, whether it appears anywhere in the right-hand side.
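
To make the recycling concrete, here is a minimal sketch on small made-up vectors. The names `ids` and `want` and their values are illustrative assumptions, not the question's data. It shows the vector that `==` effectively compares against, and why a value present in both vectors can still be missed.

```r
# Illustrative vectors (hypothetical values, not the question's data)
ids  <- c(10, 20, 30, 40, 50, 60)
want <- c(20, 50, 60)

# `==` compares position by position against `want` recycled to length 6
recycled <- rep_len(want, length(ids))   # 20 50 60 20 50 60

eq_idx <- which(ids == want)     # positions where ids[i] equals recycled[i]
in_idx <- which(ids %in% want)   # positions of ids whose value occurs anywhere in want

print(recycled)
print(eq_idx)
print(in_idx)
```

Here 20 sits at position 2 of `ids`, but `==` compares it with 50, so it is not reported, while `%in%` finds it. Because 6 is a multiple of 3, R recycles without any warning, which is likely why the same thing goes unnoticed with vectors of length 70 and 10.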

vid <- c(20005611101, 20005611102, 20005611103, 20005611104, 20005611105, 
         20005611106, 20005611107, 20005611108, 20005611109, 20005611110, 
         20005611111, 20005611112, 20005611113, 20005611114, 20005611115, 
         20005611116, 20005611117, 20005611118, 20005611119, 20005611120, 
         20005611121, 20005611122, 20005611123, 20005611124, 20005611125, 
         20005611126, 20005611127, 20005611128, 20005611129, 20005611130, 
         20005611131, 20005611132, 20005611133, 20005611134, 20005611135, 
         20005612201, 20005612202, 20005612203, 20005612204, 20005612205, 
         20005612206, 20005612207, 20005612208, 20005612209, 20005612210, 
         20005612211, 20005612212, 20005612213, 20005612214, 20005612215, 
         20005612216, 20005612217, 20005612218, 20005612219, 20005612220, 
         20005612221, 20005612222, 20005612223, 20005612224, 20005612225, 
         20005612226, 20005612227, 20005612228, 20005612229, 20005612230, 
         20005613301, 20005613302, 20005613303, 20005613304, 20005613305
)

nmp <- c(20005612206, 20005612212, 20005612218, 20005612224, 20005612230, 
         20005613301, 20005613302, 20005613303, 20005613304, 20005613305
)

which(vid==nmp)
#[1] 41 53 65 66 67 68 69 70
which(vid %in% nmp)
#[1] 41 47 53 59 65 66 67 68 69 70
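
If the IDs live in data frame columns, as the comment by r2evans suggests, the same idea should carry over column to column. A small sketch, assuming hypothetical data frames `df_all` and `df_sub` that each have a `VulnID` column (the column name is taken from that comment; the data frame names and values are made up for illustration):

```r
# Hypothetical data frames; only the column name VulnID comes from the comments
df_all <- data.frame(VulnID = c(20005612211, 20005612212, 20005612213, 20005613301))
df_sub <- data.frame(VulnID = c(20005613301, 20005612212))

# Rows of df_all whose VulnID occurs anywhere in df_sub
rows_in_all <- which(df_all$VulnID %in% df_sub$VulnID)

# match() gives, for each ID in df_sub, its row position in df_all (NA if absent)
pos_of_sub <- match(df_sub$VulnID, df_all$VulnID)

print(rows_in_all)
print(pos_of_sub)
```

`which(... %in% ...)` returns positions in the order of the left-hand side, whereas `match()` follows the order of the IDs being looked up, so the choice depends on which ordering is needed. These IDs are whole numbers far below 2^53, so they are stored exactly as doubles and exact comparison is safe here.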
emilliman5