Function in R for getting a 0 in new column when value in another column equals any of the rows in another dataset

Question

I have a list of names in one dataset and a column for 'name' in another dataset. I want R to give me a new column that says 1 if any of the names in my first dataset appear in the column 'name' in that row. In other words, I want it to go row by row and, for the value in a cell of that row, look in my first dataset. If the value appears in my first dataset, I want it coded as a 1 in a new column. Can you help? I apologize for not providing the data structure; it's my first time posting. Here is what I am trying to do.

    myDataSet1 <- as.data.frame( cbind( "firstname" = c("Jenny", "Jane", "Jessica", "Jamie", "Hannah"),
                                        "year" = c(2018, 2019, 2020, 2021, 2022) ) )

    myDataSet2 <- as.data.frame( cbind( "name" = c("Jenny", "John", "Andy", "Jamie", "Hannah", "Donny"),
                                        "dob" = c(1, 2, 3, 4, 5, 6) ) )

I want to know whether each name in the myDataSet1$firstname column appears anywhere in the myDataSet2$name column. In this case, the ideal result would look like this.

myDataSet1

    firstname  year  namematch
    Jenny      2018  1
    Jane       2019  0
    Jessica    2020  0
    Jamie      2021  1
    Hannah     2022  1

Welcome. Please read [this post](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and provide a reproducible example of your data and expected output. Thank you. — user438383, Apr 26 '22 at 15:20
I am closing this question, please provide a reproducible example as suggested by @user438383. — DaveArmstrong, Apr 26 '22 at 17:03

score 0 · Accepted Answer · answered Apr 26 '22 at 15:42

Please supply an example of your data. In the meantime, I'm guessing with some made-up data:

    myDataSet1 <- as.data.frame( cbind( "PersonName" = c("Peter", "Jane", "John", "Louis", "Hannah"), 
                                        "NumberOfDogs" = c(9, 2, 5, 3, 5) ) )
    
    myDataSet2 <- as.data.frame( cbind( "Name" = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"), 
                                        "NumberOfCats" = c(1, 2, 3, 4, 5, 6) ) )
    myDataSet1
    myDataSet2
    
    # Apply an anonymous function to each name in myDataSet1$PersonName: it tests
    # whether the name occurs anywhere in myDataSet2$Name and returns 1 or 0.
    myDataSet1$IsInDataSet2 <- sapply(myDataSet1$PersonName,
                                      function(currentName) as.integer(currentName %in% myDataSet2$Name))

Result

myDataSet1

      PersonName NumberOfDogs IsInDataSet2
    1      Peter            9            0
    2       Jane            2            0
    3       John            5            1  # contained in myDataSet2
    4      Louis            3            1  # contained in myDataSet2
    5     Hannah            5            1  # contained in myDataSet2
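
Since `%in%` is already vectorized over its left-hand side, the same column can probably be computed without `sapply`, which may matter for large inputs. The following is a minimal sketch on the same made-up data, built with `data.frame()` directly rather than `cbind()`; it is one possible alternative, not a tested fix for the asker's real data.

```r
# Same example data as above, built with data.frame() directly.
myDataSet1 <- data.frame(
  PersonName   = c("Peter", "Jane", "John", "Louis", "Hannah"),
  NumberOfDogs = c(9, 2, 5, 3, 5)
)
myDataSet2 <- data.frame(
  Name         = c("Nora", "John", "Andy", "Louis", "Hannah", "Donny"),
  NumberOfCats = c(1, 2, 3, 4, 5, 6)
)

# %in% tests every element of the left-hand vector at once,
# so no explicit loop or sapply() call is needed.
myDataSet1$IsInDataSet2 <- as.integer(myDataSet1$PersonName %in% myDataSet2$Name)

myDataSet1$IsInDataSet2
# [1] 0 0 1 1 1
```

The result is the same 0/1 column as the `sapply` version, computed in a single call.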

answered Apr 26 '22 at 15:42

L D

@L D thank you. This is exactly what I am trying to do: look at the values in "PersonName" and see whether any of them match the row value in my data frame; if so, give me a 1. I'm trying to run the command you wrote in R. It's taking quite a while (one df has 500k rows, and the other has 6k). While I wait: I don't understand what `currentName` refers to in this command. Could you explain? Thanks! – oiuerl Apr 26 '22 at 16:58
Good! The `sapply` function applies a function with a single parameter (here named `currentName`) to each element of a given vector or list (here `myDataSet1$PersonName`). A slightly more readable version would avoid the inline anonymous function by defining `testNamePresence <- function(testedName, where) { ... }`, where `testedName` plays the same role as `currentName`. I chose `current` because it is the name currently being tested by `sapply`. Also, this solution is not great for large data, but if you edit your question to include the data (you can copy mine), others will surely write a faster solution! – L D Apr 26 '22 at 17:18
Thank you so much. I edited, but the question is closed and I messed up so bad that I can't ask again until tomorrow. :( Thanks so much for your help! – oiuerl Apr 26 '22 at 17:26
Good, you're welcome. Yeah, I've noticed, no worries! You can ask in the comments if needed. Also, do not forget to upvote the answer – L D Apr 26 '22 at 17:35
Oh, I tried! But I'm so new it won't let me upvote. R finally finished running the commands, but it didn't work! It returned a 0 for everything :( Thanks for trying :) – oiuerl Apr 26 '22 at 19:21
There aren't many places where this code could go wrong. It is possible, however, that the name columns in the data frame you search and in the source data frame hold _factors_ instead of _strings_. You can check with `str(dataset)`; e.g., `str(iris)` shows that the last column of that data frame is a factor rather than a string. Before R 4.0.0, `data.frame()` and `read.csv()` converted strings to factors by default unless `stringsAsFactors = FALSE` was set... This could explain why it failed. Also, you can test on part of the dataset rather than waiting too long; e.g., `head(dataset, 100)` takes the first 100 rows. – L D Apr 26 '22 at 19:29
Thank you! I tested str on my dataset and it looks like the variable of interest in both datasets is character – oiuerl Apr 26 '22 at 19:34
Good, then I would check the format of the texts (leading or trailing spaces, other whitespace, encoding, lowercase/uppercase, etc.). If this looks fine, test direct equality of the names. For example, take any name from dataset 1 and test whether it can be found in dataset 2: `dataset1[1, "Name"] %in% dataset2[,"Name"]`, or better, check a specific pair that should match: `dataset1[1, "Name"] == dataset2[12345, "Name"]`. If the answer is TRUE, something is wrong with the code; if FALSE, something is wrong with the texts – L D Apr 26 '22 at 19:45
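
The text checks suggested in this thread can be sketched roughly as follows. The data and the helper name `normalizeName` are hypothetical, made up only to show how stray whitespace or differing case can turn every match into a 0; the asker's real data may have a different problem.

```r
# Hypothetical data: names that look alike but differ in spacing or case.
dataset1 <- data.frame(Name = c("Jenny ", "JAMIE", "Jane"))
dataset2 <- data.frame(Name = c("Jenny", "Jamie", "Hannah"))

# Check the column types first.
str(dataset1)
str(dataset2)

# The raw comparison finds no matches at all.
rawMatch <- as.integer(dataset1$Name %in% dataset2$Name)

# Hypothetical helper: strip leading/trailing whitespace and lower-case the text.
normalizeName <- function(x) tolower(trimws(as.character(x)))

# Comparing the normalized values recovers the intended matches.
cleanMatch <- as.integer(normalizeName(dataset1$Name) %in% normalizeName(dataset2$Name))

rawMatch    # 0 0 0
cleanMatch  # 1 1 0
```

On a large data frame, it may be quicker to try this on `head(dataset1, 100)` first, as suggested above, before running it on all rows.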