R: pmatch: 'TERESA DEL CA' does not match with 'TERESA DEL#CARMEN'

Question

I have two dataframes with almost identical data:

Test.Takers with 29260 observations and the following column names:

Paternal.Name, Maternal.Name, First.Name, Application.Number

and Every.Student.In.The.Country with 12000000 observations the following column names:

Paternal.Name, Maternal.Name, First.Name, Application.Number

Test.Takers$Application.Number is filled with NA values and I want to fill that field with Application.Numbers found in Every.Student.In.The.Country.

I tried to do this by subsetting both the Paternal.Names and the Maternal.Names from Every.Student.In.The.Country. I would then fill in the Test.Takers$Application.Number with the following code:

Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Taker$First.Name[i]), subset$First.Names)]

This was able to fill in about 2/3 of Test.Takers$Application.Number. After trying to figure out why so many Test.Takers$Application.Number remained NA I found that some names from the Every.Student.In.The.Country$First.Name contain a '#'. I think that the '#' throws off the pmatch function so that a name from Test.Takers$First.Name such as 'TERESA DEL CA' does not match with a name from Every.Student.In.The.Country$First.Name such as 'TERESA DEL#CARMEN'.

Any advice on how to resolve this problem would be awesome. I have a feeling that something along the lines of a regex function might help, but I'm not too sure.

EDIT: Here is some sample code to replicate the problem. Remember that the real data that I am dealing with is pretty big - about 30000 and 12000000 observations each. If you look through this code and find any inefficiencies, then please let me know.

Test.Takers <- data.frame(
    Paternal.Name = c('Last', 'Last', 'Paternal'),
    Maternal.Name = c('Maternal', 'Last', 'Last'),
    First.Name = c('First', 'Name', 'TERESA DEL CA'),
    Application.Number = NA)

Every.Student.In.The.Country <- data.frame(
    Paternal.Name = c('Last', 'Last', 'Last', 'Paternal', 'Paternal', 'Paternal'),
    Maternal.Name = c('Maternal', 'Last', 'Maternal', 'Last', 'Maternal', 'Last'),
    First.Name = c('First', 'Name', 'Whatever', 'TERESA DEL#CARMEN', 'Another', 'Something Else'),
    Application.Number = c(123, 456, 789, 234, 567, 890)
)

#a place holder that will hold a subset of all a selected paternal last names
indexp <- data.frame(Paternal.Name='name')

for(i in 1:nrow(Test.Takers)) {
    namep <- as.character(Test.Takers$Paternal.Name[i])

    #below if statement prevents us from having to subset the paternal lastname unnecessarily

    if(is.na(indexp$Paternal.Name[1]) == T | as.character(indexp$Paternal.Name[1]) != namep) { 
        indexp <- subset(Every.Student.In.The.Country, Paternal.Name == as.character(Test.Takers$Paternal.Name[i]))
    }

    #below if-statement prevents an error that arrises
    #when a paternal last name does not exist
    #in the Every.Student.In.The.Country file

    if(is.na(indexp$Paternal.Name[1]) == F) {


    #group paternal last names by maternal last names
    indexm <- subset(indexp, Maternal.Name == as.character(Test.Takers$Maternal.Name[i]))    

    #find a partial string match to find an exact or similiar first name within the selected
    #last name subset. Attaches a application.number if a match is found

    Test.Takers$Application.Number[i] <- indexm$Application.Number[pmatch(as.character(Test.Takers$First.Name[i]), indexm$First.Name)]     
}}

Welcome to StackOverflow. Please take a look at these tips on how to produce a [minimum, complete and verifyible example](http://stackoverflow.com/help/mcve), as well as this post on [creating a great example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Perhaps the following tips on [asking a good question](http://stackoverflow.com/help/how-to-ask) may also be worth a read. — lmo, Jun 21 '16 at 17:47

score 1 · Accepted Answer · answered Jun 21 '16 at 18:58

1

If the # sign is the only issue you could add duplicates.ok=TRUE in you function Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Taker$First.Name[i]), subset$First.Names),dup=T]

or you could remove the # sign Test.Takers$Application.Number[i] <- subset$Application.Number[pmatch(as.character(Test.Taker$First.Name[i]),gsub("#"," ",subset$First.Names))]

answered Jun 21 '16 at 18:58

Jeonifer

74
2

ok, gsubing worked for the sample code. I'll check it out for the much bigger data.frame that I have. The code takes a long time to run, but I should be able to tell you the results by tomorrow. Thanks a lot! – beemyfriend Jun 21 '16 at 19:43
gsubing worked, but it only found 1/6 of the missing observations. I have to keep on figurring out how to solve this problem... – beemyfriend Jun 22 '16 at 14:31
You probably have other special cases. You could also try both the gsub and dup=T. – Jeonifer Jun 22 '16 at 14:59
The only case which was obvious to me was the names that had '#' in them. It turns out that those names are only 1/6 of the problem :'0(. I'll have to look at other problematic patterns. All the same, thank you for helping me out with this. gsubbing out the '#' helped me find application numbers for about 2000 people. – beemyfriend Jun 22 '16 at 15:09

R: pmatch: 'TERESA DEL CA' does not match with 'TERESA DEL#CARMEN'

1 Answers1