
I would like to ask a question regarding the fuzzyjoin package. I am very new to R, and I promise I read through the README and worked through the examples at https://cran.r-project.org/web/packages/fuzzyjoin/index.html before asking this question.

I have a list of vernacular names that I want to match with plant species names. A simplified version of my data looks like this. data1 has a LocalName column containing many misspelled vernacular names. data2 is the reference table of correct local names and species that the matching should be based on.

data1 <- data.frame(Item=1:5, LocalName=c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP")) 
data1
  Item   LocalName
1    1  BACTERIA F
2    2       BAHIA
3    3      BAIKEA
4    4      BAIKIA
5    5 BAIKIAEA SP
data2 <- data.frame(LocalName=c("ENGOKOM","BAHIA","BAIKIA","BANANIER","BALANITES"), Species=c("Barteria fistulosa","Mitragyna spp","Baikiaea spp", "Musa spp", "Balanites wilsoniana"))
data2
      LocalName              Species
1   ENGOKOM   Barteria fistulosa
2     BAHIA        Mitragyna spp
3    BAIKIA         Baikiaea spp
4  BANANIER             Musa spp
5 BALANITES Balanites wilsoniana

I tried using the stringdist_left_join function, and it managed to match many species correctly. I am being conservative by setting max_dist=1 because in my list, many vernacular names are very similar.

library(dplyr)      # provides the %>% pipe
library(fuzzyjoin)
table <- data1 %>%
  stringdist_left_join(data2, by=c(LocalName="LocalName"), max_dist=1)
table

  Item LocalName.x LocalName.y       Species
1    1  BACTERIA F        <NA>          <NA>
2    2       BAHIA       BAHIA Mitragyna spp
3    3      BAIKEA      BAIKIA  Baikiaea spp
4    4      BAIKIA      BAIKIA  Baikiaea spp
5    5 BAIKIAEA SP        <NA>          <NA>

However, I have one question. As you can see from data1, Item 5 (BAIKIAEA SP) actually matches the Species column of data2 rather than LocalName. I have many entries like this, where the LocalName in data1 is a typo of either a vernacular name or a species name, but I am not sure how to make stringdist_left_join match one column of data1 against two columns of data2. I tried modifying the code like this:

table <- data1%>%
stringdist_left_join(data2, by=c(LocalName="LocalName"|"Species"), max_dist=1)    

but it did not work, citing "Error in "LocalName" | "Species" : operations are possible only for numeric, logical or complex types". Does anyone know whether such matching is possible? Thanks in advance!

James Parsons
ywjong
  • I am hoping that the entries in LocalName.x can be matched to either LocalName.y or Species. For example, my ideal outcome is that Item 5 has the correct species name (Baikiaea spp) in the Species column of table instead of <NA>. Item 5 (BAIKIAEA SP) has no match in the LocalName column of data2, but it does match the Species column of data2 (Baikiaea spp). Is it possible to write code so that the matching is based on two columns of data2 instead of just one? – ywjong May 08 '18 at 19:13
  • I don't know a simpler way to do this, but you can do a second stringdist_left_join between LocalName and Species, then substitute all the NAs. – Luis May 08 '18 at 19:15
  • Based on the messages I see and the use of the plural "columns" in the docs I think he *should* be able to do this, but nothing I'm trying is working. Should ping the package author, David Robinson. – Hack-R May 08 '18 at 19:22
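
One way to get the effect of matching against two columns, in the spirit of the second-join suggestion above, might be to stack LocalName and Species from data2 into a single key column and run one fuzzy join against it. This is only a sketch on the sample data from the question: the lookup table and its Key column are illustrative names, not part of fuzzyjoin, and ignore_case = TRUE is assumed to be acceptable because data1 is upper case while the species names are mixed case.

```r
library(dplyr)
library(fuzzyjoin)

# Sample data from the question; strings kept as character, not factor.
data1 <- data.frame(
  Item = 1:5,
  LocalName = c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  LocalName = c("ENGOKOM", "BAHIA", "BAIKIA", "BANANIER", "BALANITES"),
  Species = c("Barteria fistulosa", "Mitragyna spp", "Baikiaea spp",
              "Musa spp", "Balanites wilsoniana"),
  stringsAsFactors = FALSE
)

# Hypothetical helper table: every LocalName and every Species becomes a
# candidate key ("Key" is an illustrative column name) pointing at its species.
lookup <- bind_rows(
  transmute(data2, Key = LocalName, Species = Species),
  transmute(data2, Key = Species, Species = Species)
)

# One fuzzy join against both kinds of key, compared case-insensitively.
result <- stringdist_left_join(
  data1, lookup,
  by = c(LocalName = "Key"),
  max_dist = 1,
  ignore_case = TRUE
)
result
```

On the sample data this should leave Item 1 unmatched and resolve Item 5 to Baikiaea spp through the species key. On a larger list an entry could match both a vernacular name and a species name, which would produce more than one row for that item, so the result may need de-duplicating afterwards (for example with dplyr's distinct()).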
