
I previously asked a question here about how to use R to automatically "spellcheck" a big list of department names before I export a file and send it off. (The same data can be used as a reproducible example.)

The solution of using Fuzzy Join worked perfectly, and 99% of the time it's exactly what I need. Here's an example of when it works great:


The input looked like Hematology Oncology and it needs to look like Hematology/Oncology. Fuzzy Join figured it out great.

The problem comes when one of the inputs is just too far off for fuzzy join to figure out (I apologize, this example isn't in the reproducible data; it's from my real data):


In this example, Fuzzy join just couldn't figure it out and suggested "Sleep lab" when someone wrote "IS".

Due to the nature of my real data, there's going to be a lot of this. So my question is:

Either WHEN Fuzzy join does the joining in this code:

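library(dplyr)      # %>%, group_by(), slice_min(), distinct()
library(fuzzyjoin)  # stringdist_join()

final_df <- stringdist_join(df, df2,
  by = "ManagementGroup",
  mode = "left",
  ignore_case = FALSE,
  method = "jw",          # Jaro-Winkler distance, ranges from 0 to 1
  max_dist = 99,          # effectively no cutoff, so every input gets candidates
  distance_col = "dist") %>%
  group_by(ManagementGroup.x) %>%
  slice_min(order_by = dist, n = 1) %>%  # keep the closest match per input
  distinct()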

or afterwards, before I export it using:

write_csv(final_df, "finaldf.csv")

Can I have R warn me when any match exceeds a certain threshold of "dist", filter those rows out of the results, and put them into a separate data frame? A plain warning would be enough, but ideally it would be an audible one using "beepr" or something similar.

My end goal is that R will automatically handle 99% of the cases, and I might have to manually input the 1 or 2 that were just too misspelled, but I'll be warned that I need to do that.
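To make the request concrete, here is a minimal base R sketch of the behaviour described above. It is not a confirmed solution: the `final_df` below is a two-row stand-in with made-up `dist` values, and `dist_threshold`, `needs_review`, and `accepted` are hypothetical names introduced only for illustration.

```r
# Illustrative stand-in for final_df; the dist values are invented for this sketch.
final_df <- data.frame(
  ManagementGroup.x = c("Hematology Oncology", "IS"),
  ManagementGroup.y = c("Hematology/Oncology", "Sleep lab"),
  dist = c(0.05, 0.60),
  stringsAsFactors = FALSE
)

# Assumed cutoff; it would need tuning against the real data.
dist_threshold <- 0.2

# A row is suspect if its best match is still too far away, or has no distance at all.
is_suspect <- is.na(final_df$dist) | final_df$dist > dist_threshold

needs_review <- final_df[is_suspect, , drop = FALSE]   # fix these by hand
accepted     <- final_df[!is_suspect, , drop = FALSE]  # safe to export

if (nrow(needs_review) > 0) {
  warning(
    sprintf("%d match(es) had dist above %.2f; check needs_review before exporting.",
            nrow(needs_review), dist_threshold),
    call. = FALSE
  )
  # Optional audible alert, played only if the beepr package is installed.
  if (requireNamespace("beepr", quietly = TRUE)) beepr::beep()
}
```

With this split, only `accepted` would be passed to `write_csv()`, and `needs_review` holds the rows to correct manually. The same row indexing should also work on the grouped tibble produced by the pipeline, as long as dplyr is loaded.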

Joe Crozier
  • You have a `distance_col` called `dist` so just look at values of that below a certain threshold. You will be in trouble though with words that are close in Levenshtein distance but not in meaning, e.g. brain/Brian. I don't know if this is a particular problem in your domain but it is worth keeping in mind. – SamR Apr 06 '22 at 13:52
  • Oh absolutely agreed. I know how to filter that column and/or sort it etc if I was already purposefully looking for it... I guess my question was about how do I get R to throw a warning when I wasn't paying attention. – Joe Crozier Apr 06 '22 at 13:59
