How can I do fuzzy string matching within PowerShell scripts?

I have several sets of people's names scraped from different sources, stored in an array. When I add a new name, I would like to compare it with the existing names and, if they fuzzily match, treat them as the same. For example, with a data set of:

@("George Herbert Walker Bush",
  "Barbara Pierce Bush",
  "George Walker Bush",
  "John Ellis (Jeb) Bush"  )

I would like to see the following outputs for the given inputs:

"Barbara Bush" -> @("Barbara Pierce Bush")
"George Takei" -> @("")
"George Bush"  -> @("George Herbert Walker Bush","George Walker Bush")

At a minimum, I would like the matching to be case-insensitive, and also flexible enough to handle some level of misspelling if possible.

As far as I can tell, the standard libraries do not provide such functionality. Is there an easy-to-install module that can accomplish this?

Peter Mortensen
hshib
  • If it was just matching strings inside other strings, [the -Match operator would do](https://stackoverflow.com/questions/18877580/powershell-and-the-contains-operator/18877724#18877724). – Peter Mortensen Jun 19 '18 at 20:27
  • Or the [`-like` operator](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_comparison_operators?view=powershell-6). – Peter Mortensen Jun 19 '18 at 20:41

1 Answer

Searching the PowerShell Gallery for the term "fuzzy", I found this package: Communary.PASM.

It can be installed with:

PS> Install-Package Communary.PASM

The project is hosted on GitHub. I used its examples file for reference.

Here are my examples:

$colors = @("Red", "Orange", "Yellow", "Green", "Blue", "Violet", "Sky Blue" )

PS> $colors | Select-FuzzyString Red

Score Result
----- ------   
  300 Red

This is a perfect match, with a maximum score of 100 for each character.

PS> $colors | Select-FuzzyString gren

Score Result
----- ------
  295 Green 

It tolerates a few missing characters.

PS> $colors | Select-FuzzyString blue

Score Result  
----- ------     
  400 Blue       
  376 Sky Blue

Multiple values can be returned with different scores.

PS> $colors | Select-FuzzyString vioret

# No output

But it does not tolerate even a small misspelling. So I also tried Select-ApproximateString:

PS> $colors | Select-ApproximateString vioret
Violet

This has a different API: it returns only a single match or nothing. It may also return nothing in cases where Select-FuzzyString does return a match.

This was tested with PowerShell Core v6.0.0-beta.9 on macOS and Communary.PASM 1.0.43.
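
If installing a module is not an option, the name matching from the question can also be approximated in plain PowerShell. The following is only a minimal sketch, not part of Communary.PASM: the function names Get-LevenshteinDistance and Select-SimilarName are made up for illustration, and the rule "every word of the query must be within one edit of some word of the candidate" is my own assumption about what "fuzzy" should mean here.

```powershell
# Hypothetical helper: case-insensitive Levenshtein (edit) distance between two strings.
function Get-LevenshteinDistance {
    param([string]$Source, [string]$Target)

    # Normalize case so "Bush" and "bush" have distance 0.
    $s = $Source.ToLowerInvariant()
    $t = $Target.ToLowerInvariant()

    # $previous[$j] = distance between the first ($i - 1) chars of $s and the first $j chars of $t.
    $previous = [int[]](0..$t.Length)
    for ($i = 1; $i -le $s.Length; $i++) {
        $current = [int[]]::new($t.Length + 1)
        $current[0] = $i
        for ($j = 1; $j -le $t.Length; $j++) {
            $cost = if ($s[$i - 1] -eq $t[$j - 1]) { 0 } else { 1 }
            $deletion     = $previous[$j] + 1
            $insertion    = $current[$j - 1] + 1
            $substitution = $previous[$j - 1] + $cost
            $current[$j] = [Math]::Min([Math]::Min($deletion, $insertion), $substitution)
        }
        $previous = $current
    }
    $previous[$t.Length]
}

# Hypothetical helper: returns the candidates in which every word of the query
# is within $MaxDistance edits of at least one word of the candidate.
function Select-SimilarName {
    param([string[]]$Candidates, [string]$Query, [int]$MaxDistance = 1)

    # Split on non-word characters and drop empty fragments.
    $queryWords = $Query -split '\W+' | Where-Object { $_ }
    foreach ($candidate in $Candidates) {
        $candidateWords = $candidate -split '\W+' | Where-Object { $_ }
        # Collect query words that have no close counterpart in this candidate.
        $unmatched = $queryWords | Where-Object {
            $word = $_
            -not ($candidateWords | Where-Object {
                (Get-LevenshteinDistance $word $_) -le $MaxDistance
            })
        }
        if (-not $unmatched) { $candidate }
    }
}

$names = @("George Herbert Walker Bush",
           "Barbara Pierce Bush",
           "George Walker Bush",
           "John Ellis (Jeb) Bush")

Select-SimilarName -Candidates $names -Query "Barbara Bush"  # Barbara Pierce Bush
Select-SimilarName -Candidates $names -Query "George Takei"  # no output
Select-SimilarName -Candidates $names -Query "gorge bsh"     # both George entries
```

The fixed threshold of one edit per word is deliberately crude. Short words are easy to confuse at that distance, so a threshold scaled to the word length may work better on real data.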

abatishchev
hshib