
I'm looking for an algorithm that matches files to specific "buckets" based on identifier properties of those buckets.

Example:

```javascript
const buckets = [
  { id: 1, identifiers: ['jhdi753hhdy', 'foo', 'u-123'] },
  { id: 2, identifiers: ['hasd834kasd', 'bar', 'u-112'] },
  { id: 3, identifiers: ['adf8wersbay', 'buz', 'u-234', 'u-112'] },
]

bestMatch('file-jhdi7-53hhdy', buckets) // => [{ id: 1, ... }]
bestMatch('file-u112', buckets) // => [{ id: 2, ... }, { id: 3, ... }]
bestMatch('isBUZZfile', buckets) // => [{ id: 3, ... }]
```

The algorithm should return one or more matching buckets, sorted by how well each bucket's identifiers match the given filename.

I've already considered a naive implementation based on string matches, but that will only get me so far.

I'm fairly open to any type of solution, either purely client-side (in memory) or server-side, using the capabilities of a specific search engine.

Some considerations for the algorithm:

  • Buckets may have duplicated identifiers (the same identifier can appear in multiple buckets)
  • Tokens may not be an exact match, but exact matches should be considered "better"
  • If the algorithm can say which identifiers it used to match a result, that would be a very nice plus
  • Having a "confidence" or "weight" for each match would be a very nice plus
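Lacking an off-the-shelf algorithm, one possible client-side sketch of the idea above is to normalize both the filename and the identifiers (lowercase, strip separators), then score exact substring matches higher than normalized ones. The `bestMatch` signature follows the example; the score values 1.0/0.8 are arbitrary assumptions, not a standard weighting:

```javascript
// Sketch: normalize strings (lowercase, strip common separators), then score
// each bucket by its best-matching identifier. Weights are arbitrary.
const normalize = (s) => s.toLowerCase().replace(/[-_.\s]/g, '');

function bestMatch(filename, buckets) {
  const raw = filename.toLowerCase();
  const norm = normalize(filename);
  const results = [];
  for (const bucket of buckets) {
    let best = null;
    for (const id of new Set(bucket.identifiers)) { // dedupe within a bucket
      let score = 0;
      if (raw.includes(id.toLowerCase())) score = 1.0;    // exact substring
      else if (norm.includes(normalize(id))) score = 0.8; // match after normalization
      if (score > 0 && (!best || score > best.score)) best = { identifier: id, score };
    }
    if (best) results.push({ ...bucket, matched: best.identifier, confidence: best.score });
  }
  return results.sort((a, b) => b.confidence - a.confidence);
}
```

This already reports which identifier matched (`matched`) and a crude confidence, covering the two "nice plus" points, though the confidence is ordinal rather than statistically meaningful.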
Bert Goethals
    What criteria, other than string matching, do you want to use for the mapping? From the examples you've shown it looks like you're using partial string matching, and I don't see any other data you could leverage. You'll probably want to delve more deeply into the world of partial string matching to find a solution. – Jim Mischel Jan 04 '23 at 23:53
    Are you looking for [Damerau–Levenshtein distance](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance)? (Used in spell checkers.) – Neil Jan 05 '23 at 02:59
  • @Neil Thanks for that! Damerau–Levenshtein distance looks interesting, but not quite what I need. The identifiers and file names will be generated by other applications, so typos are not a concern. Notation varies: e.g. "foo-bar" should match "foobar" or "foo_bar", but not "fobar". The token may also be embedded in a bigger string like "isitfoobaryescool". I've updated the example to reflect that better. – Bert Goethals Jan 05 '23 at 09:38
    @JimMischel Yeah, lacking a "proper" algorithm I would build an algorithm that standardises the tokens (removal of dashes etc), and then tries to match the original and standardised versions to the target string (and cleaned up versions of the target string). This question really is about existing algorithms that may have already solved this. – Bert Goethals Jan 05 '23 at 09:42
    Sounds like regular expressions, DFA, Boyer–Moore string-search. Perhaps this [list of parser-generators](https://en.wikipedia.org/wiki/Comparison_of_parser_generators) and [What grammar based parser-generator tools exist for ruby?](https://stackoverflow.com/q/2897149/2472827) would be of interest? – Neil Jan 05 '23 at 18:53
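For the "partial string matching" direction raised in the comments, one standard measure is the Sørensen–Dice coefficient over character bigrams, which could serve as a fallback confidence score when an identifier is not a clean substring. A minimal sketch (function names are my own); note that raw bigram similarity alone would also rate "fobar" close to "foobar", so stripping separators first, as described above, still matters:

```javascript
// Sørensen–Dice similarity over character bigrams: 2*|shared| / (|A| + |B|).
// Returns a value in [0, 1]; 1 means identical bigram sets.
const bigrams = (s) => {
  const set = new Set();
  for (let i = 0; i < s.length - 1; i++) set.add(s.slice(i, i + 2));
  return set;
};

function diceSimilarity(a, b) {
  const A = bigrams(a.toLowerCase());
  const B = bigrams(b.toLowerCase());
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return (2 * shared) / (A.size + B.size);
}
```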

0 Answers