Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

Question

Take two character strings in C or C++, s1 and s2. It is rather trivial to check if one contains the other exactly.

The following will return true in case s2 is a substring of s1.

In C:

strstr(s1, s2)

In C++:

#include <string>
str.find(str2) != string::npos

With boost:

#include <boost/algorithm/string.hpp>
boost::algorithm::contains(s1, s2)

What I am looking for is an analogous way of efficiently (in terms of speed, not memory) finding whether a string is approximately contained in / is approximately a substring of another up to a given difference threshold. Pretty much like the agrep bash command does in Unix systems for finding files.

For example, suppose the function was called approx_contains. Then approx_contains(s1, s2, 4) could return true/false in case s2 is contained within s1 up to a Levenshtein distance of 4.

Searching online, I only found numerous references and questions on how to calculate Levenshtein distance between two strings, or theoretical algorithms for approximate string pattern matching that go beyond simply checking whether or not one string contains another - which here becomes wasteful.

Trying very hard to avoid reinventing the wheel, how could someone do such a checking in C or C++?

@S.M. That is **not** what I want and I even mentioned both the Levenshtein distance and that I easily found it online in my question. Calculating the aka edit distance is one small part of the task I describe. Please slow down on the rush to diminish questions and assume question authors are all dumb. Is it difficult? — 太晚了, Feb 17 '21 at 06:11
@S.M. it would mean running edit distance algorithm at *each position* and *each substring of certain length* so it would not be really that efficient... — Antti Haapala -- Слава Україні, Feb 17 '21 at 06:28
@太晚了 https://en.wikipedia.org/wiki/Approximate_string_matching#Problem_formulation_and_algorithms — Antti Haapala -- Слава Україні, Feb 17 '21 at 06:29
@JoëlHecht I could guess that `aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbaaaaaaaaaaaaaaaaaaaaaaaaaaa` and `bbbb` would need to be a match with approx contains 0, but levenshtein distance is >> 0. — Antti Haapala -- Слава Україні, Feb 17 '21 at 06:32
check this link: https://www-igm.univ-mlv.fr/~lecroq/string/node6.html — Kermit the Frog, Feb 17 '21 at 09:33

maxbachmann · Answer 1 · 2021-02-24T09:54:20.343

2

Another option is the library https://github.com/maxbachmann/rapidfuzz-cpp (I am the author) which includes a similarity algorithm called partial_ratio. It performs the following steps:

searches for the longest common subsequences of the two strings
Iterates over those subsequences and calculates a normalized Levenshtein distance between the shorter string and a substring of the longer string, with len(shorter string), that starts at the start of the subsequence
returns score of the optimal alignment

As an example:

#include "rapidfuzz/fuzz.hpp"
#include <string>
std::string a = "txst";
std::string b = "this is a test";
double score = rapidfuzz::fuzz::partial_ratio(a, b);
// score is 75.0

In this example the similarity of the optimal alignment "txst" <-> "test" is calculated. Since only one substitution is required the similarity is 75%.

Getting edit distance instead of relative similarity

As you noted in the comments you are interested in the edit distance instead of relative similarity. Here is a reimplementation of partial_ratio, that does this:

template <typename Sentence1, typename Sentence2>
std::size_t partial_ratio(const Sentence1& s1, const Sentence2& s2, std::size_t max=-1)
{
  auto s1_view = rapidfuzz::common::to_string_view(s1);
  auto s2_view = rapidfuzz::common::to_string_view(s2);

  if (s1_view.empty() && s2_view.empty()) {
    return 0;
  }

  if (s1_view.empty() || s2_view.empty()) {
    return -1;
  }

  // this swap is performed in the original FuzzyWuzzy implementation, so
  // I use it in my implementation as well. Depending on your goals you
  // might want to remove this swap
  if (s1_view.length() > s2_view.length()) {
    return partial_ratio(s2_view, s1_view, max);
  }

  // Note this is a internal API and so it could change at any time
  // currently this is the slowest part, since my bitparallel
  // implementation of the LCS has a bug and so it is disabled for now
  auto blocks = rapidfuzz::detail::get_matching_blocks(s1_view, s2_view);

  // when there is a full match exit early
  for (const auto& block : blocks) {
    if (block.length == s1_view.length()) {
      return 0;
    }
  }

  // find the edit distance at the places with the longest common subsequence
  std::size_t min_dist = (std::size_t)-1;
  for (const auto& block : blocks) {
    std::size_t long_start = (block.dpos > block.spos) ? block.dpos - block.spos : 0;
    auto long_substr = s2_view.substr(long_start, s1_view.length());

    // Note you could use a different weight here like
     // e.g. {1, 1, 1} for the normal Levenshtein distance
    // only {1, 1, 1} and {1, 1, 2} have very fast implementations
    std::size_t dist = rapidfuzz::string_metric::levenshtein(
        s1_view, long_substr, {1, 1, 2}, max);

    if (dist < min_dist) {
      max = min_dist = dist;
    }
  }

  return min_dist;
}

s1and s2can be any type that can be converted to a rapidfuzz::basic_string_view. Some examples are:

std::basic_string (std::string, ...)
std::basic_string_view (std::string_string, ...) since C++17
c strings like char* (this has to find the length once when creating the string_view)
many other type, that have data() and .size() and use continous memory. Examples are boost::string_view or std::vector

For results with a edit distance above max (size_t)-1) i is returned instead.

edited Feb 24 '21 at 09:54

answered Feb 19 '21 at 05:31

maxbachmann

2,862
1
11
35

upon exploring a bit with your library (which btw is great, very easy to install and use), one problem for my use case is that it does not seem to handle wide chars. And with that I don't think I can find a way to make `partial_ratio("a wider tast", "a wider test")` and `partial_ratio("a wider tést", "a wider test")` return the same. They return, respectively, returns `83.3333` and `91.6667`. – 太晚了 Feb 21 '21 at 08:32
It works on a character level, so it does not understand things like surrogates. However you can e.g. pass in strings with 8/16/32 bit and you can mix them as well (compare e.g. 8 bit string to 32 bit string). `partial_ratio(L"a wider tést", "a wider test")` should return `91.6667` – maxbachmann Feb 21 '21 at 08:55
one thing that still confuses me is how exactly can I use the partial ratio to identify whether `a` is or not a substring of `b` up to a given pre-specified number of edits (ins/del/subs) or something like that. Because `partial_ratio("test two", "this is a new test, this time with the number two")` also returns score 75, but clearly this example requires way more changes to `a` than your original example. – 太晚了 Feb 21 '21 at 09:49
It uses the optimal alignment: "test two" <-> "test, th" and then calculates a weighted Levenshtein distance (Insertion/Deletion=1, Substitution=2) and normalizes the result using `100 - 100 * dist / (len1 + len2)` In your case the aligned strings require `1 Insertion, 1 Deletion and 1 Substitution` -> `Edit Distance 4` -> `100 - 100 * 4 / 16` -> `100-25 = 75` – maxbachmann Feb 21 '21 at 11:47
yeah, which means that for you library to solve my task I would need to know also `len1` and `len2`, with which then I could recover the number of edits and use that as threshold (I can't have sentences like in my last example yield the same result than the ones in your answer). It's a pity, because your solution seems to be faster than my current attempts. Oh well, I will see if I can manage to alter parts of your code to return the number of edits directly, or at least `len1` and `len2` – 太晚了 Feb 22 '21 at 05:56
I updated the answer to contain a version that returns the edit distance instead – maxbachmann Feb 22 '21 at 16:40
Many thanks, I really appreciate all the effort and will test it right away. Just so you know, the code you added needs some corrections: `score_cutoff`, `dict` and `max_ratio` are not defined anymore. I think we can solve that by dropping `(score_cutoff > 100)` (should one use a different condition with `max` to return zero here?), by changing `dict` to `dist` and `max_ratio` to `max`. Also, with the most current code in your repository, note that there is a bug: in line 263 of file `string_metric.hpp` you should change `detail::normalized_generic_levenshtein` by `normalized_levenshtein`. – 太晚了 Feb 24 '21 at 04:48
ok, tested it. It's almost there, something is a bit off in the calculation. To make things easier to talk about, I set edit costs to `{1,1,1,}` throughout `fuzz_impl.hpp`. So here are some string pairs and their results:`("txst", "this is a test")` returns 2 instead of one despite of setting sub cost to 1. Weirdly, `("txst", "test")` returns 0. In fact `("zzzz", "test")` also returns 0. `("testa", "test this")` returns 4. – 太晚了 Feb 24 '21 at 05:42
Ah I forgot to add the normalized generic levenshtein distance from the development branch. Also updated the code in my answer to fix the variable namings and gave it a try locally. It returns the correct results for me now. – maxbachmann Feb 24 '21 at 06:34
excellent, many thanks! I will do some more testing tomorrow. But in the meanwhile, looking at your updated code snippet posted here, I think the early return at `(s1_view.length() > s2_view.length())` should be also dropped for proper results - otherwise, the pair ("testa", "test") would return 0 instead of 1. I am a bit unsure about what to do with `(s1_view.empty() || s2_view.empty())` in this particular scenario – 太晚了 Feb 24 '21 at 06:58
I generated thousands of random strings of different len. In the vast majority, your modified function works perfectly (ins/del/sub costs=1). However, some weird cases suggest a small bug in internal Levenshtein distance calculation. For ex. `("hi", "ice")` returns 2 instead of 1 (deleting "h" suffices); `("lima", "lmao")` returns 2 instead of 1 (deleting "i" suffices) but `("limao", 'lmao")` correctly returns 1. Yet, `("limao", "lmaoo")` returns 2 instead of 1. Also, `("hotus", "housing") returns 2 instead of 1 (del "t" suffices) and interestingly `("hottus", "housing") returns 4 instead of 2 – 太晚了 Feb 24 '21 at 08:13
Could you open a issue on the GitHub repository? I think it would be easier to discuss this over there. Btw e.g. "lima" <-> "lmao" requires a deletion of i and the insertion of o -> edit distance 2 – maxbachmann Feb 24 '21 at 09:46
So far my way of testing the implementation is a fuzz test against a simpler Levenshtein implementation: https://github.com/maxbachmann/RapidFuzz/blob/e8102a4e87aabcb1bcebe360c3827703e11f1ba7/tests/test_hypothesis.py#L13, which did not pick up any errors in the current Levenshtein implementation. – maxbachmann Feb 24 '21 at 09:53
Just to clarify, my examples in last comment where using your modified `partial_ratio` function. That is, they are calculations considering LCS. Hence, `"hi"` is 1 del away from being a substring of `"ice"`, same for `"hotus"` to be a substring of `"house"`, but modified `partial_ratio` returns 2, so on and so forth with other examples. But sure thing, later today I will put up an issue in your repository so we can further discuss! – 太晚了 Feb 24 '21 at 12:29
It does search for the optimal alignment and then uses a substring with the length of the shorter string: `s2_view.substr(long_start, s1_view.length());`. So it calculates the distance of `hi` <-> `ic`. – maxbachmann Feb 24 '21 at 13:54

Kermit the Frog · Answer 2 · 2021-02-24T22:42:17.753

I had a test program I made several years ago, but it's still working. The length of the pattern is limited to the number of bits in the CPU's registers. Described here: https://en.wikipedia.org/wiki/Bitap_algorithm. In the "See also" section on this site is a link to an open source lib, https://en.wikipedia.org/wiki/TRE_(computing)

#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>

using namespace std;

typedef unsigned int     UINT;
typedef unsigned __int64 UINT64;

// #define WIDECHAR

#if defined(WIDECHAR)
typedef wstring       String;
typedef wchar_t       Char;
#define CIN           wcin
#define COUT          wcout
#else
typedef string        String;
typedef unsigned char Char;
#define CIN           cin
#define COUT          cout
#endif

template<typename T> T inputValue(const char *prompt) {
  for(;;) {
    COUT << prompt;
    T value;
    CIN >> value;
    if(CIN) return value;
    CIN.clear();
  }
}

// https://en.wikipedia.org/wiki/Bitap_algorithm

class ShiftOr {
private:
#if defined(WIDECHAR)
  static constexpr size_t s_masklength = (0xffff + 1);
#else
  static constexpr size_t s_masklength = (0xff + 1);
#endif // WIDECHAR
  UINT64   m_s;
  UINT     m_patternLen;
  UINT64  *m_mask;

  static constexpr size_t s_maskSizeInBytes = s_masklength * sizeof(m_mask[0]);

  void initMask(const UINT64 *src = nullptr) {
    if(m_mask == nullptr) {
      m_mask = new UINT64[s_masklength];
    }
    if(src) {
      memcpy(m_mask, src, s_maskSizeInBytes); // copy all value from src into m_mask
    } else {
      memset(m_mask, -1, s_maskSizeInBytes); // set all bits in m_mask-array to 1
    }
  }
  void deallocMask() {
    delete[] m_mask;
  }
public:
  ShiftOr()
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
  }
  ShiftOr(const ShiftOr &src)
    : m_s(         src.m_s         )
    , m_patternLen(src.m_patternLen)
    , m_mask(      nullptr         )
  {
    if(src.m_mask) {
      initMask(src.m_mask);
    }
  }
  ShiftOr(const String &pattern, bool ignoreCase=false)
    : m_s(         0      )
    , m_patternLen(0      )
    , m_mask(      nullptr)
  {
    compilePattern(pattern, ignoreCase);
  }
  ShiftOr &operator=(const ShiftOr &src) {
    m_s          = src.m_s;
    m_patternLen = src.m_patternLen;
    if(src.m_mask) {
      initMask(src.m_mask);
    } else {
      deallocMask();
    }
    return *this;
  }
  virtual ~ShiftOr() {
    deallocMask();
  }
  void     compilePattern(const String &pattern, bool ignoreCase=false);
  intptr_t search(        const String &str                           ) const;
  intptr_t searchApprox(  const String &str, UINT maxErrors           ) const;
};

void ShiftOr::compilePattern(const String &pattern, bool ignoreCase) {
  m_patternLen = (UINT)pattern.length();
  if(m_patternLen >= 64) {
    throw string("pattern too long for shiftor-search. max length is 63");
  }
  initMask();
  for(UINT i = 0; i < m_patternLen; i++) {
    const Char ch = pattern[i];
    m_mask[ch] &= ~((UINT64)1 << i);
    if(ignoreCase) {
      if(iswlower(ch)) {
        m_mask[_toupper(ch)] &= ~((UINT64)1 << i);
      } else if(isupper(ch)) {
        m_mask[_tolower(ch)] &= ~((UINT64)1 << i);
      }
    }
  }
  m_s = (UINT64)1 << m_patternLen;
}

intptr_t ShiftOr::search(const String &str) const {
  const UINT64 maskEnd  = m_s;
  UINT64       s        = ~1;
  const Char *start = (Char*)str.c_str(), *end = start + str.length();
  for(const Char *cp = start; cp < end;) {
    s = (s | m_mask[*(cp++)]) << 1;
    if((s & maskEnd) == 0) {
      return cp - start - m_patternLen;
    }
  }
  return -1;
}

intptr_t ShiftOr::searchApprox(const String &str, UINT maxErrors) const {
  if(maxErrors == 0) {
    return search(str);
  }
  const UINT64     maskEnd  = m_s;
  vector<UINT64>   s;
  for(UINT i = 0; i < maxErrors + 1; i++) {
    s.push_back((UINT64)~1);
  }
  UINT64 *sfirst = s.data(), *slast = sfirst + maxErrors;
  const Char *firstChar = (Char*)str.c_str(), *lastChar = firstChar + str.length();
  for(const Char *cp = firstChar; cp < lastChar;) {
    const UINT64 mask = m_mask[*cp++];
    UINT64      *sp   = sfirst, olds = *sfirst;
    *sp = (olds | mask) << 1;

    while(sp++ < slast) {
      const UINT64 tmp = *sp;
      /* Substitution is all we care about */
      *sp = (olds & (tmp | mask)) << 1;
      olds = tmp;
    }
    if((*slast & maskEnd) == 0) {
      return cp - firstChar - m_patternLen;
    }
  }
  return -1;
}

int main(int argc, char **argv) {
  for(;;) {
    const String   pattern    = inputValue<String>("Enter pattern:");
    const bool     ignoreCase = inputValue<Char>("Ignore case[yn]:") == 'y';
    const ShiftOr  A(pattern, ignoreCase);
    const UINT     maxErrors  = inputValue<UINT>("Enter maxErrors:");
    for(;;) {
      const String text = inputValue<String>("Enter text:");
      if((text.length() > 0) && text[0] == '!') {
        break;
      }
      const intptr_t i = A.searchApprox(text,maxErrors);
      cout << "result:" << i << endl;
    }
  }
  return 0;
}

many thanks for that! I am very eager to try your code, but unfortunately I do not have access to a Windows machine and it seems that `stdafx.h` is a Visual Studio header file. That I think I am able to bypass. However, I cannot find any information on `InputValue.h`. What is that header file? — 太晚了, Feb 24 '21 at 06:40
I really appreciate that. Just worked a bit with your code. It is quite a good implementation, it runs flawlessly and *real* fast. The only downside, which is inherent to the original algorithm version, is that it does not handle deletions and insertions. I am now trying to find enough information to see if I can update it toward that — 太晚了, Feb 24 '21 at 23:58

Efficiently check if a string is an approximate substring of (approximately contrained in) another string, up to a given error threshold?

2 Answers2

Getting edit distance instead of relative similarity