I have 500 TIFFs from which I am extracting all text with pytesseract. I then search the returned string (df['String'], held in a pandas DataFrame) for any of the words in a list (search_list).

This works well for exact matches using the line below:

df['Found'] = df3['String'].str.findall('(' + '|'.join(search_list) + ')')

I want to incorporate fuzzy searching (regex?) so that it also tolerates substitutions, e.g. 'g' instead of 'c', where the OCR was not great. I found the single line of code below, but cannot integrate it into the above successfully. How would I go about this?

regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)

Edit: Note that 'String' is over 500 characters long, whereas the items in 'search_list' are only 10-15 characters long. This works fine with my original code; it just cannot cope with any substitutions.

Edit2 : Example:

String = 'eh house tree unicorn jantern s g w there was 123 treadmill fountain 1 5 funny grash cymbal shampoo'

search =['crash cymbal','unicorn lantern']

I would like both 'crash cymbal' and 'unicorn lantern' to be found using fuzzy logic due to 1 substitution.

  • See [this](https://stackoverflow.com/a/56315491/9081267) answer. – Erfan Sep 23 '21 at 10:20
  • Do you just want to use the current `regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)` in Pandas? Or do you ask if you can allow only specific substitutions (only `g` or `c` can be substituted?) Can you please provide a test string and expected output? – Wiktor Stribiżew Sep 23 '21 at 10:23
  • I want to use the fuzzy functionality, but with 'search_list' and 'string' as per my original code. I should point out that 'string' is up to 500 characters long, and each item in 'search_list' is only 10-15 characters. Therefore I do not believe it can just be a join/merge; it has to be a findall (where it's looking within a longer string) – Mark Heptonstall Sep 23 '21 at 10:26
  • Ok, is `df["Found"] = df3["String"].apply(lambda r: regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", r, overlapped=True))` all you want? I don't get why you have `df` and `df3`, but let's assume they have the same amount of rows. – Wiktor Stribiżew Sep 23 '21 at 10:32
  • Mark, please provide feedback on what your intention is. – Wiktor Stribiżew Sep 23 '21 at 11:04
  • Apologies, it seems I'm not very clear. The ATAGGAGAAA.... is just an example snippet of code I found that uses fuzzy search logic. I need to search df3['String'] for strings contained in a list called 'search_list'. However, df3['String'] is up to 500 characters long, and any member of 'search_list' (only 10-15 characters long) may be contained within it with some possible substitutions, i.e. i to j, c to g, etc. I have successfully managed the search, returning all exact matches, using my original code. I'm struggling with how to allow for character-level substitutions when looking for the items in search_list – Mark Heptonstall Sep 23 '21 at 11:12
  • So, `df["Found"] = df3["String"].apply(lambda r: regex.findall('(?:' + '|'.join(search_list) + '){s<=2:[A-Z]}', r, overlapped=True))`? Extract those strings with up to 2 substitutions among uppercase letters? – Wiktor Stribiżew Sep 23 '21 at 11:17
  • Can I please ask what the ', r,' is doing before overlapped=True, and what the additional characters in '(:?' after findall are for? I only have ( in my original code. – Mark Heptonstall Sep 23 '21 at 12:26
  • Not `:?` but `?:`, the order is very important, but in this case, if used with `re.findall`, it is not relevant, actually. Shall I post as an answer? Does it work for you? – Wiktor Stribiżew Sep 23 '21 at 12:27
  • Did that work as expected? – Wiktor Stribiżew Sep 23 '21 at 12:51
  • No, it does not quite yet, as I am getting hundreds of '','','','' etc matches mixed in with the substituted finds. I thought I may have a blank item (or single/double character length item) in my search list, but I do not. – Mark Heptonstall Oct 01 '21 at 12:36

1 Answer

It seems to me you can use

import regex
#...
rx = fr'(?:{"|".join(search_list)}){{s<=2:[A-Z]}}'
df["Found"] = df3["String"].apply(lambda r: regex.findall(rx, r, overlapped=True))

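The pattern will look like (?:a|b|c|etc){s<=2:[A-Z]} and will match a, b, or c with up to two substituted characters, where each substituted character found in the text must be an uppercase letter. For lowercase text such as the example in the question, the constraint would need to be [a-z], or it can be dropped altogether ({s<=2}). If any search term contains regex metacharacters, escape it (e.g. with regex.escape) before joining.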
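
To make the substitution-only semantics concrete, here is a minimal standard-library sketch that does not use the regex module at all. It slides a window of each term's length over the text and counts differing positions (the Hamming distance), which is roughly what {s<=1} permits. The function name find_with_substitutions and its max_subs parameter are made up for this illustration, and the sample data is the example from the question.

```python
def find_with_substitutions(text, terms, max_subs=1):
    """Return (term, matched_text, start) for every window of `text` that
    differs from a term in at most `max_subs` character positions."""
    hits = []
    for term in terms:
        n = len(term)
        for start in range(len(text) - n + 1):
            window = text[start:start + n]
            # Hamming distance: number of positions where the characters differ
            subs = sum(a != b for a, b in zip(term, window))
            if subs <= max_subs:
                hits.append((term, window, start))
    return hits


# Example data taken from the question
text = ('eh house tree unicorn jantern s g w there was 123 treadmill '
        'fountain 1 5 funny grash cymbal shampoo')
search_list = ['crash cymbal', 'unicorn lantern']

for term, matched, start in find_with_substitutions(text, search_list):
    print(f'{term!r} matched {matched!r} at index {start}')
```

This only handles substitutions (no insertions or deletions) and is slower than a compiled pattern on long strings, but it never returns empty matches and needs no escaping of the search terms. Assuming the same df3 and search_list as in the question, it could be applied per row with df3["String"].apply(lambda s: [m for _, m, _ in find_with_substitutions(s, search_list)]).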

Wiktor Stribiżew