I have 500 TIFFs from which I am extracting all text with pytesseract. I then search the returned string (df['String'], held in a pandas DataFrame) for any of the words in a list (search_list).
This works well using the line below:
df['Found'] = df['String'].str.findall('(' + '|'.join(search_list) + ')')
I want to incorporate fuzzy matching (with regex?) so that it also tolerates substitutions, e.g. 'g' instead of 'c', where the OCR was poor. I found the single line of code below, which uses the third-party regex module, but I cannot integrate it into the line above. How would I go about this?
regex.findall("(ATAGGAGAAGATGATGTATA){s<=2}", "ATAGAGCAAGATGATGTATA", overlapped=True)
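A minimal sketch of one way the two lines might be combined: wrap each search term in its own non-capturing group with a `{s<=N}` substitution limit, then join the groups with `|`. The helper name `build_fuzzy_pattern` and its `max_subs` parameter are hypothetical. The resulting pattern is only meaningful to the third-party `regex` module; pandas' `.str.findall` uses the standard `re` module, which does not understand `{s<=N}`, so the pattern would have to be applied row by row.

```python
import re

def build_fuzzy_pattern(search_list, max_subs=1):
    """Build a pattern string for the third-party `regex` module (hypothetical helper)."""
    parts = []
    for term in search_list:
        # Escape each term so characters such as '.' or '(' are matched literally.
        parts.append("(?:" + re.escape(term) + "){s<=" + str(max_subs) + "}")
    # Each term gets its own substitution budget; alternatives are joined with '|'.
    return "|".join(parts)

pattern = build_fuzzy_pattern(["crash cymbal", "unicorn lantern"], max_subs=1)

# Assumed usage (requires `pip install regex`), applied per row instead of via .str.findall:
#   import regex
#   df['Found'] = df['String'].apply(lambda s: regex.findall(pattern, s))
```

Giving every term its own `{s<=N}` group keeps the substitution limit per term rather than shared across the whole alternation.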
Edit: Note that each 'String' is over 500 characters long, whereas the items in 'search_list' are only 10-15 characters long. This works fine with my original code; it just cannot cope with any substitutions.
Edit 2: Example:
String = 'eh house tree unicorn jantern s g w there was 123 treadmill fountain 1 5 funny grash cymbal shampoo'
search =['crash cymbal','unicorn lantern']
I would like both 'crash cymbal' and 'unicorn lantern' to be found by fuzzy matching, since each differs from the text in String by only 1 substitution ('grash cymbal' and 'unicorn jantern').
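To make the desired behaviour concrete, here is a sketch that uses only the standard library. It is a plain sliding-window character comparison, not the `regex` module's fuzzy matching, and the function name `find_with_substitutions` is hypothetical. It allows substitutions only (no insertions or deletions), which is what a `{s<=1}` limit describes.

```python
def find_with_substitutions(text, terms, max_subs=1):
    """Return (term, matched_text) pairs where matched_text differs from term
    by at most max_subs substituted characters (no insertions or deletions)."""
    hits = []
    for term in terms:
        n = len(term)
        # Slide a window of the term's length across the text.
        for start in range(len(text) - n + 1):
            window = text[start:start + n]
            # Count positions where the window and the term disagree.
            mismatches = sum(a != b for a, b in zip(window, term))
            if mismatches <= max_subs:
                hits.append((term, window))
    return hits

String = 'eh house tree unicorn jantern s g w there was 123 treadmill fountain 1 5 funny grash cymbal shampoo'
search = ['crash cymbal', 'unicorn lantern']

found = find_with_substitutions(String, search, max_subs=1)
# found == [('crash cymbal', 'grash cymbal'), ('unicorn lantern', 'unicorn jantern')]
```

On strings of around 500 characters with 10-15 character terms this brute-force scan should be cheap, but it is meant as a statement of the expected result, not as a tuned implementation.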