I have over 17,000 pages that have been scanned (for a local history archive) which I have OCRed using Tesseract to individual TXT files. I want to be able to search/locate every page containing a search word of more than 3, lower case letters. So for each TXT file I need to:
- Delete all rubbish from the OCR text i.e. non-alphanumeric characters - jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -
- Remove 1, 2 and 3 letter words - jrepl "\b\w{1,3}\b" "" /x /f %%G /O -
- Change all characters to lower case - jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -
- To be able to sort the remaining words they need to be on separate new lines - jrepl "\s" "\n" /x /f %%G /O -
- Finally sort all unique words into alphabetic order and create the modified TXT file - sort /UNIQUE %%G /O %%G
I have a batch file that does the above using JREPL but it is very slow. It has been running for over 100 HOURS and I'm not even half way. Any suggestions so as to speed up the processing? I am running Windows 10. Thanks.