
I have over 17,000 scanned pages (for a local history archive) that I have OCRed with Tesseract into individual TXT files. I want to be able to locate every page containing a given search word of more than 3 lower-case letters. So for each TXT file I need to:

  1. Delete all rubbish from the OCR text, i.e. non-alphanumeric characters: `jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -`
  2. Remove 1-, 2- and 3-letter words: `jrepl "\b\w{1,3}\b" "" /x /f %%G /O -`
  3. Change all characters to lower case: `jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -`
  4. Put each remaining word on its own line so that the words can be sorted: `jrepl "\s" "\n" /x /f %%G /O -`
  5. Finally, sort the unique words into alphabetical order and write the modified TXT file: `sort /UNIQUE %%G /O %%G`

I have a batch file that does the above using JREPL, but it is very slow. It has been running for over 100 hours and I'm not even halfway. Any suggestions on how to speed up the processing? I am running Windows 10. Thanks.
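
For comparison, the five steps listed above can be done in a single pass per file. The following is only a minimal sketch in Python (the language is suggested in the comments, not used in the original batch file). The names `index_words` and `process_file` are hypothetical, the files are assumed to be UTF-8 text, and Python's default sort order may differ slightly from that of Windows `sort`.

```python
import re
from pathlib import Path

# Step 1: anything that is not an ASCII letter, digit or whitespace.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")

def index_words(text, min_len=4):
    """Return the sorted, unique, lower-case words of at least min_len characters."""
    cleaned = NON_ALNUM.sub("", text)            # step 1: drop OCR rubbish
    words = {
        w.lower()                                # step 3: lower case
        for w in cleaned.split()                 # step 4: split on whitespace
        if len(w) >= min_len                     # step 2: drop 1-3 character words
    }
    return sorted(words)                         # step 5: unique and sorted

def process_file(path):
    """Rewrite one OCR text file in place, one word per line (hypothetical helper)."""
    p = Path(path)
    text = p.read_text(encoding="utf-8", errors="ignore")
    p.write_text("\n".join(index_words(text)) + "\n", encoding="utf-8")
```

Calling `process_file` on each `*.txt` file reads and writes the file once, instead of starting JREPL four times and `sort` once per file as the steps above do.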

John
  • I would highly suggest you post this question to the [DosTips forums](https://www.dostips.com/forum/viewtopic.php?t=6044) where the developer maintains the code. I am betting that executing JREPL three times is part of the problem whereas you could combine all that code into one JREPL execution. – Squashman Jan 22 '21 at 17:36
  • So - show us the batch file you're using by editing it into your question, which may prevent this from being closed as a coding-request – Magoo Jan 22 '21 at 17:37
  • This is probably a good case for not using a batch file and instead writing a small, purpose-built utility in Python, Java, C# or something similar. – PaulProgrammer Jan 22 '21 at 17:39
  • What is `jrepl`? I tried it on my Windows 10 computer and it was not recognised. Why don't you use a Linux subsystem on your computer for this kind of task? – Dominique Jan 22 '21 at 17:42
  • As posted, the `jrepl` being used seems to be an executable of some variety, not `jrepl.bat`. Since `/unique` is not an option available on `cmd`'s command-set, I suspect that this is not a Windows batch-file question. I'll de-tag in 15 minutes or so, pending OP's clarification. – Magoo Jan 22 '21 at 18:09
  • @magoo - `/unique` is an undocumented option for `sort` in Windows 10. – SomethingDark Jan 22 '21 at 18:24
  • In a sentence I'm surprised to see myself typing, I strongly recommend using Perl for this; it has regex support built in, and it's *designed* to process large amounts of text quickly. – SomethingDark Jan 22 '21 at 18:25
  • @SomethingDark : Hmm. That'll save me using `sed` - thanks. – Magoo Jan 22 '21 at 18:31
  • @SomethingDark, JREPL does all of its processing with JScript, which has regular expression capability. The batch portion of JREPL is just a wrapper for usability. – Squashman Jan 22 '21 at 18:37
  • @John the Stack Exchange network does have a specific site for [Code Review](https://codereview.stackexchange.com/). Stack Overflow is specifically for help with problems in existing code, meaning code that is not executing or not giving you the output you expect. – Squashman Jan 22 '21 at 18:51
  • Thanks everyone for your comments. I should have included the full version of the batch file I am using, which is: `Setlocal EnableDelayedExpansion` `for %%G in (*.txt) do (` `set old=%%G` `echo !old!` `@echo on` `rem remove non-alphanumeric` `call jrepl "[^a-zA-Z0-9\s]" "" /x /f %%G /O -` `rem remove 1, 2 and 3 letter words` `call jrepl "\b\w{1,3}\b" "" /x /f %%G /O -` `rem all to lowercase` `call jrepl "(\w)" "$1.toLowerCase()" /i /j /x /f %%G /O -` `rem replace spaces with new lines` `call jrepl "\s" "\n" /x /f %%G /O -` `rem reduce to unique words` `sort /UNIQUE %%G /O %%G` `)` `pause` – John Jan 23 '21 at 08:28

1 Answer


Solution?

Since your existing batch does what you want, no doubt testing a replacement will occupy some hours - so:

Split the 17,000 files (or those that remain unprocessed) into as many separate directories as you have cores, then start your existing batch file on each directory. Since it's the weekend, leave the process running overnight. 8 cores? It should be done in 15 hours or so, while you catch up on sleep or gardening or whatever.
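
The split itself can be scripted. The following is a minimal sketch in Python, not part of the original answer. It deals the `*.txt` files of one folder into subfolders, one per core. The subfolder names `part1`, `part2`, ... and the helper names are hypothetical, and the core count is assumed to come from `os.cpu_count()`.

```python
import os
import shutil
from pathlib import Path

def split_round_robin(items, n):
    """Deal items into n lists like cards, so list sizes differ by at most one."""
    return [items[i::n] for i in range(n)]

def distribute(src_dir, n=None):
    """Move the *.txt files in src_dir into subfolders part1..partN; return N."""
    src = Path(src_dir)
    n = n or os.cpu_count() or 1                 # default: one folder per core
    files = sorted(src.glob("*.txt"))            # top-level files only
    for i, chunk in enumerate(split_round_robin(files, n), start=1):
        dest = src / f"part{i}"
        dest.mkdir(exist_ok=True)
        for f in chunk:
            shutil.move(str(f), str(dest / f.name))
    return n
```

After running `distribute` on the folder of unprocessed files, the existing batch file can be started once in each `partN` folder, each in its own console window.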

Magoo