I'm trying to do some batch processing using a package called ocrmypdf.

Here is a command that processes a single PDF file:

ocrmypdf input.pdf output.pdf

And here is a command that processes all PDF files in the directory we run it in:

parallel --tag -j 2 ocrmypdf '{}' 'output/{}' ::: *.pdf

What I actually want is to run the following command, which takes one more parameter, for all PDF files in the directory:

ocrmypdf --sidecar txt/input.txt input.pdf out/output.pdf

I tried rewriting the earlier parallel command like this:

parallel --tag -j 2 ocrmypdf --sidecar txt/{}.txt {}.pdf out/{}.pdf ::: *.pdf

But I get the error:

ocrmypdf: error: the following arguments are required: output_pdf

Can someone help me understand what I'm doing wrong? Thanks!

SkV
  • This might help: https://stackoverflow.com/tags/gnu-parallel/info – Cyrus Oct 14 '21 at 22:04
  • Maybe try quoting the whole command (the part beginning after `-j 2` and before `:::`), as Cyrus refers, there's a section on quoting in the manual. Also, maybe try adding in explicitly the options `--output` (or w/e the correct exact spelling is for `ocrmypdf`)? You could also use the parallel option `--joblog logfile` and that might save some clue info for you to help troubleshoot...! – John Collins Oct 14 '21 at 22:16
  • Oh I think you also should get rid of the extra `.pdf` in `{}.pdf`, because that will give .pdf.pdf. Add `--dryrun` to your parallel command, and it will print exactly what commands it will run without actually running them. So maybe your tool is complaining in an ambiguous way because it failed to find any *.pdf.pdf input file(s) and that messed up the subsequent parsing of the output file argument – John Collins Oct 14 '21 at 22:25
  • @SkV I believe my "potential solution" in updated answer should work for you! fyi – John Collins Oct 15 '21 at 19:32

2 Answers

This works for me:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

If it does not work for you:

  • Identify a failing file
  • Run ocrmypdf on the failing file by hand to check whether that works
  • Edit your question to include a link to the failing file

(Also be aware of this bug when running multiple tesseracts: https://github.com/tesseract-ocr/tesseract/issues/3109#issuecomment-703845274)

Ole Tange

Try:

parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf

The .pdf suffixes after the curly brackets (i.e. "{}.pdf") are extraneous: "{}" already includes the extension by default, so "{}.pdf" expands to a name ending in ".pdf.pdf" and the input file(s) cannot be located. For the text file, putting a period inside the brackets ("{.}") removes the extension, so you end up with "....txt" instead of "....pdf.txt" files (where "..." = filenames identical to the inputs).
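
To see the difference between the two replacement strings without involving GNU parallel, the sketch below mimics them with plain shell parameter expansion. The function name show_job is hypothetical, the expansion only approximates "{}" and "{.}" for simple filenames, and the command is printed rather than run.

```shell
# Hypothetical helper: print the ocrmypdf command that would be built for one
# input file, mimicking parallel's {} (full name) and {.} (extension removed).
show_job() {
    full=$1            # what {} expands to, e.g. "scan.pdf"
    base=${full%.*}    # roughly what {.} expands to, e.g. "scan"
    printf 'ocrmypdf --sidecar txt/%s.txt %s out/%s\n' "$base" "$full" "$full"
}

show_job scan.pdf
# prints: ocrmypdf --sidecar txt/scan.txt scan.pdf out/scan.pdf
```

For the real thing, adding --dryrun to the parallel command (as suggested in the comments on the question) prints the commands parallel would run without executing them.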

If the above doesn't work, the likely cause is filenames containing whitespace or other characters that interfere with parallel's parsing (quote characters, parentheses, etc.). In that case, try using a file as the input instead:

Troubleshooting Solution - Create a File as the Input to parallel

I believe this should work. To avoid the fuss with quotes, I first created a file with the names of all the pdfs (paths relative to the current directory). Use ls, or gls if that is how GNU ls from coreutils is installed on your system (e.g. on macOS):

ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list

or

ls --color=none -N *.pdf > ocrmypdf.list

The important thing is that no single quotes are introduced around the filenames printed in the .list file; the quoting should be "literal".

Like this:

Tritone Substitution sheet music.pdf

not like this:

'Tritone Substitution sheet music.pdf'
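
One way to sidestep ls quoting altogether is to let the shell expand the glob and print the names with the printf builtin, which never adds quotes. This is a sketch rather than part of the original approach: the function name write_pdf_list is hypothetical, and it assumes no filename contains a newline.

```shell
# Hypothetical helper: write every *.pdf in the current directory to a list
# file, one literal filename per line (no quotes added).
write_pdf_list() {
    list=${1:-ocrmypdf.list}   # output file; defaults to ocrmypdf.list
    set -- *.pdf               # let the shell expand the glob
    # If nothing matched, the pattern is left unexpanded (or empty), so stop
    # instead of writing a bogus entry.
    [ -e "$1" ] || return 1
    printf '%s\n' "$@" > "$list"
}
```

Calling write_pdf_list with no argument writes ocrmypdf.list in the current directory; it returns a non-zero status if there are no pdfs to list.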

Then you can run the parallel ocrmypdf command, like so:

parallel -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} :::: ocrmypdf.list

Also notice the four colons (::::) instead of the usual three, because the arguments are read from a file. Each line of the file is passed as one complete filename argument, so spaces etc. in the pdf filenames in the input file are not a problem.
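
The one-argument-per-line behaviour can be pictured with a plain sequential loop. This sketch is not how parallel is implemented and runs nothing in parallel; it only illustrates that each line, spaces included, reaches the command as a single argument. The helper name each_line is hypothetical.

```shell
# Hypothetical helper: run a command once per line of a list file, passing the
# whole line (spaces and all) as one argument, similar in spirit to `:::: file`.
each_line() {
    list=$1
    shift
    # IFS= and -r keep surrounding blanks and backslashes intact; the extra
    # test handles a last line that lacks a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        # Redirect stdin so the command cannot swallow the rest of the list.
        "$@" "$name" </dev/null
    done < "$list"
}
```

For instance, each_line ocrmypdf.list echo prints every filename on its own line, which is a quick way to check the list before handing it to parallel.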

John Collins
  • Are there situations where `ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list` and `ls --color=none *.pdf > ocrmypdf.list` do not give the same? – Ole Tange Oct 18 '21 at 06:55
  • @OleTange The issue I observed when testing to see if my answers would work was when file names had whitespace characters in them – John Collins Oct 19 '21 at 00:17
  • But yes, for me, there is a difference. Without piping it to printf via parallel, I get a bunch of leading single quotes surrounding the filenames (_if_ they have any whitespace). If I remember correctly that messed up running the main parallel ocrmypdf command, hence why I had to make creation of the file a more complex line of code! @OleTange – John Collins Oct 19 '21 at 01:09
  • For example, running `ls --color=none *.pdf` in my ~/Downloads folder (I'll share just a tiny snippet, lol, of the very random and various private pdfs I've not managed to properly sort away that have collected there over the years) -- I'll update this in my answer – John Collins Oct 19 '21 at 01:12
  • You have something that causes `ls` not to run `ls --quoting-style=literal */*' '*pdf`. This may be an alias or the QUOTING_STYLE environment variable. I have only seen this when stdout was not redirected (in other words: `ls | cat` would not do the quoting). – Ole Tange Oct 19 '21 at 21:04
  • Indeed it turns out I am using `gls` technically (on macOS terminal), aliased so that `ls=gls`. Another solution is to use the `-N` option if like me you're also using GNU's ls from _coreutils_ – John Collins Oct 19 '21 at 22:15