Try:
parallel --tag -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} ::: *.pdf
The .pdf suffixes after the curly brackets (i.e. "{}.pdf") are extraneous and will prevent the input files from being located, because "{}" already includes the extension by default. For the text sidecar, putting the period inside the brackets ("{.}") strips the extension, so you'll end up with "....txt" instead of "....pdf.txt" files (where "..." is the same filename as the input).
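To make the difference between the two replacement strings concrete, here is a rough plain-shell analogue. It does not call parallel at all: the variable names are made up for illustration, and `${f%.*}` is only assumed to match what "{.}" produces for a simple name with a single extension like this one.

```shell
# Plain-shell analogue of parallel's {} and {.} (illustrative only).
f='Tritone Substitution sheet music.pdf'   # example filename

full="$f"        # like {}: the whole argument, extension included
stem="${f%.*}"   # like {.}: the argument with its last extension removed

# The paths the ocrmypdf command would receive for this file:
printf '%s\n' "out/$full" "txt/$stem.txt"
```

So appending ".pdf" to "{}" would ask for a file ending in ".pdf.pdf", which does not exist.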
If the command above doesn't work, the likely cause is filenames containing whitespace or other characters that interfere with parallel's parsing (quote characters, parentheses, etc.). In that case, try using a file as the input instead:
Troubleshooting Solution - Create a File as the Input to parallel
I believe this should work. To avoid the fuss with quotes, I first created a file listing the names of all the PDFs (paths relative to the current directory):
[g]ls --color=none *.pdf | parallel -q printf '%s'\\n {} > ocrmypdf.list
or
[g]ls --color=none -N *.pdf > ocrmypdf.list
The important thing is that no single quotes are introduced around the filenames written to the .list file; each name should appear literally, like this:
Tritone Substitution sheet music.pdf
not like this:
'Tritone Substitution sheet music.pdf'
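As an alternative that is not part of the original approach, the list can probably also be built with the shell's own printf, which never adds quoting. The sketch below uses a throwaway directory and made-up filenames, then counts lines that start with a single quote as a sanity check.

```shell
# Sketch: build the list with printf instead of ls (demo files are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 'Tritone Substitution sheet music.pdf' 'plain.pdf'

# The shell expands *.pdf; printf writes each name verbatim, one per line.
printf '%s\n' *.pdf > ocrmypdf.list

# Count lines beginning with a single quote; 0 means the names are literal.
quoted=$(grep -c "^'" ocrmypdf.list || true)
echo "quoted lines: $quoted"
```

The same grep check can be run against a list produced by either of the ls commands above.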
Then you can run the parallel ocrmypdf command, like so:
parallel -j 2 ocrmypdf --sidecar txt/{.}.txt {} out/{} :::: ocrmypdf.list
Also notice the four colons (::::) instead of the usual three (:::), because parallel is reading its arguments from a file. By default, each line of the file becomes one full filename argument, so there are no worries if the PDF filenames in the input file contain spaces and the like.
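The one-argument-per-line behaviour can be pictured with a plain shell loop. This is only a sequential sketch with a made-up list file: it prints the commands rather than running ocrmypdf, and it does not reproduce parallel's job scheduling.

```shell
# Sketch: treat each line of a list file as one argument (demo data is made up).
list=$(mktemp)
printf '%s\n' 'Tritone Substitution sheet music.pdf' 'plain.pdf' > "$list"

count=0
while IFS= read -r pdf; do
  count=$((count + 1))
  # Print the command that would run; the name stays whole despite its spaces.
  printf 'ocrmypdf --sidecar "txt/%s.txt" "%s" "out/%s"\n' "${pdf%.*}" "$pdf" "$pdf"
done < "$list"

echo "arguments read: $count"
rm -f "$list"
```

Two lines in the file give two arguments, regardless of how many words each filename contains.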