0

I have a batch file that processes scanned PDFs using ghostscript. One of the user prompts is for the resolution of the desired output. I wrote a crude autodetect routine like this:

for /f "delims=" %%a in ('findstr /C:"/Height 1650" %1') do set resdect=150
for /f "delims=" %%a in ('findstr /C:"/Height 3300" %1') do set resdect=300
for /f "delims=" %%a in ('findstr /C:"/Height 6600" %1') do set resdect=600
echo %resdect% DPI detected.

%1 is the filename passed to the batch script.

This should return the highest resolution detected among some common sizes we see. My question to the community is: is there a faster or more efficient way to do this than searching the file multiple times?

MattD
  • 1
    1. it's `%%a` but not `%%aa`. 2. write `"%~1"` instead of `%1`. 3. `resdect` is the `/Height` value divided by `11`, right? – aschipfl Apr 05 '18 at 19:32
  • @aschipfl - `"%~1"` is not needed - `%1` will simply preserve any quotes that may or may not be there. If the file path contains spaces or poison characters, then the value will already be quoted, so it should work. If there is no space or poison character, then it works either way, with or without quotes. – dbenham Apr 05 '18 at 20:10
  • @aschipfl the %%aa was a typo (I manually transcribed the batch from a different machine). Edited code above – MattD Apr 05 '18 at 20:32
  • @dbenham, there might be cases where `%1` and `"%~1"` differ: if a file `foo&bar.ext` is provided as an unquoted argument, hence `foo^&bar.ext`, the `&` is going to appear unquoted when using `%1`; that is why I recommended `"%~1"`; I have to admit it's a constructed case though... – aschipfl Apr 05 '18 at 23:24

4 Answers

4

Assuming that the value of RESDECT is the /Height value divided by 11, and that no line contains more than one /Height token, the following code might work for you:

@echo off
for /F delims^=^ eol^= %%A in ('findstr /R /I /C:"/Height  *[0-9][0-9]*" "%~1"') do (
    set "LINE=%%A"
    setlocal EnableDelayedExpansion
    set "RESDECT=!LINE:*/Height =!"
    set /A "RESDECT/=11"
    echo/!RESDECT!
    endlocal
)

If you only want to match the dedicated /Height values 1650, 3300, 6600, you could use this:

@echo off
for /F delims^=^ eol^= %%A in ('findstr /I /C:"/Height 1650" /C:"/Height 3300" /C:"/Height 6600" "%~1"') do (
    set "LINE=%%A"
    setlocal EnableDelayedExpansion
    set "RESDECT=!LINE:*/Height =!"
    set /A "RESDECT/=11"
    echo/!RESDECT!
    endlocal
)

To gather the greatest /Height value appearing in the file, you can use this script, respecting the aforementioned assumptions:

@echo off
set "RESDECT=0"
for /F delims^=^ eol^= %%A in ('findstr /R /I /C:"/Height  *[0-9][0-9]*" "%~1"') do (
    set "LINE=%%A"
    setlocal EnableDelayedExpansion
    set "HEIGHT=!LINE:*/Height =!"
    for /F %%B in ('set /A HEIGHT/11') do (
        if %%B gtr !RESDECT! (endlocal & set "RESDECT=%%B") else endlocal
    )
)
echo %RESDECT%

Of course you can again exchange the findstr command line like above.


Here is another approach to get the greatest /Height value, using (pseudo-)arrays, which might be faster than the above method, because there are no extra cmd instances created in the loop:

@echo off
setlocal
set "RESDECT=0"
for /F delims^=^ eol^= %%A in ('findstr /R /I /C:"/Height  *[0-9][0-9]*" "%~1"') do (
    set "LINE=%%A"
    setlocal EnableDelayedExpansion
    set "HEIGHT=!LINE:*/Height =!"
    set /A "HEIGHT+=0, RES=HEIGHT/11" & set "HEIGHT=0000000000!HEIGHT!"
    for /F %%B in ("$RESOLUTIONS[!HEIGHT:~-10!]=!RES!") do endlocal & set "%%B"
)
for /F "tokens=2 delims==" %%B in ('set $RESOLUTIONS[') do set "RESDECT=%%B"
echo %RESDECT%
endlocal

First, all heights and their related resolutions are collected in an array called $RESOLUTIONS[], where the /Height values are used as indexes and the resolutions are the values. The heights are left-padded with zeros to a fixed number of digits, so that set $RESOLUTIONS[ returns them in ascending order. The second for /F loop then ends on the last array element, whose value is the greatest resolution.
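
The effect of the zero-padding can be seen in isolation with a minimal sketch. The variable names $RAW[ and $PAD[ and the height 13200 are made up for illustration; the sketch assumes, as the script above does, that set lists matching variables in text order:

```batch
@echo off
setlocal
rem Hypothetical names and values, for illustration only.
rem Unpadded indexes: names are compared as text, so 13200 is listed before 1650.
set "$RAW[1650]=150"
set "$RAW[13200]=1200"
set $RAW[
rem Indexes padded to 10 digits: text order now matches numeric order.
set "$PAD[0000001650]=150"
set "$PAD[0000013200]=1200"
set $PAD[
endlocal
```

Without padding, a loop that keeps the last listed element would end on the entry for 1650 rather than the one for the greatest height.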

I do have to admit that this was inspired by Aacini's nice answer.

aschipfl
  • This will output every Height. According to the question, just the highest one is desired. As it is not certain whether there are one or more occurrences, or whether or how they are sorted, `findstr` with multiple strings needs some postprocessing. – Stephan Apr 05 '18 at 20:08
  • You're right, @Stephan, obviously I didn't read carefully enough; see my updated answer... – aschipfl Apr 05 '18 at 23:50
  • I am always amazed by what can be accomplished with a batch file! I couldn't believe it until I saw the data, but for a complex PDF (color scans, lots of 1-bit overlays on 8-bit background), your 1st method is the fastest @ 5.9s, your 4th method (modified to look for specific heights) is the second fastest @ 6.5s, followed by 2nd method@ 14s. My code that I assumed to be slow clocked in at 1.6s (checking for 4 resolutions) – MattD Apr 06 '18 at 14:31
2

Get the corresponding line into a variable and work with that instead of the whole file. Instead of your three for loops, you can use just one if you change the logic a bit:

@echo off
setlocal enabledelayedexpansion
for /f "delims=" %%a in ('findstr /C:"/Height " %1') do (
  set "line=%%a"
  set "line=!line:*/Height =!"
  for /f "delims=/ " %%b in ("!line!") do set "hval=!hval! %%b" 
)
for %%a in (1650,3300,6600) do @(
  echo " %hval% " | find " %%a " >nul && set /a resdect=%%a/11
)
echo %resdect% DPI detected.
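
How the two extraction steps in the first loop behave can be checked on a single line. This is only a sketch: the sample line is hypothetical, modeled on the PDF excerpt quoted in the comments, and it deliberately avoids characters such as `<<` that cmd treats specially:

```batch
@echo off
setlocal EnableDelayedExpansion
rem Hypothetical sample line with no space between the number and the next key.
set "line=/Image/Width 208/Height 308/BitsPerComponent 1"
rem Drop everything up to and including the first "/Height ".
set "line=!line:*/Height =!"
rem Split on "/" and space; the first token is the bare number.
for /f "delims=/ " %%b in ("!line!") do echo Height: %%b
endlocal
```

Because only the first token is kept, whatever follows the number (a space, a `/`, or the end of the line) does not end up in `hval`.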

A solution with jrepl.bat could look something like:

for /f %a in ('type t.txt^|find "/Height "^|jrepl ".*/Height ([0-9]{4}).*" "$1"^|sort') do set /a dpi=%a / 11

(This assumes all valid Heights have 4 digits.)
Note: for use in batch files, use %%a instead of %a.
I have barely scratched the surface of jrepl, so I'm quite sure there is a much more elegant (and probably faster) solution.

Stephan
  • I'm afraid this processes only the *last* line containing a `/Height` token, we don't know how many may occur though; anyway, I'd change to `find "/Height %%a"` in order not to match something like `/Width 1650`... – aschipfl Apr 05 '18 at 23:55
  • @aschipfl: you are (were) completely right. Shouldn't code late at night... Corrected. – Stephan Apr 06 '18 at 05:51
  • I see... ;-) Alright, so you process the entire file now (given it is not bigger than 8 KiB), but still something like `/Width 6600 /Height 3300` would result in `6600` (`resdect=600`) erroneously... – aschipfl Apr 06 '18 at 11:51
  • @Stephan Your code so far clocked the fastest by a large margin with a simple (125 pages, all @300 dpi) test file, but I get a `The input line is too long. The syntax of the command is incorrect.` error when trying a more complex file (60 pages scanned in color, which usually results in a base image for each page and a number of overlay images where the PDF encoder tries to overlay a 1-bit image). In this file, the line returned by findstr is over 160 characters long in some places – MattD Apr 06 '18 at 14:10
  • On another test file I got an error that said `<< was unexpected at this time` I know the << appears in some of the PDF tags so we need to escape or quote it somehow. On a test file that your code sample worked on, it clocked in a 0.29s compared to 1.05s for the next fastest – MattD Apr 06 '18 at 14:40
  • @aschipfl: `:/` corrected. MattD: there are various limits on line length/string length in `cmd`. I cut the lines down to the minimum for concatenation (`hval`), but when the *input line* is too long, there is nothing we can do (in pure batch). Some preprocessing with [jrepl.bat](http://www.dostips.com/forum/viewtopic.php?f=3&t=6044) could help. (Would you mind checking the speed again? I wonder how much speed the `set` commands will "eat".) – Stephan Apr 06 '18 at 15:06
  • Looks like what is happening is that all the data after `/Height xyz` is still part of `line` so `hval` grows massively so by the time you try to pipe `hval` into find, it has grown too big. In this test file the initial line is ~160 characters long and doesn't have spaces between parameters: `[...] /Image/Width 208/Height 308/BitsPerComponent 1/ImageMasktrue/Filter/CCITTFaxDecode [...]` the `/BitsPerComponent ` part is filling up `hval` – MattD Apr 06 '18 at 16:56
  • excerpt of `hval`: `1650/BitsPerComponent 160/BitsPerComponent 2492/BitsPerComponent 1650/BitsPerComponent 1284/BitsPerComponent `. 'hval' was over 8000 characters after finishing the first for loop. I could filter on known useful height values which should drastically reduce the number of findstr hits – MattD Apr 06 '18 at 17:00
  • 1
    Ah - I assumed, there would be a space after the number, but obviously, there is a `/` (is it reliable?) The `for /f %%b` loop takes the first token, so adding `delims=/` should solve it. – Stephan Apr 06 '18 at 17:02
  • If there is a space between `/Height xxx` and the next term (typically `/BitsPerComponent`) then the code works perfectly – MattD Apr 06 '18 at 17:03
  • Try `findstr /C:"/Height 1650" /C:"/Height 3300" /C:"/Height 6600"` with the first `for` – Stephan Apr 06 '18 at 17:04
  • After `/Height xxx` I have seen either an EOL, space, or `/`. Filtering on common heights drastically reduces the size of `hval` as expected-enough that appending a /BitsPerComponent doesn't overflow the cmd line for the sizes of files I have tested – MattD Apr 06 '18 at 17:24
  • Processed a 773-page scanned color PDF (156 MB) in about 0.7s with your code vs 1.3s for mine. Marked as answer, thanks for the help! – MattD Apr 06 '18 at 18:16
2

You may directly convert the Height value into the highest resolution in a single operation using an array. However, to do that we need to know the format of the line that contains the Height value. In the code below I assumed that the format of such a line is /Height xxxx, that is, that the height is the second token in the line. If this is not true, just adjust the "tokens=2" value in the for /F command.

EDIT: Code modified as requested in comments

In this modified code the Height value may appear anywhere in the line.

@echo off
setlocal EnableDelayedExpansion

rem Initialize "resDect" array
for %%a in ("1650=150" "3300=300" "6600=600") do (
   for /F "tokens=1,2 delims==" %%b in (%%a) do (
      set "resDect[%%b]=%%c"
   )
)

set "highResDect=0"
for /F "delims=" %%a in ('findstr "/Height" %1') do (
   set "line=%%a"
   set "line=!line:*/Height =!"
   for /F %%b in ("!line!") do set /A "thisRectDect=resDect[%%b]"
   if !thisRectDect! gtr !highResDect! set "highResDect=!thisRectDect!"
)

echo %highResDect% DPI detected.
Aacini
  • Unfortunately, I can't make any assumptions about the line containing height. In one file you might have `/Height 3300` on its own line, other times you might see something like `6 0 obj<< /Type /XObject /Subtype /Image /Name /Obj4 /Width 2550 /Height 3300 /ColorSpace /DeviceGray /BitsPerComponent 1 [...]`. It all depends on what scanner was used. Open a few PDFs with images in a text editor to see what I mean – MattD Apr 06 '18 at 11:16
0

For the record, the final code was:

setlocal enabledelayedexpansion
set resdetc=0
for /f "delims=" %%a in ('findstr /C:"/Height " %1') do (
  set "line=%%a"
  set "line=!line:*/Height =!"
  for /f "delims=/ " %%b in ("!line!") do set "hval=!hval! %%b" 
)
for %%a in (1650,3300,6600) do @(
  echo " %hval% " | find " %%a " >nul && set /a resdetc=%%a/11
)
if %resdetc%==0   SET resDefault=3
if %resdetc%==150 SET resDefault=1
if %resdetc%==300 SET resDefault=3
if %resdetc%==600 SET resDefault=6

ECHO.
ECHO Choose your resolution
ECHO ----------------------
ECHO 1. 150    4. 400
ECHO 2. 200    5. 500
ECHO 3. 300    6. 600
ECHO.
IF NOT %RESDETC%==0 ECHO 7. Custom    (%resdetc% DPI input detected)
IF     %RESDETC%==0 ECHO 7. Custom
ECHO ----------------------
choice /c 1234567 /T 3 /D %resDefault% /N /M "Enter 1-7 (defaults to %resDefault% after 3 sec.): "
IF errorlevel==7 goto choice7
IF errorlevel==6 set reschoice=600 & goto convert
IF errorlevel==5 set reschoice=500 & goto convert
[...]

Thanks everyone for the help!

MattD