1

Goal:

I have a sneaking suspicion that I'm globbing incorrectly due to not being able to find a satisfactory explanation with multiple clear examples of advanced string-and-var mixing.

The operation I am trying to perform is on the last line, and the goal is to output the outputdirectory + filebasename + outputextension. Unfortunately, there are too many variables, and despite reading multiple manuals, I feel certain I am making mistakes.

#!/bin/bash

echo Input directory name like ./path/to: 
read -r varin

echo Output directory name like ./path/to: 
read -r varout

if [ ! -d "${varout}" ]; then
  mkdir -p "${varout}";
fi

for file in ${varin}; do pconvert -i "${file}" -o "${varout}"/"${file%%.*}".txt; done

error: File './inputs/outputs/*/.txt' already exists. Overwrite ? [y/N] ^C

Unexpected behavior:

  1. I have to write ./inputs/* instead of ./inputs, and this is unexpected. I expected bash to look for a directory then loop through the files in that directory: this is fine, but it shows that I am not comprehending the code.
  2. Presuming I type ./inputs/outputs/*, this script tries to create ./inputs/outputs/*.txt on each iteration rather than ./inputs/outputs/inputname.txt. The goal in the last operation on line 15 is to scrub the directory, scrub the extension, and use the new path + basename + newextension. Kind of the blind leading the blind, but I feel like this can only have something to do with my use of quotation marks?

Resources I've used:

According to this link, I should probably do something like this:

convertdoc -i "$'{file}'" --pdfconvert -o "$'{outputDir}'/$'{file%%.*}'.odf

But I am getting mixed opinions from friends. So far, I've been told to use no trailing quote, to use only semiquotes, to use quotes both before and after the dollar sign, and to pipe down, to mention a few.

Sample inputs:

$HOME/pdfdl/ardvarks.pdf
$HOME/pdfdl/ants.pdf
$HOME/pdfdl/canines.pdf
$HOME/pdfdl/cats.tmp.pdf
John Kugelman
Wolfpack'08
  • 1
    cut-n-paste your code (along with shebang) into [shellcheck.net](https://www.shellcheck.net/) and make the suggested changes; first (glaring) issue ... `for file in "$varin;"` should be `for file in "$varin";` (the semicolon is `bash` syntax and as currently coded is actually treated as part of the variable reference) – markp-fuso Nov 25 '22 at 15:33
  • 1
    runtime issues may occur based on actual input values, so providing some sample inputs that cause issues would likely also help – markp-fuso Nov 25 '22 at 15:35
  • 1
    `for file in "$varin"; do ...; done` is the same as `file=$varin; ...;` – Fravadona Nov 25 '22 at 15:35
  • 1
    You probably want `for file in "$varin"/*`. Also the variable `varout` isn't used anywhere and the variables `outdir` and `outputDir` aren't initialized. – M. Nejat Aydin Nov 25 '22 at 15:36
  • 1
    also consider enabling debug mode (`set -x`; `set +x` to disable) at top of script, run script, and review the debug output to see what `bash` is doing; if the intention is to read through a list of files in the `$varin` directory you probably want something like `for file in "$varin"/*; do ...; done` – markp-fuso Nov 25 '22 at 15:37
  • @M.NejatAydin Thank you. This would presumably prevent me from being required to type /* at runtime...? – Wolfpack'08 Nov 27 '22 at 02:38

2 Answers

0

Your script has a few defects. The "for" statement is not doing what you think it is: you haven't given it a pattern to match and expand, so the list has only one item, the value of varin (the directory itself), not the actual PDF files.
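A minimal sketch of that difference, assuming a throwaway directory created with mktemp (the directory and its two file names are made up for illustration and are not part of the original script):

```shell
#!/bin/bash
# Hypothetical demo directory standing in for the input directory.
varin=$(mktemp -d)
touch "$varin/ants.pdf" "$varin/canines.pdf"

# Iterates exactly once: the only word in the list is the directory itself.
count_dir=0
for file in "$varin"; do
  count_dir=$((count_dir + 1))
done

# Iterates once per directory entry: the unquoted /* is expanded by the shell.
count_files=0
for file in "$varin"/*; do
  count_files=$((count_files + 1))
done

echo "directory only: $count_dir iteration(s); with /*: $count_files iteration(s)"
rm -r "$varin"
```

Note that the variable stays quoted while the `/*` sits outside the quotes, so directory names containing spaces still work.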

It isn't completely clear from your question what you are trying to convert, but the list of input filenames clarified that.

I try to use basic tools for Linux, so I use "pdftotext" instead of the two you mentioned above.

As for the "${file%%.*}", I prefer to make explicit some actions that implicit forms make too "arcane" for beginners/reviewing. I prefer to see the actual flow of how things are transformed, hence the use of basename in my version of your script below.

#!/bin/sh

START=`pwd`

echo "Input directory name (./path/to)  => \c"
#read -r varin
varin=${START}/TESTin

echo "Output directory name (./path/to)  => \c"
#read -r varout
varout=${START}/TESTout

if [ ! -d "${varout}" ]; then
  mkdir -p "${varout}";
fi

cd "${varin}"
if [ $? -ne 0 ] ; then  echo "\n Unable to set '${varin}' as work directory for input file scanning.\n" ; exit 1 ; fi

for file in *.pdf
do
    #pdfconvert -i "${file}" -o "${varout}"/"${file%%.*}".txt
    BASE=`basename "${file}" ".pdf" `
    #pdftotext -eol unix -nopgbrk "${file}" "${varout}/${file%%.*}.txt"
    pdftotext -eol unix -nopgbrk "${file}" "${varout}/${BASE}.txt" 2>>errlog
done
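As a small sketch of what the basename step in the script above does, compared with the equivalent parameter expansion. The file name is taken from the question's sample inputs; the variable names `base_cmd` and `base_exp` are only illustrative:

```shell
#!/bin/sh
file=cats.tmp.pdf   # sample name from the question's inputs

# basename strips any leading directory and, if given, one trailing suffix.
base_cmd=$(basename "$file" .pdf)

# The equivalent parameter expansion when the suffix is known in advance.
base_exp=${file%.pdf}

echo "basename:  $base_cmd"
echo "expansion: $base_exp"
```

Both forms remove only the final `.pdf`, so the `.tmp` part of the name is kept.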
Eric Marceau
  • Please check your script with shellcheck. Quote variable expansion `cd "${varin}"`. No need to check whether a directory exists before `mkdir -p` - it already does that. Do not use `[ $? -ne 0 ]` - prefer `if ! cd "$varin"; then`. Do not use backticks - use `$(...)`. – KamilCuk Nov 27 '22 at 08:31
  • @KamilCuk So with `$(...)` would it be `START=$(pwd)`? – Wolfpack'08 Nov 27 '22 at 16:30
  • I tried using both the version with and without edits suggested by KamilCuk: despite having created and populated TESTin, I get some error messages like `line 18: cd: ./TESTin: No such file or directory \n Unable to set './TESTin' as work directory for input file scanning.\n`. – Wolfpack'08 Nov 27 '22 at 16:34
  • `would it be START=$(pwd)` Yes, or just `START=$PWD`, or really just use `$PWD`. `I get some error messages like` Well, that suggests that the directory `./TESTin` doesn't exist. – KamilCuk Nov 27 '22 at 18:12
  • @KamilCuk, I explained why I do things the way I did them. They are not the most compact, or CPU efficient, but they make things explicit/visible to my taste ... and there is nothing wrong with the way they work. Yes, I should have quoted the "${varin}". That was an oversight. – Eric Marceau Nov 27 '22 at 18:44
  • With `set -x` this returns: `++ date '+DATE: %b '` and nothing else. Non-working. – Wolfpack'08 Nov 28 '22 at 04:46
  • @Wolfpack'08, not sure how that relates to your original question. Also what you are trying to say is not clear. Could you explain, or offer full script in your original question? – Eric Marceau Nov 28 '22 at 21:09
  • @EricMarceau I made some edits. It's just very difficult to understand all the various ways strings are placed with and without quotes because different sources give wildly different info. It looks like the for stuff should be `for file in "$files"/*; do`. I've gotten a variety of errors, and it seems primarily that shellcheck is saying to use single quotes then to use $(...), neither of which loop through files. The .pdf thing presumes the extension: the script should take some other extensions such as .jpg-to-txt, being able to manage a variety of extensions. – Wolfpack'08 Nov 29 '22 at 04:29
0

Consider using arguments. I would do:

#!/bin/bash
varin=$1    # input directory
varout=$2   # output directory

mkdir -p "$varout"
for file in "$varin"/*; do
      # https://stackoverflow.com/questions/965053/extract-filename-and-extension-in-bash
      filename="${file##*/}"                  # strip the directory part
      filename_without_ext="${filename%.*}"   # strip the last extension only

      pconvert -i "$file" -o "$varout/$filename_without_ext".txt
done

And then do:

./script.sh /input /output

I have to write ./inputs/* instead of ./inputs, and this is unexpected. I expected bash to look for a directory then loop through the files in that directory

I do not understand your confusion. * expands to the list of entries inside a directory. If you type ./inputs, that is just ./inputs; when you type ./inputs/*, the unquoted ${varin} in the for loop expands to the list of files. I would find it unexpected if both meant the same thing.
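To make that concrete, here is a hedged sketch of how a pattern stored in a variable behaves when it is quoted versus unquoted. The temporary directory and file names are assumptions for the demo, and the unquoted form is also subject to word splitting, so it breaks on paths that contain spaces:

```shell
#!/bin/bash
# Hypothetical demo directory standing in for ./inputs.
dir=$(mktemp -d)
touch "$dir/ants.pdf" "$dir/cats.tmp.pdf"

varin="$dir/*"   # stored literally; the assignment does not expand the pattern

# Quoted: no filename expansion, so the loop sees the literal pattern once.
quoted_count=0
for file in "${varin}"; do
  quoted_count=$((quoted_count + 1))
  quoted_last=$file
done

# Unquoted: the shell expands the pattern into the matching paths.
unquoted_count=0
for file in ${varin}; do
  unquoted_count=$((unquoted_count + 1))
done

echo "quoted: $quoted_count word(s), last value: $quoted_last"
echo "unquoted: $unquoted_count word(s)"
rm -r "$dir"
```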

Additionally, ${file%%.*} is wrong when the path contains another dot. It removes the longest suffix that matches .*. When file=./anything/file.txt, echo "${file%%.*}" outputs an empty string: because the value starts with a dot, .* matches everything.
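The following sketch compares the relevant expansions on one path. The relative path is an assumption modeled on the question's sample inputs, and the variable names are only illustrative:

```shell
#!/bin/bash
file=./pdfdl/cats.tmp.pdf   # assumed sample path

longest=${file%%.*}    # longest suffix matching .* removed: empty, the path starts with "."
shortest=${file%.*}    # shortest suffix matching .* removed: ./pdfdl/cats.tmp
filename=${file##*/}   # longest prefix matching */ removed: cats.tmp.pdf
stem=${filename%.*}    # last extension removed from the bare name: cats.tmp

printf '%s\n' "longest=[$longest]" "shortest=[$shortest]" \
              "filename=[$filename]" "stem=[$stem]"
```

Stripping the directory first and then the last extension gives a name that can be safely joined onto a different output directory.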

Presuming I type ./inputs/outputs/*, this script tries to create ./inputs/outputs/*.txt

No, the error message suggests it tries to create ./inputs/outputs/*/.txt.

I do not understand how you would want the output to expand a glob expression. As you stated, "The goal in the ... use the new path", not multiple new paths, which * would expand to.

According to this link, I should probably do something like this: convertdoc -i "$'{file}'" --pdfconvert -o "$'{outputDir}'/$'{file%%.*}'.odf

A quoting style "$'{something}'" was never used in that link. Consider re-reading it.

Wolfpack'08
KamilCuk
  • 1
    (1) I do not understand; debug your script with `set -x`. (2) I do not understand where exactly it is appended. If you are asking about the `"$varout/$filename_without_ext".txt` line, then your requirement states `use the new path + basename + newextension` - it's "newextension", not the same extension. – KamilCuk Nov 27 '22 at 18:11
  • @KamilCuck thank you. What I'm wondering is, isn't there a redundancy from script.sh ./arg1 ./arg2/files.txt (here .txt) and pconvert -i "$file" -o "$varout/$filename_without_ext".txt (here .txt)? – Wolfpack'08 Nov 28 '22 at 04:41
  • I tried removing all options but -i and -o, and I get "file does not exist /ins/outs". `outs` is an empty directory that does exist. I don't know why I always get strange behavior. It just says "./ins/outs" is not a directory, as if it refuses to tack the basename and extension on, despite running the code as listed and trying to troubleshoot it. – Wolfpack'08 Nov 28 '22 at 05:05