Think of strings, such as:

I have two apples
He has 4 apples 
They have 10 pizzas

I would like to replace every number written in digits that I find in a string with a different value, calculated by an external script. In my case, the Python program digit_to_word.py converts a number from digits to its spelled-out form, but any script will do, as long as I can understand the process.

Expected output:

I have two apples
He has four apples 
They have ten pizzas

Conceptually:

echo "He has four apples" |
while read word;
do
    if [[ "$word" == +([0-9+]) ]]; then
    NUM='${python digit_to_word.py "$word"}'
    $word="$NUM"
fi
done |
other_operation... | etc..

I say conceptually because I did not get even close to making it work. It is hard for me to even find information on the issue, simply because I do not know exactly how to conceptualize it. At this point, I am mostly thinking about process substitution, but I am afraid it is not the best way.

Any hint would be really useful. Thanks in advance for sharing your knowledge with me!

Worice
  • `NUM=$(python digit_to_word.py "$word"); word=$NUM`. Parameter expansions or command substitutions aren't honored in single quotes, and `${...}` is parameter expansion syntax (for shell variables) whereas you needed a command substitution (for external commands). And you can't have a `$` on the left-hand side of an assignment. And setting variables in pipelines generally doesn't do anything useful, because in general (exceptions exist but are out-of-scope of this comment), pipelines are run in subshells which shut down as soon as the pipeline completes. – Charles Duffy Jul 10 '19 at 15:55
  • Also, `+([0-9+])` matches `4`, but it doesn't match `four`, so how is that logic supposed to decide to call `digit_to_word` when given `four apples` as input? – Charles Duffy Jul 10 '19 at 15:57
  • That is correct: it should not be called for `four`, only for digit sequences such as `4` or `10`. – Worice Jul 10 '19 at 16:11
  • Gotcha. Still, you aren't replacing the values in the string anywhere in this code. `$pos = value` is valid *awk* code to replace a value in the field whose number is stored in the variable `pos`, but that's awk, not bash. – Charles Duffy Jul 10 '19 at 16:27
  • Also, `cat "He has four apples"` is trying to open a *file with the name* `He has four apples`. – Charles Duffy Jul 10 '19 at 16:28
  • Thanks for your valuable suggestions, I will work on them! I hope I will figure out a solution. – Worice Jul 10 '19 at 16:30
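
Pulling the comments above together, a corrected version of the loop in the question might look like the sketch below. It uses `$(...)` command substitution, assigns without a leading `$`, and prints each rebuilt line instead of relying on variables set inside a pipeline. The shell function `digit_to_word` is a hypothetical stand-in for `python digit_to_word.py "$word"` so that the example runs on its own; it only knows two numbers. The name `replace_digits` is also made up for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for: python digit_to_word.py "$1"
digit_to_word() {
  case $1 in
    4)  echo four ;;
    10) echo ten ;;
    *)  echo "$1" ;;   # unknown numbers pass through unchanged
  esac
}

# Read lines on stdin and write each line with digit-only words replaced.
replace_digits() {
  local line word
  local -a words out
  while IFS= read -r line || [[ -n $line ]]; do
    read -ra words <<<"$line"              # split on whitespace, no glob expansion
    out=()
    for word in "${words[@]}"; do
      if [[ $word =~ ^[0-9]+$ ]]; then     # only words made entirely of digits
        word=$(digit_to_word "$word")      # command substitution, no $ on the left
      fi
      out+=("$word")
    done
    printf '%s\n' "${out[*]}"              # rejoin the words with single spaces
  done
}

# Example input taken from the question.
printf '%s\n' 'I have two apples' 'He has 4 apples' 'They have 10 pizzas' | replace_digits
```

Because the function writes its result to stdout, it can sit in the middle of a pipeline. Note that this simple version collapses runs of whitespace into single spaces, which may or may not matter for your data.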

4 Answers

regex='([[:space:]])([0-9]+)([[:space:]])'

echo "He has 4 apples" |
while IFS= read -r line; do
  line=" ${line} "  # pad with space so first and last words work consistently
  while [[ $line =~ $regex ]]; do       # loop while at least one replacement is pending
    pre_space=${BASH_REMATCH[1]}                # whitespace before the word, if any
    word=${BASH_REMATCH[2]}                     # actual word to replace
    post_space=${BASH_REMATCH[3]}               # whitespace after the word, if any
    replace=$(python digit_to_word.py "$word")  # new word to use
    in=${pre_space}${word}${post_space}         # old word padded with whitespace
    out=${pre_space}${replace}${post_space}     # new word padded with whitespace
    line=${line//$in/$out}                      # replace old w/ new, keeping whitespace
  done
  line=${line#' '}; line=${line%' '}            # remove the padding we added earlier
  printf '%s\n' "$line"                         # write the output line
done

This is careful to work even in some tricky corner cases:

  • With "4 score and 14 years ago" as input, the replacement for 4 only changes the standalone 4 in "4 score" to four and doesn't also modify the 4 inside 14.
  • Input that mixes tabs and spaces generates output with the same kinds of whitespace: use printf '1\t2 3\n' as your input, and you'll get a tab between one and two, but a space between two and three.

See this running at https://ideone.com/SOsuAD

Charles Duffy

I'd suggest this is a better job for perl.

To recreate the scenario:

$ cat digit_to_word.sh
case $1 in
4) echo four;;
8) echo eight;;
10) echo ten;;
*) echo "$1";;
esac
$ bash digit_to_word.sh 10
ten

Then this

perl -pe 's/(\d+)/ chomp($word = qx{bash digit_to_word.sh $1}); $word /ge' <<END
I have two apples
He has 4 apples
They have 10 pizzas but only 8 cookies
END

outputs

I have two apples
He has four apples
They have ten pizzas but only eight cookies

However, you already have some Python, so why not implement the replacement part in Python too?

glenn jackman
  • I don't know perl's `qx` behavior, and am curious -- if the pattern weren't limited to digits only, would we need to worry about command injection attacks? – Charles Duffy Jul 10 '19 at 17:12
  • qx is synonymous with backtick, so yes, if we weren't prefiltering the input with a regex then we'd need to do more work – glenn jackman Jul 10 '19 at 17:21
  • for example: `system "bash", "digit_to_word.sh", $anything` will *not* be interpreted by a shell, so is safer. It also does not conveniently return stdout which is a bummer. – glenn jackman Jul 10 '19 at 17:39
  • Thanks a lot for your answer! For now, I am not considering Perl as an option, but I am afraid that one day it will no longer be optional, so I will dive into it then! – Worice Jul 11 '19 at 06:39

Revision

This approach decomposes each line into two arrays - one for the words and one for the whitespace. Each line is then reconstructed by interleaving the array elements, with digits translated to words by the Python script. Thanks to @Charles Duffy for pointing out some common Bash pitfalls with my original answer.

while IFS= read -r line; do
  # Decompose the line into an array of words delimited by whitespace
  IFS=" " read -ra word_array <<< "$(echo "$line" | sed 's/[[:space:]]/ /g')"

  # Invert the decomposition, creating an array of whitespace delimited by words
  # (the command substitutions are quoted so older bash versions don't word-split them)
  IFS="w" read -ra wspace_array <<< "$(echo "$line" | sed 's/\S/w/g' | tr -s 'w')"

  # Interleave the array elements in the output, translating digits to text
  for ((i=0; i<${#wspace_array[@]}; i++))
  do
    printf "%s" "${wspace_array[$i]}"
    if [[ "${word_array[$i]}" =~ ^[0-9]+$ ]]; then
      printf "%s" "$(digit_to_word.py "${word_array[$i]}")"
    else
      printf "%s" "${word_array[$i]}"
    fi
  done
  printf "\n"
done < sample.txt

cdub
  • See [DontReadLinesWithFor](https://mywiki.wooledge.org/DontReadLinesWithFor) and [BashPitfalls #1](http://mywiki.wooledge.org/BashPitfalls#for_f_in_.24.28ls_.2A.mp3.29), as well as [BashFAQ #1](http://mywiki.wooledge.org/BashFAQ/001) describing the safer practice. – Charles Duffy Jul 10 '19 at 16:35
  • Moreover, `word_array=($line)` falls afoul of [BashPitfalls #50](http://mywiki.wooledge.org/BashPitfalls#hosts.3D.28_.24.28aws_....29_.29). If you have a word of `*`, it'll be replaced with a list of files in the current directory. And you're losing details around the whitespace -- how much is used, whether it's a tab or a space, where the newlines are, etc. – Charles Duffy Jul 10 '19 at 16:37
  • (That's in part because the `for line in` is actually assigning not lines but individual words to `$line`; see the links for more details on why). – Charles Duffy Jul 10 '19 at 16:37
  • ...also, `printf` format strings should be constant. Use `printf '%s ' "$word"` instead so a double backslash doesn't change to a single one, `\t` doesn't change to a tab, `%%` doesn't change to `%`, etc. – Charles Duffy Jul 10 '19 at 16:38
  • Wow. Great tips here. Thanks! I'll plan to revise my answer with these pitfalls in mind. – cdub Jul 10 '19 at 16:51
  • `while read -r -a word_array; do ...` is better than `for line in ...` + `word_array=($line)` – glenn jackman Jul 10 '19 at 16:53
  • Thanks a lot for your answer! It really gave me useful insight on the mechanisms I should consider. – Worice Jul 11 '19 at 06:37
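
To make two of the pitfalls from the comments above concrete, here is a small sketch. The variable names and sample values are made up for illustration; it only contrasts a data-driven `printf` format with a constant one, and shows `read -ra` splitting a line without glob expansion.

```shell
#!/usr/bin/env bash

word='50%% off\tnow'             # sample data containing printf metacharacters

# Data used as the format string: %% and \t are interpreted by printf.
unsafe=$(printf "$word")
# Constant format string: the data is printed exactly as stored.
safe=$(printf '%s' "$word")

line='2 * 3'
# read -ra splits on whitespace without glob-expanding the "*",
# whereas words=($line) would replace it with the files in the current directory.
read -ra words <<<"$line"

printf 'unsafe: %s\n' "$unsafe"
printf 'safe:   %s\n' "$safe"
printf 'number of words: %s\n' "${#words[@]}"
```

The first `printf` silently rewrites the data, which is why keeping the format string constant and passing the data as an argument is the safer habit.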

You could use sed for this. Here's an example:

$ echo "He has 4 apples" | sed 's/4/four/'
He has four apples

Looking at the example data though, sed might not be a good fit. If you see "1", you want to replace it with "one", but your example replaced "10" with "ten". Do you need to support multi-digit numbers, such as replacing "230" with "two hundred and thirty"?

Kaan
  • My intention is that whenever a sequence of digits is encountered, the Python script transforms it. – Worice Jul 10 '19 at 16:09