
I've created a bash script to find dollar words. For those of you who don't know, a dollar word is a word whose letters' values add up to 100 when A is given a value of 1, B a value of 2, C a value of 3, and so on up to Z with a value of 26. For example, "abettors" is a dollar word: 1 + 2 + 5 + 20 + 20 + 15 + 18 + 19 = 100.

I am new to programming, so I've created a very crude script that does this kind of thing, but it's not running as fast as I would expect. Something in my code is slowing it down, but I don't know what. This is my code:

#!/bin/bash

#370101 total words in Words.txt

line=$(cat line.txt)

function wordcheck {
   letter=({a..z})
   i=0
   while [ "$i" -le 25 ]
   do
      occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

      ((i++))
   done
   ((line++))
}

until [ "$line" -ge "370102" ]
do

   word=$(sed -n "$line"p Words.txt)
   wordcheck

   echo "$line" > line.txt

   x=0

   while [ "$x" -le '25' ]
   do
      y=$((x+1))
      charsum["$x"]=$((${occurences[x]} * $y))
      ((x++))
   done

   wordsum=0

   for n in ${charsum[@]}
   do
      (( wordsum += n ))
   done

   tput el

   printf "Word #"
   printf "$(($line - 1))"

   if [ "$wordsum" = '100' ]
      then
         echo $word >> DollarWords.txt
         printf "\n\n"
         printf "$word\n"
         printf '$$$DOLLAR WORD$$$\n\n'
      else
         printf "            Not A Dollar Word            $word\n"
         tput cuu1
   fi
done

I can only speculate that it has something to do with the while loops, or with how it is constantly writing the value of $line to a file.

I've created a script before that adds numbers to generate the Fibonacci sequence, and it does it almost instantaneously.
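That script was nothing more than builtin arithmetic in a loop, roughly like this sketch (a reconstruction, not the exact script):

a=0 b=1
for (( i = 0; i < 20; i++ )); do
    echo "$a"
    (( c = a + b, a = b, b = c ))   # all shell builtins, no subprocesses spawned
done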

So my question is, what are some ways to help my code run more efficiently? Apologies if this belongs on codereview.

Any help is highly appreciated.

Thanks

Edit:

Although I accepted Gordon Davisson's answer, the other ones are just as good if you want to do this; I'd recommend reading everyone else's answers before giving this a try. Also, as numerous users have pointed out, bash is not a good language to be using for this. Again, thanks to everyone for your suggestions.

Erik
    There are two main problems. One is that you are spawning ~100 new processes for each word from using command substitution and pipelines. The other is that you're reading the file from top to bottom once for every word. You can fix the first by using parameter expansion to extract letters and `printf -v myvariable "%d" "'x"` to find the ascii value of a letter "x" (subtract 96 to get point value), and the second by using a `while read` loop to read line by line. – that other guy Jun 25 '17 at 00:48
  • @thatotherguy Thanks! I tried it out and it works really well. If I were to use ascii, would it be faster than @gordondavisson's answer? Also, I want to be able to resume from where I left off with the `while read` loop. Do I just give `word` the word that was there? – Erik Jun 25 '17 at 02:59
  • If you care about throughput, Bash is really not the language you should be using. I would guess that it would be faster, but these commands are pretty far removed from the underlying operations so it's hard to predict. You can save and continue using `tail -n +1234 words.txt | while read ..` to continue from line 1234, but a good solution (maybe not in Bash) would churn through the 400,000 lines in 5 seconds – that other guy Jun 25 '17 at 03:21
  • @thatotherguy I wanted to try this in python as well, do you think it's a good language for this sort of thing? What language would you recommend? – Erik Jun 25 '17 at 03:25
  • Python would be an excellent choice for this. – that other guy Jun 25 '17 at 03:49
  • @thatotherguy Thank you so much for your advice! – Erik Jun 25 '17 at 03:51
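To make the ASCII suggestion from the comments concrete, here is a minimal sketch of that scoring approach (the sample word is just an illustration; it happens to be one of the dollar words found in the answers below):

word=abettors
sum=0
for (( i = 0; i < ${#word}; i++ )); do
    printf -v code "%d" "'${word:i:1}"   # a leading quote makes printf %d emit the character's ASCII code
    (( sum += code - 96 ))               # "a" is ASCII 97, so subtracting 96 gives its letter value
done
echo "$sum"                              # prints 100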

5 Answers


Given:

$ wc -l words.txt
370101 words.txt

(i.e., the 370,101-word file linked from the question's comments)

In Bash alone, start with a loop that reads the file line by line:

c=0
while IFS= read -r word; do
    (( c+=1 ))
done <words.txt
echo "$c"
# prints 370101

Counting the lines alone in Bash (same file) takes over 7.8 seconds on my computer; wc, in comparison, executes in microseconds. So the Bash version is going to take a while.

Once you have the file word by word, you can read each word character by character and find the index of that character in a string of the alphabet:

lcl=' abcdefghijklmnopqrstuvwxyz'
ucl=' ABCDEFGHIJKLMNOPQRSTUVWXYZ'

while IFS= read -r word; do
    ws=0    
    for (( i=0; i<${#word}; i++ )); do  
        ch=${word:i:1}
        if [[ "$ch" == [a-z] ]]; then 
            x="${lcl%%$ch*}" 
            (( ws += "${#x}" ))
        elif [[ "$ch" == [A-Z] ]]; then
            x="${ucl%%$ch*}"    
            (( ws += "${#x}" ))
        fi  
    done
    if (( ws==100 )); then 
        echo "$word"
    fi          
done <words.txt

Prints:

abactinally
abatements
abbreviatable
abettors
abomasusi
abreption
...
zincifies
zinkify
zithern
zoogleas
zorgite

That takes about 1:55 on the 370,101 word file.

As a comparison, consider the same function in Python (Python 2, as the string.lowercase attribute and the print statement indicate):

import string 

lets={k:v for v,k in enumerate(string.lowercase, 1)}
lets.update({k:v for v,k in enumerate(string.uppercase, 1)})

with open('/tmp/words.txt') as f:
    for word in f:
        word=word.strip()
        if sum(lets.get(c,0) for c in word)==100:
            print word

Far easier to understand and executes in 580 ms.

Bash is great for gluing together different tools. It is not that great at large processing tasks. Use awk, perl, python, ruby, etc. for larger tasks: they are easier to write, read, and understand, and faster.

dawg
  • Very fast and very simple. I was just working on a python solution for this until I saw yours! Thank you so much for your help. – Erik Jun 26 '17 at 16:57

As @thatotherguy pointed out in a comment, there are two big problems here. First, the way you're reading lines means the entire file gets read once for every line: to read the first line you run sed -n "1"p Words.txt, which reads the whole file and prints only the first line; then you run sed -n "2"p Words.txt, which reads the whole file again and prints only the second line; and so on. To fix this, use a while read loop:

while read word; do
    ...
done <Words.txt

Note that if anything inside the loop tries to read from standard input, it'll steal some of the input from Words.txt. In that case you can send the file over FD #3 instead of standard input, like so:
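while read -u3 word; do
    ...   # anything in here can still read from standard input safely
done 3<Words.txt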

The second problem is this bit:

occurences["$i"]=$(echo $word | grep ${letter["$i"]} -o | wc -l)

...which creates 3 subprocesses (echo, grep, and wc). That isn't too bad by itself, but it runs 26 times for each and every word in the file. Creating processes is expensive compared to most shell operations, so you should do your best to avoid it, especially in loops that run many, many times. Try this instead:

matches="${word//[^${letter[i]}]/}"
occurences[i]="${#matches}"

This works by replacing all characters that aren't ${letter[i]} with "", then looking at the length of the resulting string. The parsing happens entirely in the shell process, so it should be much faster.
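For instance, a quick sanity check at an interactive prompt (the word here is just a hypothetical example, counting its letter "t"):

word=abettors
matches="${word//[^t]/}"   # delete every character that is not "t"
echo "${#matches}"         # prints 2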

Gordon Davisson
  • Thanks for the highly detailed answer! I'll give it a try and see how it goes. – Erik Jun 25 '17 at 02:01
  • I tried it out, but then I realized it doesn't start from where it left off. How can I add this functionality? Apart from that, however, those edits were super effective! – Erik Jun 25 '17 at 03:06

Note: skip down to #3 for a faster method.

  1. One loop, one (long) stream method:

    # make an Associative Array of the 26 letters and values.
    declare -A lval=\($(seq 26 | for i in [{a..z}] ; do read x; echo $i=$x ; done)\)
    # spew out 240,000 words from some man pages.
    man bash csh dash ksh busybox find file sed tr gcc perl python make | 
    tr '[:upper:][ \t]' '[:lower:]\n' | sort -u | 
    while read x ; do 
        [ "$x" = "${x//[^a-z]/}" ] && 
        (( 100 == $(sed 's/./lval[&]+/g' <<< $x) 0 )) && 
        echo "$x"
    done | head
    

    Output to print first 10 words, (about 13 seconds on an Intel Core i3-2330M):

    accumulate
    activates
    addressing
    allmulti
    analysis
    applying
    augments
    backslashes
    bashopts
    boundary
    

    How it works.

    1. Make all the words lower case, then sort uniquely.
    2. If a word contains only lowercase letters, run the test, and maybe print it.
    3. The test uses sed to convert a word (let's say "foo") to bash arithmetic like this:
      (( 100 == lval[f]+lval[o]+lval[o]+0 )), i.e. a list of associative array values to add up; sed leaves a trailing "+", which is why a 0 is appended after the command substitution.
  2. Trick arrayless hexdump method, quite similar to the above, except that the sed part is replaced with:

    (( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 ))
    

    Here hexdump dumps an equation using decimal ascii codes, (see man ascii and the "Examples" in man hexdump), that for the input "foo" would output this:

    (102 - 96) + (111 - 96) + (111 - 96) + ( 10 - 96) +
    

    The - 96 is an offset that maps ASCII codes to letter values ("a" is ASCII 97). But since hexdump even dumps the linefeed (ASCII 10 decimal), which contributes (10 - 96) = -86 to the sum, adding 86 at the end corrects for that.

    The code:

    while read x ; do 
        [ "$x" = "${x//[^a-z]/}" ] && 
        (( 100 == $( hexdump -ve '/1 "(%3i - 96) + " ' <<< $x ;) 86 )) &&
        echo "$x"
    done < words.txt
    

    It runs about 20% faster than the Associative Array method.

  3. Software tools pre-loop method, using paste and single instances of hexdump, sed, tr, and egrep. First make the list (3 seconds) as with markp's answer:

    man bash csh dash ksh busybox find file sed tr gcc perl python make | 
    tr '[:upper:][ \t]' '[:lower:]\n' | sort -u | egrep '^[a-z]+$' > words.txt 
    

    Then paste all the words next to their respective equations (see method #2 above), feed those into a loop, and print the dollar words:

    paste words.txt \
         <(hexdump -ve '/1 "%3i " ' < words.txt | 
           sed 's/ *[^12]10[^0-9] */\n/g;s/^ //;s/ $//' | 
           sed 's/ \+\|$/ + -96 + /g;s/ + $//'
           ) | 
    while read a b ; do (( 100 == $b )) && echo $a ; done
    

    Doing the processing before the loop is a big improvement. It takes about a second to print the entire list of dollar words.

    How it works: the thing needed is for the decdump (i.e. decimal dump) to put each word on a separate line. Since hexdump cannot do that, sed is used to translate all the 10s (i.e. the linefeed codes) into actual linefeeds, and then processing proceeds much like method #2 above.

agc
  • Wow, much shorter than mine. Can it resume from where it last stopped? – Erik Jun 25 '17 at 16:04
  • If it's stopped with `Ctrl-z` from the keyboard, and restarted with the `fg` command, yes. – agc Jun 25 '17 at 16:07
  • Why the `eval`s? I'm having trouble understanding what this code is doing (and thus what potential failure modes it has, particularly wrt security -- ie. whether content in `x` can ever be interpreted as shell constructs). – Charles Duffy Jun 25 '17 at 22:44
  • Good start, but you don't need `eval+echo` and `(( ))` context doesn't need `${ }` and you've already checked all chars are a-z so: `(( $(sed 's/./lval[&]+/g' <<<$word)0 == 100 ))` – dave_thompson_085 Jun 26 '17 at 03:53
  • @dave_thompson_085: Thanks, that's much better, and it runs about 30% faster also. – agc Jun 26 '17 at 04:44
  • @CharlesDuffy, I've replaced the code with *dave_thompson_085*'s improvements, so it's a moot point. I believe it *was* a benign `eval` usage (or kludge) however... – agc Jun 26 '17 at 04:56

Since you're looking at ways to speed up the processing, here's a tweak of the solution provided by user agc.

I've pulled the man/tr/sort out and dumped the results to a file (Words.txt) in order to simulate the original problem where the file already exists (ie, I want to take the man/tr/sort out of the timings):

man bash csh dash ksh busybox find file sed tr gcc perl python make | tr '[:upper:][ \t]' '[:lower:]\n' | sort -u > Words.txt

The gist of this tweak is to replace the eval/sed sub-process invocation with a loop that steps through the characters of a valid word. [See the post - How to perform a for loop on each character in a string in BASH? - for more details; in particular see the solutions provided by users Thunderbeef and Six.]

#!/bin/bash
# make an Associative Array of the 26 letters and values.

declare -A lval=\($(seq 26 | for i in [{a..z}] ; do read x ; echo $i=$x ; done)\)

while read word
do
    # skip words that contain a non-letter
    [[ ! "${word}" =~ ^[a-z]+$ ]] && continue

    sum=0

    # process ${word} one character at a time

    while read -n 1 char
    do
        # here string dumps a newline on the end of ${word}, so we'll
        # run a quick test to break out of the loop for a non-letter

        [[ "${char}" != [a-z] ]] && break

        sum=$(( sum + lval[${char}] ))

    # from the referenced SO link - see above - the solutions of interest
    # use process substitution and printf to pass the desired string into
    # the while loop; I've replaced this with the 'here' string and added
    # the test to break the loop when we see the newline character.

    #done < <(printf $s "${word}")
    done <<< "${word}"

    (( sum == 100 )) && \
    echo "${word}"

done < Words.txt

My timings (for the first 10 strings) from running 3 different tests in a linux VM running on an old i5:

  • agc's solution: 37 secs
  • above solution w/ process substitution: 11 secs
  • above solution w/ here string: 2.7 secs

EDIT: Some comments on what the various commands are doing ...

  • $(seq 26 | for/do/read/echo/done) : generates the list of strings "[a]=1 [b]=2 ... [z]=26"

  • declare -A lval=\( $(seq...done) \) : declares lval as an associative array and loads the first 26 entries ([a]=1 [b]=2 ... [z]=26)

  • =~ is used to test for a specific pattern; ^ anchors the match at the beginning of the string, $ anchors it at the end, [a-z] says to match any character between a and z (inclusive), + says to match 1 or more

  • "${word}" =~ ^[a-z]+$ evaluates to true if ${word} is a) made up of only letters a-z and b) has at least one letter

  • ! negates the pattern test; in this case I'm looking for any words that have non-letter characters [NOTE: There are many ways to test for specific patterns; this just happens to be the method I chose to use for this script]

  • [[ ! "${word}" ... ]] && continue: if word contains a non-letter the test generates a true and (&&) then we continue (ie, we're not interested in this word so skip to the next word; in other words, skip to the next iteration of the loop)

  • while read -n 1 char : parse the input (in this case ${word} passed in as a 'here' string) 1 character at a time, putting the resulting string into a variable named 'char'

  • [[ "${char}" != [a-z] ]] && break : another/different pattern matching method; here we're testing the single character ${char} variable to see if it's not a letter and if so (ie, evals to true) then we break out of the current loop; if ${char} is a letter (a-z) then processing continues with the next command in the loop (sum=... in this case)

  • (( sum == 100 )) && \ echo "${word}" : yet another way to run a test; in this case we're testing to see if the sum of the letters is 100; if it evaluates to true then we also echo "${word}" [NOTE: the backslash (\) says to continue the command on the next line]

  • done <<< "${word}" : <<< is called a 'here' string; in this case it allows me to pass the current string (${word}) as an argument to the while read -n 1 char loop

markp-fuso
  • Works really well! Unfortunately when I try it with my original `Words.txt` (which has 370,101 words), it doesn't do anything. I got it from here https://github.com/dwyl/english-words/blob/master/words_alpha.txt it is the words_alpha.txt file. Also, do you know of any good place to learn more about bash syntax? I'm really new to bash so I really don't understand a lot of your code! Thanks! – Erik Jun 25 '17 at 23:43
  • @Erik: I downloaded that file and ran it through my script (the 'here' string version); it found the first 10 words in 3.6 secs [`abactinally`, `abatements`, `abbreviatable`, `abettors`, `abomasus`, `abreption`, `abrogative`, `absconders`, `absinthol`, `absorbancy`]; my copy of `agc`'s script found the same 10 words in ~65 secs; as for why your script "doesn't do anything" ... hard to say without being able to see your code; perhaps you could update your original post with a new section showing the latest version of your script – markp-fuso Jun 26 '17 at 00:07
  • @Erik: as for learning bash ... google is your friend; [NOTE: I normally code in ksh so I had to do a few google searches myself to figure out how to convert my ksh ideas to bash.]; and yes, I **do** understand what all those commands are doing in my script, but not sure which ones you're having problems with, so you'd have to ask – markp-fuso Jun 26 '17 at 00:09
  • added some comments about what the various commands are doing in the script – markp-fuso Jun 26 '17 at 00:40
  • Oh! I wasn't asking if you knew what was happening in your code, I was saying that I barely knew anything in it! Sorry if I offended you. Also, I meant I tried your code on my original dictionary and it just "sits" there not outputting anything. Perhaps there are compatibility issues? I'm using centos. Thank you so much for your help. – Erik Jun 26 '17 at 03:46
  • I actually *don't* agree that Google is anyone's friend for learning bash -- it finds a lot of bad-practice, harmful examples (the "Advanced" Bash Scripting Guide is full of 'em). Much better for folks to work from known-good references -- the [BashGuide](http://mywiki.wooledge.org/BashGuide), the [Bash Hackers' Wiki](http://wiki.bash-hackers.org/), and so forth. – Charles Duffy Jun 26 '17 at 04:59
  • @Erik, ...if you want to see if it's *really* sitting around doing nothing, running `bash -x scriptname` -- perhaps `PS4=':$LINENO+' bash -x scriptname` to log each command with the line of source it came from -- is a good start. – Charles Duffy Jun 26 '17 at 05:01
  • @Erik: no offense taken; I made that comment because, while I had mentioned using google to do a bit of bash research, I made sure I understood what each command was doing (ie, I didn't just cut-n-paste and cross my fingers) ;-) – markp-fuso Jun 26 '17 at 11:51
  • This answer's pure `bash` method works here; compared to my revised answer it's *3x* as fast (i.e. *14s* for the whole of a `man`-based *words.txt* versus *46s*) on my *i3*. – agc Jun 26 '17 at 13:25
  • Well, thanks to everyone for your help, I think I'm going to leave it as that for now. I may go back to these answers and try it out, but for now I think that's everything. – Erik Jun 26 '17 at 16:32
  • The reason this is *really slow* is the `while read -n 1 char` `done <<< "${word}"` loop. That creates a temp file for each and every word being fed to that loop. – dawg Jun 26 '17 at 20:35
  • @dawg: **palm-to-face** ... and since this particular VM is running on top of a TrueCrypt partition (long story), those pesky disk IOs are going to be slower than 'normal'; thanks for the reality check! – markp-fuso Jun 26 '17 at 20:39
  • You can try replacing the heredoc `done <<< "${word}"` with `done < <(printf '%s\n' "$word")` which creates an anonymous FIFO. Better still to just have a C style `for` loop to traverse the string and avoid the issue entirely. – dawg Jun 26 '17 at 20:52
  • @dawg: the process substitution (aka anon FIFO?) eliminates the temp file but actually runs much slower than the here string code; will look at the C style `for` loop, but not a big fan of repeated substrings (though more of a performance issue for longer strings) – markp-fuso Jun 26 '17 at 20:57
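A sketch of the C-style for loop suggested in the last comment (reusing the lval associative array from the answer above; it sidesteps both the temp file and the FIFO):

sum=0
for (( i = 0; i < ${#word}; i++ )); do
    char=${word:i:1}
    (( sum += lval[$char] ))   # array lookup happens entirely in the shell
done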

Let's try this with awk

NOTE: I'm not a heavy user of awk, so there are likely some ways to tweak this for additional speed.

awk '
# initialize an array of character-to-number values
BEGIN {
    # split our alphabet into an array: c[1]=a c[2]=b ... c[26]=z;
    # NOTE: assumes input is all lower case, otherwise we could either
    # add array values for upper case letters or modify processing to
    # convert all characters to lower case ...
    split("abcdefghijklmnopqrstuvwxyz", c, "")

    # build associative array to match letters w/ numeric values:
    # ord[a]=1 ord[b]=2 ... ord[z]=26
    for (i=1; i <= 26; i++) {
        ord[c[i]]=i
    }
}
# now process our file of words
{
    # loop through words; just in case more than 1 word per line (ie, NF > 1)
    word=1
    while ( word <= NF ) {
        sum=0

        # split our word into an array of characters
        split($word, c, "")

        # loop through our array of characters
        for (i=1; i <= length($word); i++) {

            # if not a letter then break out of loop
            if ( c[i] !~ /[a-z]/ ) {
                sum=999
                break
            }

            # add letter to our running sum
            sum=sum + ord[c[i]]

            # if we go over 100 then break
            if ( sum >= 101 )
                break
        } # end of character loop

        if ( sum == 100 )
            print $word

        word++
    } # end of word loop
}' Words.txt

I ran some tests with the entire Words.txt file:

  • my previous bash solution: let's not talk about how really slow my machine is!

  • dawg's bash solution: 3 min 32 secs (about 2x slower than dawg's machine)

  • above awk solution: 3.5 secs (bound to be even faster on anything other than my PC)

markp-fuso
  • `awk`, huh? I've never used `awk` for anything, and yet this looks very simple and easy to understand. Thank you so much for your help! – Erik Jun 26 '17 at 16:51