
Is there a way to speed up the shell script below? It takes a good 40 minutes to update about 150,000 files every day. Sure, given the volume of files to create and update, this may be acceptable; I don't deny that. However, if there is a much more efficient way to write this, or to rewrite the logic entirely, I'm open to it. Please, I'm looking for some help.

    #!/bin/bash
    
    DATA_FILE_SOURCE="<path_to_source_data>/${1}"
    DATA_FILE_DEST="<path_to_dest>"
    
    for fname in $(ls -1 "${DATA_FILE_SOURCE}")
    do
        for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
        do
            FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
            CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
            if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
            then
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            else
                if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                then
                    sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                    echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                fi
            fi
        done
    done
usert4jju7
  • Is `$(ls -1 "${DATA_FILE_SOURCE}")` necessary? Use `${DATA_FILE_SOURCE}` instead – Arnaud Valmary Sep 16 '21 at 12:49
  • `awk` and `cut` may be heavy. You could use bash substitutions: `T1="${line%%,*}"; line="${line#*,}"; T2="${line%%,*}"; line="${line#*,}"; FILE_TO_WRITE="$T1.$T2.daily.csv"; CONTENT_TO_WRITE="$line"` (see the sketch below these comments) – Arnaud Valmary Sep 16 '21 at 12:53
  • I don't understand the `sed` line. If the dest file exists and it does not already contain the data, delete lines containing the name of the source file? Is that right? – glenn jackman Sep 16 '21 at 13:29
  • this would be a lot easier if you were to provide a couple (small) sample input files, describe the operation, and show the expected results – markp-fuso Sep 16 '21 at 16:09
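
A minimal runnable sketch of the substitution trick from the comment above, using a made-up sample line in the question's `field1,field2,rest` format:

    #!/bin/bash

    line='ABC,XYZ,2021-09-16,42.5'         # hypothetical input line
    T1="${line%%,*}"; line="${line#*,}"    # T1=ABC, drop field 1
    T2="${line%%,*}"; line="${line#*,}"    # T2=XYZ, drop field 2
    FILE_TO_WRITE="$T1.$T2.daily.csv"      # ABC.XYZ.daily.csv
    CONTENT_TO_WRITE="$line"               # 2021-09-16,42.5
    printf '%s\n' "$FILE_TO_WRITE" "$CONTENT_TO_WRITE"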

2 Answers


There are still parts of your published script that are unclear, like the `sed` command. Nevertheless, I rewrote it with saner practices and far fewer external calls, which should really speed it up.

    #!/usr/bin/env sh

    DATA_FILE_SOURCE="<path_to_source_data>/$1"
    DATA_FILE_DEST="<path_to_dest>"

    for fname in "$DATA_FILE_SOURCE/"*; do
      # Split each line on the first two commas; the remainder lands in $content.
      while IFS=, read -r a b content || [ "$a" ]; do
        destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
        # Append only when the exact line is not already present in the dest file.
        if [ ! -f "$destfile" ] || ! grep -Fxq "$content" "$destfile"; then
          [ -f "$destfile" ] && sed -i "/$1/d" "$destfile"
          printf '%s\n' "$content" >>"$destfile"
        fi
      done < "$fname"
    done
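
Two details worth noting: the `|| [ "$a" ]` guard keeps the loop from silently dropping a final line that lacks a trailing newline, and `printf '%s\n'` sidesteps `echo`'s inconsistent handling of options and backslash escapes.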
Léa Gris
  • These are great suggestions. @usert4jju7, having to run `grep` AND `sed` for every line of every file is what's really killing the performance. You have 150,000 files. If each file has 1,000 lines, then you're invoking grep 150,000,000 times. – glenn jackman Sep 16 '21 at 14:06
  1. Make it parallel (as much as you can).

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    # ...
    
    for fname in "${DATA_FILE_SOURCE}/"*; do
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
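        # NB: 'wait -p' requires bash >= 5.1; 'wait -n' requires bash >= 4.3.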
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
      fi
    
      while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
      done < "${fname}" &  # forking here
      pids[$!]="${fname}"
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    

    Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    do_something_and_maybe_fail() {
      sleep $((RANDOM % 10))
      return $((RANDOM % 2 * 5))
    }
    
    for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
      fi
    
      do_something_and_maybe_fail &  # forking here
      pids[$!]="${fname}"
      echo "${#pids[@]} running" 1>&2
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    
  2. Strictly avoid external processes (such as awk, grep and cut) when processing individual lines. fork()ing a process per line is extremely inefficient in comparison to:

    • Running one single awk / grep / cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop; a sketch of this follows the list.
    • Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
  3. Off-topic side notes:

    • ls -1 is unnecessary. First, ls won’t write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / “no match” cases.)

    • Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.)
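
To make point 2 concrete, here is a minimal sketch of the single-awk idea, assuming the question's comma layout (fields 1 and 2 name the destination file, the rest of the line is the content). It only deduplicates within one run and leaves out the unclear `sed` deletion, so it illustrates the principle rather than being a drop-in replacement; it also replaces `ls` with a glob, per point 3:

    #!/bin/bash

    DATA_FILE_SOURCE="<path_to_source_data>/${1}"
    DATA_FILE_DEST="<path_to_dest>"

    # One awk process handles every line of every source file,
    # so nothing fork()s per line.
    awk -F',' -v dest="${DATA_FILE_DEST}" '
      {
        outfile = dest "/" $1 "." $2 ".daily.csv"
        # Rebuild fields 3..NF as the content (inner commas are kept).
        content = $3
        for (i = 4; i <= NF; i++) content = content FS $i
        # Append only lines not yet written to this file in this run.
        if (!seen[outfile, content]++) print content >> outfile
        # close() keeps the number of open file descriptors bounded;
        # ">>" reopens in append mode, so nothing is truncated.
        close(outfile)
      }
    ' "${DATA_FILE_SOURCE}"/*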

Andrej Podzimek
  • Hey, thank you Andrej. Could you please apply this to my specific case instead of a generic example? – usert4jju7 Sep 16 '21 at 17:00
  • @usert4jju7 No, obviously, because I don’t have your input data. However, it should be straightforward to modify your script based on the same principles: Run the inner loop in the background and keep track of the number (and return statuses) of background processes. (Presumably, if `FILE_TO_WRITE_TO` is not unique, a few extra precautions would be needed.) – Andrej Podzimek Sep 17 '21 at 04:13