
Is there a way to speed up the shell script below? It takes a good 40 minutes to update about 150,000 files every day. Sure, given the volume of files to create and update, this may be acceptable; I don't deny that. However, if there is a much more efficient way to write this, or to rewrite the logic entirely, I'm open to it. Please, I'm looking for some help.

    #!/bin/bash
    
    DATA_FILE_SOURCE="<path_to_source_data>/${1}"
    DATA_FILE_DEST="<path_to_dest>"
    
    for fname in $(ls -1 "${DATA_FILE_SOURCE}")
    do
        for line in $(cat "${DATA_FILE_SOURCE}"/"${fname}")
        do
            FILE_TO_WRITE_TO=$(echo "${line}" | awk -F',' '{print $1"."$2".daily.csv"}')
            CONTENT_TO_WRITE=$(echo "${line}" | cut -d, -f3-)
            if [[ ! -f "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}" ]]
            then
                echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
            else
                if ! grep -Fxq "${CONTENT_TO_WRITE}" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                then
                    sed -i "/${1}/d" "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                    echo "${CONTENT_TO_WRITE}" >> "${DATA_FILE_DEST}"/"${FILE_TO_WRITE_TO}"
                fi
            fi
        done
    done
usert4jju7
  • Is `$(ls -1 "${DATA_FILE_SOURCE}")` necessary? Use `${DATA_FILE_SOURCE}` instead – Arnaud Valmary Sep 16 '21 at 12:49
  • `awk` and `cut` may be heavy. You could use bash substitutions: `T1="${line%%,*}"; line="${line#*,}"; T2="${line%%,*}"; line="${line#*,}"; FILE_TO_WRITE="$T1.$T2.daily.csv"; CONTENT_TO_WRITE="$line"` (see the sketch below these comments) – Arnaud Valmary Sep 16 '21 at 12:53
  • I don't understand the `sed` line. If the dest file exists and it does not already contain the data, delete lines containing the name of the source file? Is that right? – glenn jackman Sep 16 '21 at 13:29
  • this would be a lot easier if you were to provide a couple (small) sample input files, describe the operation, and show the expected results – markp-fuso Sep 16 '21 at 16:09
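
A minimal runnable sketch of the substitution trick from the comment above, using a made-up sample line in the question's `field1,field2,rest` format:

    #!/bin/bash

    line='ABC,XYZ,2021-09-16,42.5'         # hypothetical input line
    T1="${line%%,*}"; line="${line#*,}"    # T1=ABC, drop field 1
    T2="${line%%,*}"; line="${line#*,}"    # T2=XYZ, drop field 2
    FILE_TO_WRITE="$T1.$T2.daily.csv"      # ABC.XYZ.daily.csv
    CONTENT_TO_WRITE="$line"               # 2021-09-16,42.5
    printf '%s\n' "$FILE_TO_WRITE" "$CONTENT_TO_WRITE"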

2 Answers


There are still parts of your published script that are unclear, like the `sed` command. Nevertheless, I rewrote it with saner practices and far fewer external calls, which should really speed it up.

    #!/usr/bin/env sh

    DATA_FILE_SOURCE="<path_to_source_data>/$1"
    DATA_FILE_DEST="<path_to_dest>"

    for fname in "$DATA_FILE_SOURCE/"*; do
      # Split each line on the first two commas; the remainder lands in $content.
      while IFS=, read -r a b content || [ "$a" ]; do
        destfile="$DATA_FILE_DEST/$a.$b.daily.csv"
        # Append only when the exact line is not already present in the dest file.
        if [ ! -f "$destfile" ] || ! grep -Fxq "$content" "$destfile"; then
          [ -f "$destfile" ] && sed -i "/$1/d" "$destfile"
          printf '%s\n' "$content" >>"$destfile"
        fi
      done < "$fname"
    done
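
Two details worth noting: the `|| [ "$a" ]` guard keeps the loop from silently dropping a final line that lacks a trailing newline, and `printf '%s\n'` sidesteps `echo`'s inconsistent handling of options and backslash escapes.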
Léa Gris
  • These are great suggestions. @usert4jju7, having to run `grep` AND `sed` for every line of every file is what's really killing the performance. You have 150,000 files. If each file has 1,000 lines, then you're invoking grep 150,000,000 times. – glenn jackman Sep 16 '21 at 14:06
  1. Make it parallel (as much as you can).

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    # ...
    
    for fname in "${DATA_FILE_SOURCE}/"*; do
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
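        # NB: 'wait -p' requires bash >= 5.1; 'wait -n' requires bash >= 4.3.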
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
      fi
    
      while IFS= read -r line; do
        FILE_TO_WRITE_TO="..."
        # ...
      done < "${fname}" &  # forking here
      pids[$!]="${fname}"
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    

    Here’s a directly runnable skeleton showing how the harness above works (with 36 items to process and 20 parallel processes at most):

    #!/bin/bash
    set -e -o pipefail
    
    declare -ir MAX_PARALLELISM=20  # pick a limit
    declare -i pid
    declare -a pids
    
    do_something_and_maybe_fail() {
      sleep $((RANDOM % 10))
      return $((RANDOM % 2 * 5))
    }
    
    for fname in some_name_{a..f}{0..5}.txt; do  # 36 items
      if ((${#pids[@]} >= MAX_PARALLELISM)); then
        wait -p pid -n || echo "${pids[pid]} failed with ${?}" 1>&2
        unset 'pids[pid]'
      fi
    
      do_something_and_maybe_fail &  # forking here
      pids[$!]="${fname}"
      echo "${#pids[@]} running" 1>&2
    done
    
    for pid in "${!pids[@]}"; do
      wait -n "$((pid))" || echo "${pids[pid]} failed with ${?}" 1>&2
    done
    
  2. Strictly avoid external processes (such as awk, grep and cut) when processing individual lines. fork()ing a process per line is extremely inefficient in comparison to:

    • Running one single awk / grep / cut process on an entire input file (to preprocess all lines at once for easier processing in bash) and feeding the whole output into (e.g.) a bash loop; a sketch of this follows the list.
    • Using Bash expansions instead, where feasible, e.g. "${line/,/.}" and other tricks from the EXPANSION section of the man bash page, without fork()ing any further processes.
  3. Off-topic side notes:

    • ls -1 is unnecessary. First, ls won’t write multiple columns unless the output is a terminal, so a plain ls would do. Second, bash expansions are usually a cleaner and more efficient choice. (You can use nullglob to correctly handle empty directories / “no match” cases.)

    • Looping over the output from cat is a (less common) useless use of cat case. Feed the file into a loop in bash instead and read it line by line. (This also gives you more line format flexibility.)
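
To make point 2 concrete, here is a minimal sketch of the single-awk idea, assuming the question's comma layout (fields 1 and 2 name the destination file, the rest of the line is the content). It only deduplicates within one run and leaves out the unclear `sed` deletion, so it illustrates the principle rather than being a drop-in replacement; it also replaces `ls` with a glob, per point 3:

    #!/bin/bash

    DATA_FILE_SOURCE="<path_to_source_data>/${1}"
    DATA_FILE_DEST="<path_to_dest>"

    # One awk process handles every line of every source file,
    # so nothing fork()s per line.
    awk -F',' -v dest="${DATA_FILE_DEST}" '
      {
        outfile = dest "/" $1 "." $2 ".daily.csv"
        # Rebuild fields 3..NF as the content (inner commas are kept).
        content = $3
        for (i = 4; i <= NF; i++) content = content FS $i
        # Append only lines not yet written to this file in this run.
        if (!seen[outfile, content]++) print content >> outfile
        # close() keeps the number of open file descriptors bounded;
        # ">>" reopens in append mode, so nothing is truncated.
        close(outfile)
      }
    ' "${DATA_FILE_SOURCE}"/*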

Andrej Podzimek
  • Hey, thank you Andrej. Could you please apply this to my specific case instead of a generic example? – usert4jju7 Sep 16 '21 at 17:00
  • @usert4jju7 No, obviously, because I don’t have your input data. However, it should be straightforward to modify your script based on the same principles: Run the inner loop in the background and keep track of the number (and return statuses) of background processes. (Presumably, if `FILE_TO_WRITE_TO` is not unique, a few extra precautions would be needed.) – Andrej Podzimek Sep 17 '21 at 04:13