Calculating the median value from a TSV file for each month

Question

I have a TSV file and I am trying to perform some analysis. I have a month column and the data is numbers 1-12 which corresponds to the month (Jan=1, Feb=2 etc). I am trying to setup a counter so that every time the code reads a 1 in the 6th column, it adds to Jan count, every 2 it reads, +1 to Feb count etc. Once all of the data has been iterated through. I need to find the median number of each month. I have some echo statements placed to troubleshoot. Here is my code and what it outputs:

breaches_per_month(){
    input_file=$1
    original_data=$(cat "input_file")

    #Setup of month array
    months=("Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec")
        # Read the input file line by line and increment the count for the respective month

    #Initialise count variables for each month
    declare -A month_counts
    for month in "${months[@]}"; do
        month_counts[$month]=0
    done
 
    # Read column 6 and increment count for corresponding month
    while IFS=$'\t' read -r _ _ _ _ _ month _; do
        case $month in
            1) month_counts["Jan"]=$((month_counts["Jan"] + 1));;
            2) month_counts["Deb"]=$((month_counts["Feb"] + 1));;
            3) month_counts["Mar"]=$((month_counts["Mar"] + 1));;
            4) month_counts["Apr"]=$((month_counts["Apr"] + 1));;
            5) month_counts["May"]=$((month_counts["May"] + 1));;
            6) month_counts["Jun"]=$((month_counts["Jun"] + 1));;
            7) month_counts["Jul"]=$((month_counts["Jul"] + 1));;
            8) month_counts["Aug"]=$((month_counts["Aug"] + 1));;
            9) month_counts["Sep"]=$((month_counts["Sep"] + 1));;
            10) month_counts["Oct"]=$((month_counts["Oct"] + 1));;
            11) month_counts["Nov"]=$((month_counts["Nov"] + 1));;
            12) month_counts["Dec"]=$((month_counts["Dec"] + 1));;
        esac
    done < "$input_file"

    # Calculate the median count
    counts=("${month_counts[@]}")
    median_count=$(printf '%s\n' "${counts[@]}" | sort -n | awk 'NR == int((length+1)/2) {print}')
    echo "Median count: $median_count"
 
    # Print the counts for each month
    for month in "${months[@]}"; do
        echo "$month: ${month_counts[$month]}"
    done
}
 
breaches_per_month "$1"

This was my output. I have since worked on this when it was uploaded and made improvements. The data has added correctly to the count but it is off by 1 in position. So Jun should have the value of 78, not 60. I suspect this is the case for all of the others so I have a positional error:

cat: input_file: No such file or directory
Median count: 60
Jan: 86
Feb: 88
Mar: 83
Apr: 62
May: 78
Jun: 60
Jul: 65
Aug: 68
Sep: 92
Oct: 95
Nov: 93
Dec: 77

`cat: input_file: No such file or directory` . That's not very encouraging. Learn to make the smallest sample of your problem. We don't really need to see 144 (?) iterations. Read about [mcve] and read the section 'How to turn bad script into a good question' at https://stackoverflow.com/tags/bash/info . Use the `{}` tool from the Edit menu on mouse selected text to achieve `code/data/output/errMsg` formatting. Good luck. — shellter, May 24 '23 at 03:50
Please [edit] to remove the line numbers. They are only impeding any useful copy/paste of the code. — tripleee, May 24 '23 at 04:02
https://stackoverflow.com/questions/6166375/median-of-column-with-awk shows how to calculate the median of a data set with Awk. With GNU Awk, you can do in-memory sort of an array; the code should then be reasonably easy to adapt to collect one array per month. — tripleee, May 24 '23 at 04:06
@tripleee. Just to clarify I want to calculate the median of each type of number in that column of the data set. So I want to calculate the median of the number of times 1 appears, the median of the number of times 2 appears etc. This is because each number corresponds to the month that incident occured in and I need the median incidents of each month — Spoku, May 24 '23 at 05:21
It is still unclear what you mean by "median number for each month". Every count for every month is 1, so the median is 1, unless you start filling up with zeros for those days (or hours? minutes? The counts are more than there are days so your granularity is some other unit) which did not have a match, in which case the median is zero if there were fewer matches than there are days (or etc) in the month (if I am guessing your input data structure correctly). — tripleee, May 24 '23 at 06:10
Please share a few lines of your TSV and the corresponding expected result. — Mark Setchell, May 24 '23 at 07:48

tripleee · Answer 1 · 2023-05-24T06:02:14.150

The central error here seems to be that length is not the number of input lines in total in Awk. You are extracting some bogus value which depends on the length of the current input line, probably several times for some inputs.

Here is a hopefully fixed median calculation:

    # Calculate the median count
    median_count=$(printf '%s\n' "${month_counts[@]}" | sort -n | awk -v len="${#month_counts[@]}" 'NR == int((len+1)/2) {print}')
    echo "Median count: $median_count"

This passes the actual number of elements in month_counts to Awk to be used as the index for the line to print, and coincidentally also avoids the silly, unnecessary copying of the array into counts for no perceptible value.

The big ugly repetitive case statement should be easy to avoid; you already have a months array, so just manipulate month_counts[months[month-1]]

Also, the typo with the missing dollar sign in the cat is obvious, but also, the value is never used for anything, so probably just delete that line altogether.

The whole script looks like it's pining to be reimplemented in Awk altogether, but that's a more challenging refactoring. GNU Awk provides built-in sorting functions (which are not part of standard Awk) which should help. (Conversely, if you don't have GNU Awk, maybe don't embark on this bigger project.)

the reason why I have that value with the cat typo is to refer back incase I need anything as was recommended by a friend. We do have access to GNU Awk but I am unfamiliar with it, "asort" was mentioned as being useful but I haven't gotten to understanding that yet. — Spoku, May 24 '23 at 07:11
If the input comes from a file, just having the file name is sufficient for obtaining that data at any time if you should need to. — tripleee, May 24 '23 at 07:16
If you need help with an Awk reimplementation, probably post a new question with your code so far, and ask for specific help where you are stuck. Perhaps link back to this question for context. — tripleee, May 24 '23 at 07:18

score 1 · Answer 2 · answered May 24 '23 at 10:49

since you didn't provide input data sample you're making hard for people to help you.

to simplify to the essence of the problem, assume a input file in that format (month#, data)

the script expects a sorted input in terms of the month column (here column one, yours will be different) and finds the median value of column two (agains yours will be different).

$ sort -n file | 
  awk 'NR>1 && p!=$1 {print p,a[int((k+1)/2)]; delete a; k=0}
                     {p=$1; a[++k]=$2} 
       END           {print p,a[int((k+1)/2)]}'

which will print

note that for even number of data points you may want to print the average of the two values instead of picking the low indexed value.

e.g. there are two points for month 3

3 60
3 79

so, most likely you'll want the median (60+79)/2, instead of 60. To achieve that, change printed value to (a[int((k+1)/2)]+a[int(k/2)+1])/2 which will only effect the median of even number of values.

score 0 · Answer 3 · answered May 24 '23 at 09:32

have a TSV file and I am trying to perform some analysis(...)need to find the median

If you are not strictly limited to GNU AWK you might find GNU datamash more handy, as it does provide median (and many other statistical functions). If you have file.tsv with TAB-separated values as follows

Jan 86
Feb 88
Mar 83
Apr 62
May 78
Jun 60
Jul 65
Aug 68
Sep 92
Oct 95
Nov 93
Dec 77

then you might compute median by doing

datamash median 2 < file.tsv

which gives output

80.5

Explanation: Instruct GNU datamash to give median of 2nd column.

(tested in GNU datamash 1.7)

That's apparently not what the OP's input data looks like, though. — tripleee, May 24 '23 at 10:02

score 0 · Answer 4 · answered May 24 '23 at 12:36

mawk '{ __[$1] = __[$1]" "$2 
} END { 
    for(_=_<_;_++ < 12;) { 
        if (_ in __) {
            ORS = ""
            print "    month : "(_)"   | unsorted data : ",
                 $0 = __[_], "\f\r   median : "

            ORS = "\n"
            if (NF < 3) { 

                print NF==1 ? $0 : ($1 + $2) / 2 

            } else { 
                split($0, ____)
                NF = 0

                for(_____ in ____) {
                     ___ = +____[_____]
                    $___ = ___ $___ 
                }
                $0 = $0
                NF = NF
                print NF % 2 ? $(++NF/2) \
                             : ($(___ = NF/2) + $++___)/ 2 } } } }'

    month : 1   | unsorted data :   39 51 10 
   median : 39
    month : 2   | unsorted data :   76 24 80 29 97 77 
   median : 76.5

    month : 3   | unsorted data :   79 60 
   median : 69.5
    month : 4   | unsorted data :   19 55 62 91 71 40 21 83 52 
   median : 55

Calculating the median value from a TSV file for each month

4 Answers4