Bash: Reading CSV text file and finding average of rows

Question

This is the sample input (the data has user-IDs and the number of hours spent by the user):

Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0

I need to read the data, find all User-IDs ending in even numbers (2,4,6,8..) and find average number of hours spent (over five days).

I wrote the following script:

hoursarray=(0,0,0,0,0)
while IFS=, read -r col1 col2 col3 col4 col5 col6 col7 || [[ -n $col1 ]]
do
    if [[ $col2 == *"2" ]]; then
        #echo "$col2"
        ((hoursarray[0] = col3 + col4 + col5 + col6 + col7))
    elif  [[ $col2 == *"4" ]]; then 
        #echo "$col2"
        ((hoursarray[1] = hoursarray[1] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"6" ]]; then
        #echo "$col2"
        ((hoursarray[2] = hoursarray[2] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"8" ]]; then
        #echo "$col2"
        ((hoursarray[3] = hoursarray[3] + col3 + col4 + col5 + col6 + col7))
    elif [[ $col2 == *"10" ]]; then
        #echo "$col2"
        ((hoursarray[4] = hoursarray[4] + col3 + col4 + col5 + col6 + col7))
    fi
done < <(tail -n+2 user-list.txt)
echo ${hoursarray[0]}
echo "$((hoursarray[0]/5))"

This is not a very good way of doing this. Also, the numbers arent adding up correctly.

I am getting the following output (for the first one - user2):

27
5

I am expecting the following output:

27
5.4

What would be a better way to do it? Any help would be appreciated.

TIA

given the sample input, please update the question with a) the (wrong) output being generated by your script and b) the (correct) desired output — markp-fuso, Nov 20 '20 at 16:22
`hoursarray=(0,0,0,0,0)` generates an array with a single item containing the value `0,0,0,0,0` ... `typeset -p hoursarray` => `declare -a hoursarray=([0]="0,0,0,0,0")` ; replace the commas with spaces in order to generate a 5-element array ... `hoursarray=(0 0 0 0 0) ; typeset -p hoursarray` => `declare -a hoursarray=([0]="0" [1]="0" [2]="0" [3]="0" [4]="0")` ... then see if this helps any with the follow-on code — markp-fuso, Nov 20 '20 at 16:25
not saying I'd do it this way but `[[ $col2 == *"10" ]]` should probably be `[[ $col2 == *"0" ]]` so as to match on `*20`, `*30`, `*200`, etc — markp-fuso, Nov 20 '20 at 16:28
Bash does not have floating point arithmetic BTW so you will never get a floating point result... — dawg, Nov 20 '20 at 16:40
are you looking for the average for each day (ie, 5x averages) or the average across all days (ie, 1x average)?; please update the question to show the expected results for the sample output, ie, you've shown `user2` but have left out `user4` ... so what would the final results look like where you have multiple users? — markp-fuso, Nov 20 '20 at 16:42
Does this answer your question? [Floating point results in Bash integer division](https://stackoverflow.com/questions/15015809/floating-point-results-in-bash-integer-division) — dawg, Nov 20 '20 at 17:06

dawg · Answer 1 · 2020-11-20T17:09:11.127

You issue is echo "$((hoursarray[0]/5))" Bash does not have floating point, so it returns the integer portion only.

Easy to demonstrate:

$ hours=27
$ echo "$((hours/5))"
5

If you want to stick to Bash, you could use bc for the floating point result:

$ echo "$hours / 5.0" | bc -l
5.40000000000000000000

Or use awk, perl, python, ruby etc.

Here is an awk you can parse out. Easily modified to you use (which is a little unclear to me)

awk -F, 'FNR==1{print $2; next} 
     {arr[$2]+=($3+$4+$5+$6+$7) }   
     END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file

Prints:

User ID
User1       27  5.4
User2       27  5.4
User3       22  4.4
User4       20  4
User5       40  8

If you only want even users, filter for User that end in any of 0,2,4,6,8:

awk -F, 'FNR==1{print $2; next} 
         $2~/[24680]$/ {arr[$2]+=($3+$4+$5+$6+$7) } 
         END{ for (e in arr) print e "\t\t" arr[e] "\t" arr[e]/5 }' file

Prints:

User ID
User2       27  5.4
User4       20  4

Thanks! But the script needs to print averages for only IDs ending in even numbers (User2, User4, ....). — saadurr, Nov 20 '20 at 16:54

tink · Accepted Answer · 2020-11-20T16:59:57.017

0

Your description is fairly imprecise, but here's an attempt primarily based on the sample output:

awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};print a;printf "%.2g\n",a/5; a=0}' file 
20
4
27
5.4

$2~/[24680]$/ makes sure we only look at "even" user-IDs.

for(i=3;i<=7;i++){} iterates over the day columns and adds them.

Edit 1: Accommodating new requirement:

awk -F, '$2~/[24680]$/{for(i=3;i<=7;i++){a+=$i};printf "%s\t%.2g\n",$2,a/5;a=0}' saad 
User4   4
User2   5.4

edited Nov 20 '20 at 16:59

answered Nov 20 '20 at 16:48

tink

14,342
4
46
50

Is there any way that only user-ids ending in an even number can be printed only? – saadurr Nov 20 '20 at 16:52
I did that ... why do you think it doesn't do that? the output you're seeing is only for user 2 and 4 – tink Nov 20 '20 at 16:53
I tried to run the script. Its printing out averages for all user-ids – saadurr Nov 20 '20 at 16:56
How can I modify it to print in `userID averageOfUserID` format? – saadurr Nov 20 '20 at 16:56
No, it doesn't. Not with your sample data; if reality differs from the sample fix the sample. And yes, that's easy; but again, it doesn't match the "expected output" in your question. – tink Nov 20 '20 at 16:57
One thing, can you explain this part a bit more? `$2~/[24680]$/` – saadurr Nov 20 '20 at 17:04
Sure ... `$2` is the column that holds the user ID. `~` is a regular expression match. `[24680]$` is the regular expression. What all together means is: only process lines where the text in the 2nd column ends in an even number `$` means "end of field". – tink Nov 20 '20 at 17:07

markp-fuso · Answer 3 · 2020-11-20T17:07:00.943

Sample data showing userIDs with even and odd endings, userID showing up more than once (eg, User2), and some non-integer values:

$ cat user-list.txt
Computer ID,User ID,M,T,W,T,F
Computer1,User3,5,7,3,5,2
Computer2,User5,8,8,8,8,8
Computer3,User4,0,8,0,8,4
Computer4,User1,5,4,5,5,8
Computer5,User2,9,8,10,0,0
Computer5,User120,9,8,10,0,0
Computer5,User2,4,7,12,3.5,1.5

One awk solution to find total hours plus averages, across 5x days, with duplicate userIDs rolled into a single set of numbers, but limited to userIDs that end in an even number:

$ awk -F',' 'FNR==1 { next } $2 ~ /[02468]$/ { tot[$2]+=($3+$4+$5+$6+$7) } END { for ( i in tot ) { print i, tot[i], tot[i]/5 } }' user-list.txt

Where:

-F ',' - use comma as input field delimiter
FNR==1 { next } - skip first line
$2 ~ /[02468]$/ - if field 2 ends in an even number
tot[$2]+=($3+$4+$5+$6+$7) - add current line's hours to array where userID is the array index; this will add up hours from multiple input lines (for same userID) into a single array cell
for (...) { print ...} - loop through array indices printing the index, total hours and average hours (total divided by 5)

The above generates:

User120 27 5.4
User2 55 11
User4 20 4

Depending on OPs desired output the print can be replaced with printf and the desired format string ...

score 0 · Answer 4 · answered Nov 20 '20 at 18:04

0

 Here is your script modified a little bit:
  
 while IFS=, read -r col1 col2 col3 || [[ -n $col1 ]]
 do
       (( $(sed 's/[^[:digit:]]*//' <<<$col2) % 2 )) || ( echo -n "For $col1 $col2 average is: " && echo "($(tr , + <<<$col3))/5" | bc -l )
 done < <(tail -n+2 list.txt)

prints:

 For Computer3 User4 average is: 4.00000000000000000000
 For Computer5 User2 average is: 5.40000000000000000000

answered Nov 20 '20 at 18:04

Archil Elizbarashvili

29
2

This is Bash way solution. If you want to gain the execution time AWK script would be much more performant. – Archil Elizbarashvili Nov 20 '20 at 18:17

Bash: Reading CSV text file and finding average of rows

4 Answers4