I am trying to store the values matched by an awk pattern in a shell array variable. Here's a simplified example:

#!/bin/bash
declare -a array1=()
declare -a array2=()
READ_FILE="directory1/read_file.csv"
WRITE_FILE="directory2/results.csv"

#variable for counting array index
count1=0
count2=0
#
#
# need help with line below
# $2 below is the second set of characters which is a floating point number
awk -F 'string1_to_search' '{$array1[count1++] = $2}' $READ_FILE 
awk -F 'string2_to_search' '{$array2[count2++] = $2}' $READ_FILE 
#count++ indicates post increment of count variable

#do something with the array
.
.
#end

Any suggestions would be helpful.

ggulgulia
  • Awk doesn't really have access to the shell's variables, or vice versa. Could you perhaps refactor your problem to do all your processing in an Awk script? Or conversely, have Awk process the file once and print the results in a form which the shell can parse directly. But I'm thinking maybe the proper solution is to switch to a modern scripting language like Python if your requirements are nontrivial. – tripleee Jan 07 '18 at 16:41
  • No, I cannot do it in Python. I can refactor my problem, but I need to do it in bash. The problem is that I am no bash expert. – ggulgulia Jan 07 '18 at 16:50
  • Can you outline the broader purpose of this script then? Does it *require* these arrays to be Bash arrays? An Awk script is probably the easiest way to refactor this, but if you need features which are not available in Awk, that complicates things (though you *can* call external commands from Awk, too). – tripleee Jan 07 '18 at 16:58
  • I need to run this on a supercomputer, where it will parse several output files and calculate the mean and variance of the run time. – ggulgulia Jan 07 '18 at 17:00
  • You are using `bash` to do data analysis on a supercomputer? – chepner Jan 07 '18 at 17:03
  • yes. It is trivial and I need to do it in bash so I am doing it in bash. – ggulgulia Jan 07 '18 at 17:04
  • Mean and variance means you need floating-point arithmetic, which `bash` doesn't do. Shift the parsing to whatever language is doing the actual calculations. – chepner Jan 07 '18 at 17:22
  • I already calculated the mean and square roots using bash, and the results are absolutely correct. A colleague has also already computed the mean and variance using bash. My last resort would be asking the colleague! – ggulgulia Jan 07 '18 at 17:25
  • Well, it's a good thing you have a supercomputer handy then :-). If you'd like to be able to do the analysis on a laptop instead, then see the suggestions above and https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice. Oh, and as already stated, bash does not do floating-point arithmetic, so your colleague's code must already be calling an external tool if it is doing so. – Ed Morton Jan 07 '18 at 18:12
  • Thanks. I am a student at the Technical University of Munich and they gave us access to SuperMUC :D. I calculated the mean run time of 20 samples, cross-checked it using another tool, and it was correct to 4 decimal places. But now I must check whether it can do the other calculations. – ggulgulia Jan 07 '18 at 18:26
  • So, upon further research, I found out that bash cannot handle floating-point numbers, but there are other tools like gawk and bc that can, and apparently my so-called colleague used the bc tool to calculate the statistical quantities. I myself used gawk to calculate the mean. But the crux is: it can be done in bash. – ggulgulia Jan 08 '18 at 08:33
  • You just said `bash cannot handle floating point numbers` and then `it can be done in bash`. Neither bc nor gawk are bash. Everyone is telling you you need to use a tool other than bash and you're arguing that you can/must use bash while telling us you are already using a tool other than bash. No-one is suggesting you can't call the external tools from bash but you're confusing us by insisting that `i need to do it in bash`. Just do it all in awk. – Ed Morton Jan 08 '18 at 12:46
  • I admitted I am no expert at bash or these tools. I just learned all of it yesterday, and the very reason I am reaching out to this community is that I don't know it. In my post I demonstrated the code using awk, which implied I could call external tools from bash. I think I forgot to say in my comment that my colleague used the bc tool from within bash, and this is what I am supposed to do too. I admit I was not clear, but this lack of clarity is due to my lack of knowledge and not a lack of clarity of thought. – ggulgulia Jan 08 '18 at 20:04

2 Answers

Something roughly like this, then?

awk '/string1_to_search/ {
        count["id1"]++; sum["id1"] += $2 }
    /string2_too/ {
        count["id2"]++; sum["id2"] += $2 }
    # ...
    END { for (k in count) printf("%s: sum %f/count %i = avg %f\n", k, sum[k], count[k], sum[k]/count[k]) }' inputfile

I seem to recall there was a clever way to calculate a rolling variance without keeping the entire input set in memory; or else just collect the values space-separated (`value["id"] = value["id"] " " $2`), then split them into a list and loop over it near the end. Alternatively, simplify this to only examine one search string at a time and run it multiple times (let's hope then the input isn't very big). Or switch to Perl, which will easily let you collect lists of lists and other nested structures.
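One well-known method of this kind is Welford's online algorithm, which may or may not be the one alluded to above. The sketch below applies it to a single search string. The sample file, its "label value" layout, and the variable names (`sample_input`, `welford_stats`, `m2`) are assumptions made for illustration, not part of the original question.

```bash
#!/bin/bash
# Hypothetical sample data: "<label> <run time>" per line.
sample_input=$(mktemp)
printf '%s\n' 'string1_to_search 2.0' \
              'unrelated_line 99.0' \
              'string1_to_search 4.0' \
              'string1_to_search 6.0' > "$sample_input"

# Welford online update: keep only the count (n), the running mean,
# and the running sum of squared deviations (m2); no values are stored.
welford_stats=$(awk '
    /string1_to_search/ {
        n++
        d = $2 - mean          # deviation from the old mean
        mean += d / n          # updated running mean
        m2 += d * ($2 - mean)  # uses both the old and the new mean
    }
    END {
        # Sample variance (divides by n - 1) needs at least two values.
        if (n > 1)
            printf("n %d mean %.4f variance %.4f\n", n, mean, m2 / (n - 1))
    }' "$sample_input")

printf '%s\n' "$welford_stats"
rm -f "$sample_input"
```

Dividing `m2` by `n` instead of `n - 1` would give the population variance; which of the two is wanted depends on the analysis.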

Obviously break out common functionality into separate functions so you don't have repeated code ... I suppose it's actually clearer like this, but if you find bugs, or need other changes, you only want to have to change one place in the code.
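As a rough sketch of that refactoring, the repeated count-and-sum logic can move into one awk function. The helper name `tally`, the sample file, and the shell variable `averages` are hypothetical; the search patterns and the output format follow the snippet above, with `%d` in place of `%i`.

```bash
#!/bin/bash
# Hypothetical sample data: "<label> <run time>" per line.
sample_input=$(mktemp)
printf '%s\n' 'string1_to_search 2.0' \
              'string2_too 10.0' \
              'string1_to_search 4.0' > "$sample_input"

averages=$(awk '
    # tally() is a hypothetical helper: one place to change the bookkeeping.
    function tally(key, value) { count[key]++; sum[key] += value }

    /string1_to_search/ { tally("id1", $2) }
    /string2_too/       { tally("id2", $2) }

    END {
        for (k in count)
            printf("%s: sum %f/count %d = avg %f\n",
                   k, sum[k], count[k], sum[k] / count[k])
    }' "$sample_input" | sort)   # sort: awk does not guarantee "for (k in ...)" order

printf '%s\n' "$averages"
rm -f "$sample_input"
```

Adding a third search string then takes one extra pattern line, and any change to the statistics happens only inside `tally` and the `END` block.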

tripleee
  • Thanks, this seems very close to what I want to get. I will let you know if I am able to get this working :) – ggulgulia Jan 07 '18 at 18:29
  • Obviously break out common functionality into separate functions so you don't have repeated code ... I meant to mention that in the answer but I suppose it's actually clearer like this. – tripleee Jan 07 '18 at 19:38
  • I did it using another method, but this works too. – ggulgulia Jan 08 '18 at 20:05
  • I never tire of referring people to [this](https://stackoverflow.com/a/9790156/1072112), too. :) – ghoti Jan 08 '18 at 20:23

Another way to do it is to have awk print the numbers, which can then be read into a bash array variable like this:

mapfile -t array1 < <( awk -F 'string1_to_search' '{print $2}' "$READ_FILE" )

Later, the mean, variance, and standard deviation can be computed by calling the bc tool from within the bash script.
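A minimal sketch of the whole round trip, under some stated assumptions: the sample file contents are made up, the field separator gains a trailing ` *` so the captured value has no leading space, and an `NF > 1` guard skips lines that do not contain the search string (without it, such lines would add empty elements to the array). The arithmetic here is done by piping the array into awk rather than bc, purely to keep the example self-contained; bc could be substituted for that step. `mapfile` requires bash 4 or later.

```bash
#!/bin/bash
# Hypothetical stand-in for the READ_FILE used in the question.
READ_FILE=$(mktemp)
printf '%s\n' 'string1_to_search 2.0' \
              'unrelated line' \
              'string1_to_search 4.0' \
              'string1_to_search 6.0' > "$READ_FILE"

# Collect the value after the search string into a bash array.
# NF > 1 is true only for lines that contain the separator.
mapfile -t array1 < <(awk -F 'string1_to_search *' 'NF > 1 { print $2 }' "$READ_FILE")

# Feed the array, one value per line, to an external tool for the
# floating-point work (awk here; an assumption, the text mentions bc).
stats=$(printf '%s\n' "${array1[@]}" | awk '
    { n++; sum += $1; sumsq += $1 * $1 }
    END {
        if (n < 2) exit 1                        # sample variance needs two values
        mean = sum / n
        var  = (sumsq - n * mean * mean) / (n - 1)
        printf("mean %.4f variance %.4f sd %.4f\n", mean, var, sqrt(var))
    }')

printf '%s values: %s\n' "${#array1[@]}" "$stats"
rm -f "$READ_FILE"
```

The sum-of-squares formula is simple but can lose precision when the values are large and close together; a running update such as Welford's method avoids that.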

ggulgulia
  • There could be external factors which are not obvious here which make `bc` a good choice, but with what you've told us here, I think there is a consensus that using an Awk script to collect the values and perform these calculations would seem like a better approach. – tripleee Jan 08 '18 at 20:23