1

I have data that looks like this:

 abc.com  Hello World Ann
 abc.com  Hi there friend
 def.com  Hello Sam
 def.com  Hello Dan
 abc.com  Hello World Mary

String B can contain varying text, but I've extracted keywords from it to map against the array below (the keywords are not an exact match of string B):

keywords=( ["Hello World"]="h1" ["Hello"]="h2" ["Hi there"]="h3" )

I want to generate output like this:

A         Key   Count
abc.com   h1    2
abc.com   h3    1
def.com   h2    2

It contains the number of occurrences of each combination of A and a key from the keywords array. I'm new to shell scripting and don't know where to start with the logic. All ideas are highly appreciated! Thanks

user2340345

3 Answers

2

If awk can be considered for this, you could try this:

awk -F' *[AB]: *' '{a[$2","$3]++;next}END{print "A","B","Count";for(i in a){print i,a[i]}}' OFS=',' file | column -t -s','

The -F option sets the field separator to either A: or B:, together with any surrounding spaces.

The array a is filled with the number of occurrences of each A/B string pair.

The END block prints the header, then loops through the array to print each pair and its count.

Finally, the column command displays the result as a table.


In response to the OP's latest change, a possible way forward is to define the strings with the -v option and look them up with the ~ regex match operator.

awk -F' *[AB]: *' -v h1="Hello World" -v h2="Hello" -v h3="Hi there" '$3~h1{a[$2","h1]++;next}$3~h2{a[$2","h2]++;next}$3~h3{a[$2","h3]++;next}END{print "A","Key","Count";for(i in a){print i,a[i]}}' OFS=',' file | column -t -s','
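With a longer keyword list, one -v assignment per keyword gets unwieldy. The following is only a minimal sketch of a two-file lookup, not the answerer's own method. It assumes the labelled `A: ... B: ...` input that the -F separator above implies, and the file names `keywords.txt` and `data.txt` and the function `count_keys` are hypothetical. It reads tab-separated keyword/key pairs first, then counts the first keyword found in each B string, so the more specific keywords should be listed first.

```shell
#!/bin/bash
# Hypothetical file names; both files are created here so the sketch runs as-is.
workdir=$(mktemp -d)

# One "keyword<TAB>key" pair per line, most specific keyword first.
printf '%s\t%s\n' 'Hello World' h1 'Hello' h2 'Hi there' h3 > "$workdir/keywords.txt"

cat > "$workdir/data.txt" <<'DATA'
A: abc.com B: Hello World Ann
A: abc.com B: Hi there friend
A: def.com B: Hello Sam
A: def.com B: Hello Dan
A: abc.com B: Hello World Mary
DATA

count_keys() {
    # $1 = keyword file, $2 = data file
    awk '
        # First file: remember the keywords and their keys in file order.
        NR == FNR {
            split($0, kv, "\t")
            kw[++n] = kv[1]; key[n] = kv[2]
            next
        }
        # Second file: split on the "A:" / "B:" labels, then count the
        # first keyword that occurs as a plain substring of the B text.
        {
            split($0, f, " *[AB]: *")
            for (i = 1; i <= n; i++)
                if (index(f[3], kw[i])) { cnt[f[2] " " key[i]]++; break }
        }
        END { for (k in cnt) print k, cnt[k] }
    ' "$1" "$2" | sort
}

count_keys "$workdir/keywords.txt" "$workdir/data.txt"
```

Because the END block prints the stored key rather than the keyword text, the output lists h1, h2 and h3. The trailing `sort` is there only to make the order deterministic, since `for (k in cnt)` iterates in an unspecified order.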
oliv
  • With your kind permission, I have added a solution based on yours, with a small piece of logic that prints the lines in the same order as they appear in Input_file. – RavinderSingh13 May 31 '18 at 07:08
  • @oliv, This is not giving the expected output. I want the count for each combination of A and B, where B is mapped through the keywords array, and to print A, keyword, count. – user2340345 May 31 '18 at 08:42
  • @user2340345 The change you made in your question has a totally different scope... Please find the updated answer. – oliv May 31 '18 at 09:02
  • @oliv Thanks for the updated answer. What if I have 25 keywords? Should I keep appending the values? – user2340345 May 31 '18 at 09:07
  • @user2340345 No! You'd better keep all these strings in a separate file and look them up with awk across both files. – oliv May 31 '18 at 09:12
  • @oliv The updated awk command works well, except that it prints the B strings instead of the keys of the array. I'm not very familiar with awk lookups, can you please share that command? – user2340345 May 31 '18 at 09:19
  • @user2340345 That's a different question and solution that deserves its own post. If you want to help other people, you'd better keep the scope of the question to its original goal so that all answers match the question. So first look at existing `awk` posts, and if you don't find what you want, please create a new post with detailed information about exactly what you want. – oliv May 31 '18 at 09:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/172159/discussion-between-user2340345-and-oliv). – user2340345 May 31 '18 at 11:20
1

Building on oliv's nice answer, this adds a small piece of logic so that the output follows the same order as the lines in Input_file.

awk -F' *[AB]: *' '
!b[$2","$3]++{
  c[++count]=$2","$3}
{
  a[$2","$3]++;
  next
}
END{
  print "A","B","Count";
  for(i=1;i<=count;i++){
    print c[i],a[c[i]]}
}' OFS=, Input_file | column -t -s','
RavinderSingh13
0

bash

Since associative arrays are inherently unordered, if you need to do the comparisons in a particular order (e.g. "Hello World" should match before "B:Hello") then you need another array to hold the ordering of the keys.

#!/bin/bash
declare -A keywords=( ["Hello World"]="h1" ["B:Hello"]="h2" ["Hi there"]="h3" )
ordered_keys=( "Hello World" "B:Hello" "Hi there" )
declare -A count

# assume a space between "A:" and "abc.com"
while read -r labelA a b; do
    for key in "${ordered_keys[@]}"; do
        if [[ $b == *"$key"* ]]; then
            let count["$a ${keywords[$key]}"]++
            break
        fi
    done
done <<DATA
A: abc.com B:Hello World Ann
A: abc.com B:Hi there friend
A: def.com B:Hello Sam
A: def.com B:Hello Dan
A: abc.com B:Hello World Mary
DATA

{
    echo "A Key Count"
    for key in "${!count[@]}"; do
        echo $key ${count[$key]}
    done
} | column -t

outputs

A        Key  Count
abc.com  h3   1
abc.com  h1   2
def.com  h2   2

Take care not to do this:

produce_the_data | while read ...; do count[x]=y; ...; done

Because that will run the while loop in a subshell, and the count array will not exist when the loop finishes.
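The effect is easy to reproduce. This is a minimal sketch, assuming bash with its default settings (lastpipe off); `produce_the_data` here is a hypothetical stand-in that prints three lines.

```shell
#!/bin/bash
# Hypothetical stand-in for the real data source.
produce_the_data() { printf '%s\n' one two three; }

lines=0
produce_the_data | while read -r line; do
    lines=$((lines + 1))      # increments a copy inside the subshell
done
after_pipe=$lines             # the parent shell never saw the increments

lines=0
while read -r line; do
    lines=$((lines + 1))      # runs in the current shell
done <<EOF
$(produce_the_data)
EOF
after_heredoc=$lines

echo "pipe: $after_pipe, here-document: $after_heredoc"
```

The loop fed by the pipe leaves the counter at 0, while the loop fed by a here-document (as in the script above) keeps its 3 increments.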

There are ways to work around this:

  1. use temp files (or a FIFO)

    tmpfile=$(mktemp)
    produce_the_data >"$tmpfile"
    while read ...; do count[x]=y; ...; done <"$tmpfile"
    
  2. set the lastpipe shell option

    set +m
    shopt -s lastpipe
    produce_the_data | while read ...; do count[x]=y; ...; done
    
  3. use a process substitution:

    while read ...; do count[x]=y; ...; done < <(produce_the_data)
    # .......................................^.^^................^
    #                                        | |
    # typical redirection -------------------+ |            
    # process substitution acts like a file ---+
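As a concrete illustration of the process substitution form, here is a minimal sketch that counts records per domain. It assumes bash 4 or later for associative arrays, and `produce_the_data` is a hypothetical producer that stands in for any command printing the records.

```shell
#!/bin/bash
# Hypothetical producer; stands in for any command that prints the records.
produce_the_data() {
    printf '%s\n' \
        'A: abc.com B:Hello World Ann' \
        'A: def.com B:Hello Sam' \
        'A: def.com B:Hello Dan'
}

declare -A count
while read -r labelA a b; do
    # The loop runs in the current shell, so the array survives it.
    count[$a]=$(( ${count[$a]:-0} + 1 ))
done < <(produce_the_data)

for a in "${!count[@]}"; do
    echo "$a ${count[$a]}"
done | sort
```

After the loop, `count` holds 1 for abc.com and 2 for def.com, because the `while` loop was never placed on the right-hand side of a pipe.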
    
glenn jackman
  • I'm getting this error: `line 2: Hello World: syntax error in expression (error token is "World") `. Any string that contains a space gives this error. I do not need the order to be maintained, so how can I do it with just `${keywords[@]}`? And what does labelA in the while loop mean? You can also assume a space after B:, as in `B: hello`. – user2340345 Jun 01 '18 at 07:11
  • `read -r labelA a b` will put the first word in the variable "labelA", the second word in "a" and all the rest into "b". So it doesn't matter if there's a space after "B:". Except that the pattern "B:Hello" may no longer match. – glenn jackman Jun 01 '18 at 14:34
  • Regarding line 2, my code is correct. Did you type it exactly? – glenn jackman Jun 01 '18 at 14:47
  • Yes, exactly! Can you please have a look at the input? I just updated it. – user2340345 Jun 01 '18 at 18:07