How to get the unique elements of the first column and store them in an array?

Question

I want to extract the first column between these two lines (%BLOCK positions_frac and %ENDBLOCK positions_frac) in "file1".

%BLOCK positions_frac
Si        0.5303000000000000  0.0000000000000000  0.3333000000000000
Si        0.0000000000000000  0.5303000000000000  0.6666299999999999
Si        0.4697000000000000  0.4697000000000000  0.9999700000000000
O         0.1462000000000000  0.4142000000000000  0.8810000000000000
O         0.7320000000000000  0.5858000000000000  0.7856700000000000
O         0.5858000000000000  0.7320000000000000  0.2143300000000000
O         0.2680000000000000  0.8538000000000000  0.5476700000000000
O         0.4142000000000000  0.1462000000000000  0.1190000000000000
O         0.8538000000000000  0.2680000000000000  0.4523300000000000
%ENDBLOCK positions_frac

I can get that using:

awk '/%BLOCK\ positions_frac/{flag=1;next}/%ENDBLOCK\ positions_frac/{flag=0}flag' file1

Then I want to store the first column in an array, keeping only the distinct values.

expected output:

array= ["Si", "O"]

So: 1. filter the first column, 2. sort and deduplicate with `sort -u`, and 3. store the result in an array. – KamilCuk, Jul 15 '19 at 13:08
See: [How do I assign the output of a command into an array?](https://stackoverflow.com/questions/9449417). Combine that with `sort -u` and you are off to a good start. – kvantour, Jul 15 '19 at 13:18
OK, so I guess it is something like this: `awk '/%BLOCK\ positions_frac/{flag=1;next}/%ENDBLOCK\ positions_frac/{flag=0}flag {print $1}' file1 | sort -u`, but I need some help storing it in an array. – Caterina, Jul 15 '19 at 13:19
It's not a duplicate, they're using grep. I'm still not sure how to store what I found with awk in an array. – Caterina, Jul 15 '19 at 13:21
@Caterina it is a duplicate. Your problem is "How do I assign the output of a command into an array". The command is known: `awk '...' | sort -u`. The example in the duplicate uses `grep whatever` as the command. – kvantour, Jul 15 '19 at 13:27
Yes, but I didn't know about the `sort -u` command, and that question does not include it. Without asking here I wouldn't have been able to figure it out. – Caterina, Jul 15 '19 at 13:29
Having asked a question which is considered a duplicate is not something to be ashamed of. The question you asked is actually a double question. Question 1: how do I sort an array. Question 2: how do I put the output of a command in an array. There are thousands of ways this can be answered, and there are a lot of similar questions around. We have answered your first question in a comment, and the second by pointing you to the source where you could find a possible solution. Your question is still good and should stay for other users of this forum to find help. – kvantour, Jul 15 '19 at 13:41

Ed Morton · Accepted Answer · 2019-07-15T13:57:16.840

3

This is how to write the awk part (squeeze it all back onto 1 line if you like):

$ awk '
    /%ENDBLOCK positions_frac/ { inBlock=0 }
    inBlock && !seen[$1]++     { print $1 }
    /%BLOCK positions_frac/    { inBlock=1 }
' file
Si
O

then it's just this to save the output in a shell array:

arr=( $(awk '...' ) )
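A possible alternative for the array step, offered as a sketch rather than as part of the original answer: in bash 4 or later, `mapfile -t` reads one array element per output line, which avoids the word splitting and glob expansion that an unquoted `$(...)` goes through. The temporary file below is a shortened, made-up stand-in for the question's `file1`.

```shell
#!/usr/bin/env bash
# Shortened sample input in the same layout as the question's file (assumption).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
%BLOCK positions_frac
Si        0.5303000000000000  0.0000000000000000  0.3333000000000000
Si        0.0000000000000000  0.5303000000000000  0.6666299999999999
O         0.1462000000000000  0.4142000000000000  0.8810000000000000
O         0.7320000000000000  0.5858000000000000  0.7856700000000000
%ENDBLOCK positions_frac
EOF

# mapfile -t stores each output line as one element, without the trailing newline.
mapfile -t arr < <(awk '
    /%ENDBLOCK positions_frac/ { inBlock=0 }
    inBlock && !seen[$1]++     { print $1 }
    /%BLOCK positions_frac/    { inBlock=1 }
' "$tmp")
rm -f "$tmp"

# Print the element count, then each element on its own line.
printf '%s\n' "${#arr[@]}" "${arr[@]}"
```

For this sample the script prints 2, then Si, then O. For single-word values like these, `arr=( $(...) )` gives the same result; the difference only matters if a value could contain whitespace or glob characters.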

edited Jul 15 '19 at 13:57

answered Jul 15 '19 at 13:33


Can you explain this part to me: `inBlock && !seen[$1]++`? I am still new to bash, so there are some things I don't get yet. – Caterina Jul 15 '19 at 13:36
That has absolutely nothing to do with bash or any other shell, it's part of an awk script. Whenever you see an array named `seen[]` it is (or should be!) being used idiomatically to identify unique values. Initially `seen[foo]` for any value of `foo` is zero-or-null so `seen[foo]++` is also zero-or-null, but that post-increment means that next time you test `seen[foo]` it has the value 1. In that way you can tell if a value is being seen for the first time or not. So that line of my code just says "if you're in the target block and it's the first time this $1 has been seen then print it". – Ed Morton Jul 15 '19 at 13:52
In case it helps: `awk '!seen[$0]++'` is equivalent to `uniq` for sorted input but will also print unique values even if the input is unsorted. Run these commands to see the behavior of each and the similarities/differences between them: 1) `printf 'a\na\nb\n' | awk '!seen[$0]++'` 2) `printf 'a\na\nb\n' | uniq` 3) `printf 'a\nb\na\n' | awk '!seen[$0]++'` 4) `printf 'a\nb\na\n' | uniq`. Note the output of "4" vs the first 3. – Ed Morton Jul 15 '19 at 13:59
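To make the `seen[]` explanation above concrete, here is a small sketch that spells the idiom out as an explicit `if` and compares it with the short form. The variable names `short` and `long` are only for illustration.

```shell
# Idiomatic form: the pattern is true only the first time a line is seen.
short=$(printf 'a\nb\na\n' | awk '!seen[$0]++')

# Spelled-out form: an unset seen[$0] compares equal to 0, so the line is
# printed on its first appearance; the counter is then incremented.
long=$(printf 'a\nb\na\n' | awk '{ if (seen[$0] == 0) print $0; seen[$0]++ }')

printf '%s\n' "$short" "$long"
```

Both commands print a then b, so the script outputs a, b, a, b.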

Caterina · Answer 2 · 2019-07-15T13:32:22.250

1

So this solved it:

arr=($( awk '/%BLOCK\ positions_frac/{flag=1;next}/%ENDBLOCK\ positions_frac/{flag=0}flag {print $1}' file1 |sort -u))

Thanks for the suggestions. I realized I just had to use pipelines.
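One thing to be aware of with this approach, shown as a small sketch on made-up input: `sort -u` returns the values in sorted order, not in the order they appear in the file, so the array comes out as O, Si rather than the Si, O listed in the question. If order of first appearance matters, awk's `!seen[$0]++` keeps it.

```shell
# sort -u sorts before deduplicating, so "O" comes before "Si".
sorted=$(printf 'Si\nSi\nO\nO\n' | LC_ALL=C sort -u | tr '\n' ' ')

# awk keeps the order in which each value first appears.
first_seen=$(printf 'Si\nSi\nO\nO\n' | awk '!seen[$0]++' | tr '\n' ' ')

echo "sort -u: $sorted"
echo "awk:     $first_seen"
```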

edited Jul 15 '19 at 13:32

answered Jul 15 '19 at 13:26


Why are you doing `sed 's/:.*//'` when the output doesn't contain `:`s? You don't need `sort` btw; awk can print unique values just fine. There's also no reason to escape a blank char; it's not special in any way. – Ed Morton Jul 15 '19 at 13:29
