3

I have a Unix file, Err_Call_sipregtracking.csv, as follows:

colnum~filename~date~fieldnum~name~value
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~1~BDA_CA_Code~1
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~2~ARN_Code~2
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~544~ALL~0
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~544~ALL~0

Here the delimiter is ~. I want the distinct values of the name column in a variable.

My required output is:

'BDA_CA_Code','ARN_Code','ALL'

Please help me to achieve this. I tried:

cat Err_Call_sipregtracking.csv | awk -F'~' '{print $5}' | uniq

Output is:

name
BDA_CA_Code
ARN_Code
ALL

But I don't want the header in the result, and I also want the values in quotes and comma-separated.

fedorqui

6 Answers

5

The key here is to store the values in an array, so you then print all the elements:

$ awk -F'~' 'NR>1{item[$5]} END {for (i in item) print i}' file
ARN_Code
BDA_CA_Code
ALL

Note the usage of NR>1 to skip the header.

Then, you can print the elements wrapped with single quotes with printf "\047%s\047\n", i, since print "\047hello\047" prints 'hello':

$ awk -F'~' 'NR>1{item[$5]} END {for (i in item) printf "\047%s\047\n", i}' file
'ARN_Code'
'BDA_CA_Code'
'ALL'

To join these into a comma-separated list of items, just print a comma before every item starting from the second one (credits to Ed Morton):

for (i in item) printf "%s\047%s\047", (++c>1 ? "," : ""), i
print ""

See it in action:

$ awk -F'~' 'NR>1{item[$5]} END {for (i in item) printf "%s\047%s\047", (++c>1 ? "," : ""), i; print ""}' file
'ARN_Code','BDA_CA_Code','ALL'
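Since the original goal was to capture the distinct names in a shell variable, the one-liner can be wrapped in command substitution. A minimal sketch, using a hypothetical sample.csv that mirrors the question's data; note that `for (i in item)` visits elements in an unspecified order:

```shell
# Build a sample file with the question's layout (hypothetical name: sample.csv)
cat > sample.csv <<'EOF'
colnum~filename~date~fieldnum~name~value
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~1~BDA_CA_Code~1
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~2~ARN_Code~2
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~544~ALL~0
15~YYYYMMDD_BDACA_SELFRELATIVE_ARN~30MAR2016:00:00:00~544~ALL~0
EOF

# Capture the quoted, comma-separated list in a shell variable
names=$(awk -F'~' 'NR>1{item[$5]} END {for (i in item) printf "%s\047%s\047", (++c>1 ? "," : ""), i; print ""}' sample.csv)
echo "$names"   # three quoted names, comma-separated; element order may vary
```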
Jonathan Leffler
fedorqui
  • @sjsam how? I cannot see it. – fedorqui Jul 29 '16 at 10:03
  • You've probably gone through my answer, but I feel the single for-loop is more readable here, and time-saving. And a ++ for addressing the issue – sjsam Jul 29 '16 at 10:12
  • @sjsam ah right, now I understand: instead of storing all the items in an array, you keep printing them as soon as you see them. This is also interesting, yes. – fedorqui Jul 29 '16 at 10:14
  • @sjsam ... It works like a charm. Thanks a lot. I tried it without a loop as ........ sort | uniq | xargs | sed -e 's/ /","/g' | sed "s/\"/'/g" | sed -e "s/.*/'&'/") ...... Can you please suggest if this is faster than looping or not – Pavani Srujana Jul 29 '16 at 10:55
  • @fedorqui btw, I may be wrong with my previous assessment that `{item[$5]}` in your case is better than `$5 in field` which I used; both have the same complexity for `n` lines, haven't they? In your case the index can be added only after checking if it already exists. – sjsam Jul 29 '16 at 10:59
  • @PavaniSrujana : You mean to thank @fedorqui. :) Both approaches do the same thing. And if I understand correctly both have the same complexity. But I prefer this solution to my (more idiomatic, or is it?) approach as it easily conveys the idea. – sjsam Jul 29 '16 at 11:05
  • @sjsam well, this is quite relative: my approach focuses on fetching the data and then printing it later on; yours does all in one shot. Also mine uses some tweaks to prevent using Bash later on to strip the trailing comma. But, as they say, everyone to his own taste : ) – fedorqui Jul 29 '16 at 11:44
  • @fedorqui -- Thanks a lot – Pavani Srujana Jul 29 '16 at 12:24
  • That's a very reasonable approach with the caveats being memory usage and order of the output, but FWIW I'd write the END section as `awk -F'~' 'NR>1{item[$5]} END{for (i in item) printf "%s\047%s\047", (++c>1 ? "," : ""), i; print ""}' file`. IMHO `print "'\''" i "'\''"` is clearer written as `print "\047" i "\047"` btw. – Ed Morton Jul 29 '16 at 13:57
  • @EdMorton thanks for the explanations. Yes, `print "'\''"` seems like too much. [I see](http://stackoverflow.com/a/12845216/1983854) that also `\x27` does it. – fedorqui Aug 01 '16 at 08:19
  • You should always use octal, not hex, escape codes - see http://awk.freeshell.org/PrintASingleQuote. – Ed Morton Aug 01 '16 at 13:00
  • @EdMorton immediately favorited that page. The `awk 'BEGIN{print "\x27foo!\x27"}'` was clear enough : ) – fedorqui Aug 01 '16 at 13:07
3

awk is your friend:

$ var=$(awk  -v FS="~" 'NR>1 && !($5 in field){printf "\047%s\047,",$5;field[$5]}' Err_Call_sipregtracking.csv)
$ var="${var%,}" #Stripping the trailing comma
$ echo "$var"
'BDA_CA_Code','ARN_Code','ALL'

Jonathan Leffler
sjsam
  • I like how you use `-vq="'"` to print those single quotes. It looks easier to read. – fedorqui Jul 29 '16 at 10:03
  • @fedorqui -- I am new to Unix. This worked fine but I could not understand how it is achieved. Could you please explain what is inside that printf statement – Pavani Srujana Jul 29 '16 at 12:28
  • 1) Not leaving a space between `-v` and the variable name makes the script unnecessarily gawk-specific. 2) The idiomatic way to test for uniqueness is an array named/populated as `!seen[$5]++`. 3) Without a terminating newline the output is not text per POSIX and so invites undefined behavior from any tool parsing it afterwards. 4) Don't add things then take them away again (e.g. the comma) as it's error prone. 5) You don't need to do shell operations to change awk output, just keep it in awk. 6) To get a single quote in an awk script use the octal `\047` - far less headaches than a variable. – Ed Morton Jul 29 '16 at 13:47
  • @fedorqui use of a variable for `'` makes it harder to write your scripts in general. For example to find `'foo.bar'` would be `$0 ~ (q "foo\\.bar" q)` vs `/\047foo\.bar\047/`. Note the necessary extra escape in the first one plus the need to explicitly prepend with `$0 ~`, and it's using string concatenation, which is slow. – Ed Morton Jul 29 '16 at 14:03
  • @EdMorton : Thank you, and enlightened! Regarding point 3, did you mean the terminating null character? I may do `END{printf ""}` then? – sjsam Jul 29 '16 at 14:10
  • You're welcome. Not a null character (`\0`), a newline (`\n` or, much less commonly for UNIX applications, `\r\n`). `printf ""` would produce neither of those, but `print ""` will produce the appropriate newline (as set in `ORS`). I just make the point that a newline might be `\n` or `\r\n` to show why you should use `print ""` (which uses the current/appropriate `ORS` setting) rather than hard-coding what you THINK a newline should be with `printf "\n"` in case you were considering doing that. – Ed Morton Jul 29 '16 at 14:17
  • @EdMorton : Is this still necessary, if you're storing the result to a variable like I did? Also, if time permits, pls give me a link to the posix article on this. – sjsam Jul 29 '16 at 14:23
  • To be honest, idk if a POSIX shell is required to be able to handle setting a variable from input that doesn't contain a terminating newline. I suspect not, since per POSIX a line is "A sequence of zero or more non-newline characters **plus a terminating newline character**," but idk. The POSIX article is just the spec; see the discussion at http://stackoverflow.com/questions/729692/why-should-text-files-end-with-a-newline. – Ed Morton Jul 29 '16 at 14:35
3
$ awk -F'~' 'NR>1 && !seen[$5]++{printf "%s\047%s\047", (NR>2 ? "," : ""), $5} END{print ""}' file
'BDA_CA_Code','ARN_Code','ALL'
Ed Morton
2

This is probably not very optimized but works:

tail -n+2 Newfile.csv | awk -F'~' '{$5="\""$5"\""; print $5}' | uniq | tr '\n' ',' | sed 's/\,$/\n/'

If you want single quotes instead:

tail -n+2 Newfile.csv | awk -F'~' '{a = "'"'"'"; print a $5 a}' | uniq | tr '\n' ',' | sed 's/\,$/\n/'

Explanation:

  • tail -n+2 Newfile.csv omits the first line
  • awk -F'~' '{$5="\""$5"\""; print $5}' extracts the 5th column and surrounds it with double quotes (for the single-quote variant, note how convoluted the quote printing gets; there may be a way around this)
  • uniq removes duplicates
  • tr '\n' ',' replaces newlines with commas
  • sed 's/\,$/\n/' removes the final comma and replaces it with a newline (for output readability)
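The convoluted quote nesting can be sidestepped with the octal escape `\047`, which awk prints as a single quote (as mentioned in comments elsewhere on this page). A sketch of the same pipeline with that change, run against a hypothetical sample.csv in the question's format:

```shell
# Hypothetical sample input matching the question's format
cat > sample.csv <<'EOF'
colnum~filename~date~fieldnum~name~value
15~f~30MAR2016:00:00:00~1~BDA_CA_Code~1
15~f~30MAR2016:00:00:00~2~ARN_Code~2
15~f~30MAR2016:00:00:00~544~ALL~0
15~f~30MAR2016:00:00:00~544~ALL~0
EOF

# \047 is a single quote, so no shell quote gymnastics are needed;
# uniq collapses the adjacent duplicate 'ALL' lines
tail -n+2 sample.csv | awk -F'~' '{print "\047" $5 "\047"}' | uniq | tr '\n' ',' | sed 's/,$/\n/'
# 'BDA_CA_Code','ARN_Code','ALL'
```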
pie3636
  • it is interesting; however, note that `awk` can handle much of this internally. In general, piping so many commands is not considered good practice since it involves more CPU time. – fedorqui Jul 29 '16 at 09:18
  • That is true; I have found, however, that piping is easier to explain step by step; besides, those commands extend far outside the range of `awk` and text processing, and for most usages, the CPU shouldn't be much of an issue here. That said, I understand your point. – pie3636 Jul 29 '16 at 09:26
  • Yes, I guess it is a matter of balance and, in fact, the good thing about small commands is _each doing just one thing_. My current answer is a bit over-complicated since I wanted to just use `awk`. However, in your answer the first pipe, for example, could be removed by a simple `NR>1`, as well as `uniq` by placing items in an array. – fedorqui Jul 29 '16 at 09:29
  • The statement `those commands extend far outside the range of awk and text processing` is wrong. The commands needed for this are perfectly mundane for text processing and used regularly in awk. – Ed Morton Jul 29 '16 at 13:50
0

You can skip the first line with sed 1d, get the 5th field with cut, and use printf for formatting the unique sorted results:

printf "'%s'\n" $(sed 1d Err_Call_sipregtracking.csv | cut -d~ -f5 | sort -u)

The above doesn't fulfill your request to get the result on a single line; this does:

printf "'%s'," $(sed 1d Err_Call_sipregtracking.csv | cut -d~ -f5 | sort -u)|sed 's/,$//'
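To see it end to end, a sketch against a hypothetical sample.csv; note that `sort -u` reorders the names alphabetically, and the unquoted command substitution assumes the values contain no whitespace or glob characters:

```shell
# Hypothetical sample input matching the question's format
cat > sample.csv <<'EOF'
colnum~filename~date~fieldnum~name~value
15~f~30MAR2016:00:00:00~1~BDA_CA_Code~1
15~f~30MAR2016:00:00:00~2~ARN_Code~2
15~f~30MAR2016:00:00:00~544~ALL~0
15~f~30MAR2016:00:00:00~544~ALL~0
EOF

# printf repeats its format once per argument; the final sed strips the trailing comma
printf "'%s'," $(sed 1d sample.csv | cut -d~ -f5 | sort -u) | sed 's/,$//'
# 'ALL','ARN_Code','BDA_CA_Code'
```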
Walter A
0

Your command is close; modify it a bit, like below:

cat Err_Call_sipregtracking.csv | awk -F'~' '{print $5}' | uniq | sed 1d | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
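For reference, a standalone sketch of the final sed stage (assuming GNU sed), which joins lines with commas via the hold space; note that, unlike the requested output, this pipeline leaves the names unquoted:

```shell
# H appends each input line to the hold space; on the last line ($),
# x swaps hold and pattern space, s/\n/,/g joins the lines with commas,
# s/^,// drops the leading comma left by the first H, and p prints
printf 'BDA_CA_Code\nARN_Code\nALL\n' | sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
# BDA_CA_Code,ARN_Code,ALL
```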

Neethu Lalitha