
I get an input file vendor.csv which has a column called retailer. I have a predefined list of valid retailer values, say a, b and c. If a value like 'd' comes in the retailer column I have to take some action: mostly echo it to a log, stop the processing and notify the user.

I have done the following so far

f1=/stage/Scripts/ecommerce/vendor/final*.csv
k=`cut -d, -f1 $f1 |sort -u`
echo $k

This gives me

a b c d

The above output is not comma-separated.

I can store the valid values a, b, c in a file or a string for the above case.

How do I make the check now? Is this the best way to go about it?

The valid values are "ALB/SFY Total Ecom TA", "Peapod Total Ecom TA" and "Target Total Ecom TA".

The existing data contains the following unique data points: "ALB/SFY Total Ecom TA", "Hy-Vee Total Ecom TA", "Peapod Total Ecom TA" and "Target Total Ecom TA".

So the "Hy-Vee Total Ecom TA" is an invalid value.

Here is my attempt with grep

$ echo $s
ALB/SFY Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

 echo $k
ALB/SFY Total Ecom TA Hy-Vee Total Ecom TA Peapod Total Ecom TA Target Total Ecom TA

grep -v "$s" "$k"

It gave me an error

grep: ALB/SFY Total Ecom TA
Hy-Vee Total Ecom TA
Peapod Total Ecom TA
Target Total Ecom TA: No such file or directory

Some of the solutions have pointed me in the right direction. In R I would go about the above task as

valid_values <- c("a", "b", "c")
invalid_retailer <- setdiff(unique(vendorfile$retailer), valid_values)

I was trying to replicate the same process in shell, and hence my usage of cut and grep.
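For what it's worth, the R logic above translates fairly directly to shell using comm, which compares two sorted inputs. This is only a sketch under assumptions: valid.txt (holding the valid retailers, one per line, sorted) and the `check_retailers` wrapper name are made up for illustration:

```shell
# Hypothetical helper: print retailers not in valid.txt and fail if any exist.
# Assumes valid.txt (one valid retailer per line, sorted) in the current dir.
check_retailers() {
    cut -d, -f1 "$@" | sort -u > observed.txt
    # comm -13 prints lines unique to the second file: the invalid retailers
    invalid=$(comm -13 valid.txt observed.txt)
    if [ -n "$invalid" ]; then
        printf 'Invalid retailer(s):\n%s\n' "$invalid"
        return 1
    fi
}
```

Called as `check_retailers final*.csv`, it prints nothing and succeeds when every retailer is valid, mirroring the R `setdiff` result being empty.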

av abhishiek
  • Your requirement to "take some action , mostly echo it and notify the user." is very unclear. Can you give more detail? And, also, post a small sample of your input? – amasmiller Jan 19 '18 at 13:17
  • Like [in your previous question](https://stackoverflow.com/questions/48339963/pass-wildcard-to-cut-command-in-shell-script-and-store-it-in-a-variable) you are confusing yourself by not quoting the values you `echo`. You will find that the output is actually newline-separated. – tripleee Jan 19 '18 at 13:24
  • The first argument to `grep` needs to be a regular expression. If the value `"$s"` actually contains newlines, you should be fine. But the second argument is a *file name,* not a string. To check a string, `printf "%s\n" "$k" | grep -v "$s"` or (if your shell is Bash) `grep -v "$s" <<<"$k"` – tripleee Jan 19 '18 at 13:42
  • (But you previously stated you are using `ksh`; did you change shells between questions? I don't think `ksh` has `<<<` here strings.) – tripleee Jan 19 '18 at 13:43
  • I know I have said this before, but you *really* should use double quotes around the values you `echo`. – tripleee Jan 19 '18 at 13:45
  • As you have only an approx 33% acceptance rate on the answers you have received, I am not encouraged to spend my time devising a solution to this problem. You also seem to have trouble accepting advice from people that clearly do know what they are doing. – shellter Jan 19 '18 at 16:34
  • @tripleee : Any modern installation of `ksh` has here strings. it was part of the `ksh93` improvements. Good luck. to all. – shellter Jan 19 '18 at 16:35
  • @shellter Thanks for the update, and sorry for being too lazy to just check that myself. – tripleee Jan 19 '18 at 16:50

3 Answers


Try the awk command, which is a better fit for this.

awk -F',' '{if (($1 == "a") || ($1 == "b") || ($1 == "c") || ($1 == "d")) print $0 }' /stage/Scripts/ecommerce/vendor/final*.csv

Another way:

We can put all the retailer ids in a separate file, one per line, say retailer.txt. The content of retailer.txt would be like

a
b

In order to print the lines whose first field (separated by ,) matches one of the retailer ids in retailer.txt, use the below command:

awk -F',' 'FNR==NR{$1=a[$1];next} ($1 in a)' retailer.txt final*.csv
Riyas Siddikk
  • @riyan the problem is that the file is huge. I am finding the unique retailers and validating that I am not getting any incorrect data points; your code will check each record, which might be slow – av abhishiek Jan 19 '18 at 13:40
  • Uh, no? It reads the valid labels into an associative array, so as long as you have enough memory to keep them all in core, this should be pretty much as fast as it gets (or rather, probably I/O bound rather than CPU bound). – tripleee Jan 19 '18 at 13:44
  • (But the logic is inverted; you might want to print the ones which *aren't* valid. This prints the valid ones and omits the ones which are invalid.) – tripleee Jan 19 '18 at 13:46
  • Yea, agree with @tripleee, but you can optimize the command further as below: `awk -F',' 'FNR==NR{$1=a[$1]} ($1 in a)' retailer.txt final*.csv`. Rather than going line by line, it will check the files as a whole. – Riyas Siddikk Jan 19 '18 at 13:47
  • Is `$1=a[$1]` a typo, though? It probably works sort of by accident, but what's the purpose of modifying and then immediately forgetting the first field on every line in the first file? – tripleee Jan 19 '18 at 13:48
  • @tripleee @abhishiek : If you want to check the negative case, i.e. find those that are invalid, just add a NOT (**!**) to the condition: `awk -F',' 'FNR==NR{$1=a[$1]} !($1 in a)' retailer.txt final*.csv` – Riyas Siddikk Jan 19 '18 at 13:50
  • @tripleee : Yea, got your point, it works sort of by accident. But it works for a retailer list which only has one column, right? – Riyas Siddikk Jan 19 '18 at 13:54
  • It works, but it's slightly inefficient, because Awk needs to re-parse the line when you assign something to a field. Just remove the `$1=` and you should be fine. – tripleee Jan 19 '18 at 14:12
  • The first variant could be refined to `$1 ~ /^(a|b|c|d)$/`; the `{ print $0 }` action is the default, so doesn't need to be specified explicitly at all. – tripleee Jan 19 '18 at 14:13

Something like this perhaps?

awk -F, 'NR==FNR { ++a[$1]; next }
    !a[$1] { print FILENAME ":" FNR ": Invalid label " $1 >>"/dev/stderr" }' valid.txt final*.csv

where valid.txt contains your valid labels, one per line.

The general pattern of awk 'NR==FNR { ++a[$1] }' is a common way to read the first of a set of files into an array in memory and then in the remainder of the script perform some sort of join (in the database sense) with fields in the other input files. Awk simply processes one line at a time so the other files can be arbitrarily large really. You do need to be able to store the data from the first file in memory, though.
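A toy illustration of that two-file pattern, with made-up file names and data:

```shell
# keys.txt holds the keys; data.csv has its key in the first field
printf 'a\nb\n' > keys.txt
printf 'a,1\nx,2\nb,3\n' > data.csv
# First file populates the array; for later files, a[$1] selects matches
awk -F, 'NR==FNR { ++a[$1]; next } a[$1]' keys.txt data.csv
# prints:
# a,1
# b,3
```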

The advantage over your basic cut+grep attempt is that we can print the entire input line rather than just tell you which labels are invalid and have you go back and manually find out which lines in which files actually contained the violation.

Tangentially, your grep attempt has a number of issues. Firstly, if you are dealing with anything more than toy data, you want to avoid storing your data in shell variables. Secondly, you probably want to tweak your options to tell grep that you want to match text literally (-F -- without this, a.c matches abc because the dot is a regex wildcard character, for example) and that you want matches to cover an entire line (-x -- without this, b matches abc because it is a substring).

cut -d, -f1 final*.csv | sort -u |
grep -vxFf valid.txt

The -f filename option says to read the patterns from a file, and without another file name, grep processes standard input (from the pipe, in this case).
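To tie this back to the original requirement (log, stop, notify), grep's exit status can drive the failure path. The setup lines here are illustrative; in real use final*.csv and valid.txt already exist:

```shell
# Illustrative data; in real use these files are already in place
printf 'a\nb\nc\n' > valid.txt
printf 'a,1\nb,2\nd,3\n' > final1.csv

# grep exits 0 when it prints at least one (invalid) label,
# so the if-branch is the failure path
if bad=$(cut -d, -f1 final*.csv | sort -u | grep -vxF -f valid.txt); then
    printf 'Invalid retailer(s) found:\n%s\n' "$bad"
    # here you would append to your log, notify the user, and stop
fi
```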

tripleee
  • I will try to replicate; please forgive my earlier incorrect ways of using variables. There are so many ways to express variables in shell that I just got confused; sometimes it feels deceptively easy and other times not so much. This is all because I am working on it after starting out in R, I guess – av abhishiek Jan 20 '18 at 07:10
  • Maybe see also https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable – tripleee Jan 20 '18 at 09:52

Can't grep do what you want?

If I understand correctly, calling grep on your CSV file with the right regex can print all lines with a wrong retailer. You need to choose a strong regex to prevent false positive matches, but I would need example input to help you...

Or, if a regex can't prevent false positives, you can use grep after the cut command, like this:

for bad_retailer in $(cut -d, -f1 $f1 | grep d) ; do echo $bad_retailer ; done

where d is the name of the bad retailer.

If you want to track more than one bad retailer, you can use grep -E "d|g|h", with d, g and h the names of the bad retailers.
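A minimal sketch of that idea; the retailer names and file contents are placeholders, and adding -x (not in the original suggestion) keeps a short name from matching as a substring of a longer one:

```shell
# Placeholder input; d and g are the known-bad retailer names
printf 'a,1\nd,2\ng,3\n' > vendor.csv
# -x anchors each alternative to the whole line, avoiding substring hits
cut -d, -f1 vendor.csv | grep -Ex 'd|g|h'
# prints:
# d
# g
```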

romaric crailox