1

I have a file script.sh:

script.sh

cd /folder
mv a.csv result.csv

The a.csv file will have lots of records(GBs) in form:

id,name
1,"platinum"
2,"joe"
1,"platinum"
...

What I want to do is create a file using a script called records.txt, which will have a total no of records and records with unique IDs.

records.txt

Total Records: 3
Unique Records: 2

Total excluding id and name.

I want to do this via script after mv. How can I do it via the script?

Avenger
  • 793
  • 11
  • 31
  • 1
    Please do add your efforts in form of code in your question, which is highly encouraged on SO. – RavinderSingh13 Dec 11 '20 at 13:29
  • 1
    Please add more details to your question. What exactly means "records with unique IDs"? In your example you have two identic lines `1,"platinum"`. Can the file have two lines with the same ID but different name fields? What is the expected result in this case? Example: two lines `3,"foo"` and `3,"bar"`. – Bodo Dec 11 '20 at 13:50

3 Answers3

1

Use bash sort with unique (why both?) to get the unique values, use wc -l to count those:

#!/bin/bash

total=$(tail -n +2 tst.csv | wc -l)
unique=$(tail -n +2 tst.csv | sort | uniq | wc -l)

echo "Total Records: ${total}"
echo "Unique Records: ${unique}"

Total Records: 2

Unique Records: 3


NOTE; using tail -n +2 to skip the first line of the CSV, since you don't want to count those.

0stone0
  • 34,288
  • 4
  • 39
  • 64
  • The problem here is that the file isn't necessarily sorted -- hence, you need to sort the lines before applying `uniq`. – costaparas Dec 11 '20 at 13:33
  • @costaparas Thanks for pointing that out, I've posted outdated attempt ;) Should be fine now :D – 0stone0 Dec 11 '20 at 13:36
0

Your script.sh could look like this:

cd /folder
mv a.csv result.csv

total=$((`wc -l < result.csv` - 1))
unique=$((`sort result.csv | uniq | wc -l` - 1))

cat > records.txt <<eof
Total Records: $total
Unique Records: $unique
eof

This just uses a simple pair of pipelines to count how many lines there are -- using wc.

Note, we subtract 1 in each case because of the header line.

Also, in the case of the unique count, we use sort followed by uniq to correctly remove the duplicates.

The counts are then exported to the records.txt file.

(Note I've used backticks here purely to avoid having too many parentheses, you can instead use $(...) command substitution syntax, but its not essential here since there's no nesting required).

costaparas
  • 5,047
  • 11
  • 16
  • 26
  • using `$(( numeric_producing_cmds))` mean you can use `$( any_cmd)` rather than ancient back-tics. So please, `$(($(cat result.csv | wc -l) - 1))`, and with fewer process invocations `$(( $( wc -l < result.csv) - 1))`. Good luck to all. – shellter Dec 11 '20 at 14:04
  • I disagree -- using `$(...)` inside `$((...))` makes it *less* readable, hence the use of back-ticks in this case. As for the 2nd point, I've updated in the answer to address the inefficiency. – costaparas Dec 11 '20 at 14:11
  • Well, maybe, `$(($(cat...` is harder to read. I think adding the spaces as in the 2nd example is easier to read. But of course, we're all entitied to our opinions. When you get to the point in scripting where you need to nest cmd-substitution, using back-tics will a challenge (not impossible), just much easier with `$( cmd1 $(cmd 2) )` . Good luck to all. – shellter Dec 11 '20 at 14:23
  • And, as a general answer, the original `#!/bin/sh` doesn't allow `$(( ))`, only `expr` and backtics, so this is won't help someone using a true `#!/bin/sh`, so I can't upvote this. – shellter Dec 11 '20 at 14:27
  • Of course, for nested commands, you'd go for nesting `$(..)`. But even in that case, using back-ticks on the inner-most nested command, makes it more readable. – costaparas Dec 11 '20 at 14:27
0

Awk alternative:

awk -F, '{ fil[$1]++ } END { for (i in fil) { tot++;if (fil[i] == 1) { utot++ } } print "Total Records: "tot;print "Unique Records: "utot }' results.csv > records.txt

Set the field separator to , and then set up an array with the id as the index and an incremented value. At the end, loop through the array and creating a running total (tot) and a unique total (utot) where the count in the array is one. Print both values.

Raman Sailopal
  • 12,320
  • 2
  • 11
  • 18