
I have a CSV file with columns A-O. 500k rows. In Bash I would like to loop through each column, get distinct values and output them to a file:

sort -k1 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f1 -d , | uniq > EMPLOYEEID.csv
sort -k2 -n -t, -o CROWN.csv CROWN.csv && cat CROWN.csv | cut -f2 -d , | uniq > SORTNAME.csv

This works, but to me it is very manual and not really scalable if there were, say, 100 columns.

The code sorts the file in place on the given column, then the specified column is passed to uniq to get the distinct values, which are output to a file.

NB: The first row has the header information.

The above code works, but I'm looking to streamline it somewhat.

pee2pee
  • `sort -k2 -n -t, -o CROWN.csv CROWN.csv` reading and writing the same file is risky and may result in data corruption, especially since sort doesn't seem to have some sort of in-place flag. – Sorin Sep 15 '19 at 09:58
  • Do any of the CSV fields contain an escaped `,`? – Sorin Sep 15 '19 at 09:59
  • No commas in the CSV fields – pee2pee Sep 15 '19 at 10:02
  • 1
    Please add sample input (no descriptions, no images, no links) and your desired output for that sample input to your question (no comment). – Cyrus Sep 15 '19 at 10:05
  • Overwriting the file with each `sort` is humongously inefficient. Just write to standard output and replace the `cat`s (which by themselves are [useless](/questions/11710552/useless-use-of-cat) anyway); a sketch of such a rewrite follows these comments. – tripleee Sep 15 '19 at 10:05
  • If you have enough memory, you can traverse the file just once. Awk doesn't support nested arrays so I guess you'll need Perl or Python. But basically fetch the column names from the first line, open a file handle for each (does your system allow 100 open file handles though?) and keep a hash of the already-seen values for each column. – tripleee Sep 15 '19 at 10:08
  • @tripleee I think you don't need nested arrays for this case. It sounds like a trivial task for awk but we can't be sure without seeing i/o examples – oguz ismail Sep 15 '19 at 10:19
  • 1
    @Cyrus I think that it's pretty obvious what is required, a sample will help, but it's not necessary – Sorin Sep 15 '19 at 10:35
  • 1
    @oguz Thanks for the suggestion; updated my answer with a simple attempt at doing all columns in a single pass. – tripleee Sep 15 '19 at 10:44
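As a rough sketch of what the last two comments suggest (keeping the question's file and output names), the first line of the question could be rewritten so that sort writes to the pipeline instead of back into CROWN.csv, and the cat disappears:

sort -t, -k1 -n CROWN.csv | cut -d, -f1 | uniq > EMPLOYEEID.csv

Since sort already groups identical values together, uniq can collapse the duplicates without the file ever being rewritten in place.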

2 Answers


Assuming headers can be used as file names for each column:

head -1 test.csv | \
tr "," "\n" | \
sed "s/ /_/g" | \
nl -ba -s$'\t' | \
while IFS=$'\t' read field name; do
  cut -f$field -d',' test.csv | \
  tail -n +2 | sort -u > "${name}.csv" ; 
done

Explanation:

  • head - reads the first line
  • tr - replaces the `,` with a newline
  • sed - replaces whitespace with _ for cleaner file names (tr would also work, and you could then combine it with the previous step, but use sed if you need more complex transforms)
  • nl - adds the field number
    • -ba - number all lines
    • -s$'\t' - sets the separator to a tab (not necessary, as it is the default, but included for clarity)
  • while - reads through the field number/name pairs
  • cut - selects the field
  • tail - removes the header line; not all tail implementations have this option, so you can replace it with sed
  • sort -u - sorts and removes duplicates
  • >"$name.csv" - saves in the appropriate file name

Note: this assumes that there are no `,` characters inside the fields; otherwise you will need to use a CSV parser.
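For example, with an invented three-row test.csv like

Employee ID,Sort Name
10,Smith
20,Jones
10,Smith

the loop above would write Employee_ID.csv containing

10
20

and Sort_Name.csv containing Jones and Smith, one per line (spaces in the headers become underscores, and the values are sorted and de-duplicated).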

Sorin

Doing all the columns in a single pass is much more efficient than rescanning the entire input file for each column.

awk -F , 'NR==1 { ncols = split($0, cols, /,/); next }   # remember the header names from the first row
    { for(i=1; i<=ncols; ++i)
        if (!seen[i ":" $i]++)                           # only the first occurrence of a value per column
            print $i >>cols[i] ".csv"}' CROWN.csv
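Two small things to keep in mind about this script: >> appends in awk, so any output files left over from an earlier run should be removed first (e.g. rm -f EMPLOYEEID.csv SORTNAME.csv, assuming those are your header names), and the values come out in first-occurrence order rather than sorted, unlike the sort -u approach above.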

If this is going to be part of a bigger task, maybe split the input file into several temporary files with fewer columns than the number of open file handles permitted on your system, rather than fix this script to handle an arbitrary number of columns.

You can inspect this system constant with ulimit -n; on some systems, you can increase it either by tweaking the system configuration or, in the worst case, by recompiling the kernel. (Your question doesn't identify your platform, but this should be easy enough to google.)
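A rough sketch of that splitting idea, with the column count and chunk size invented for illustration (the chunk just needs to stay comfortably below ulimit -n):

ncols=100    # assumed total number of columns
chunk=50     # assumed to be well below the open-files limit
for start in $(seq 1 "$chunk" "$ncols"); do
  end=$(( start + chunk - 1 ))
  # carve one group of columns into a temporary file, then run the
  # single-pass awk script above on that temporary file
  cut -d, -f"${start}-${end}" CROWN.csv > "cols_${start}_${end}.csv"
done

Since each chunk keeps its own header row, the awk script still names the output files after the right columns.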


Addendum: I created a quick and dirty timing comparison of these answers at https://ideone.com/dnFj41; I encourage you to fork it and experiment with different shapes of input data. With an input file of 100 columns and (probably) no duplication in the columns -- but only a few hundred rows -- I got the following results:

  • 0.001s Baseline test -- simply copy input file to an identical output file
  • 0.242s tripleee -- this single-pass AWK script
  • 0.561s Sorin -- multiple passes using simple shell script
  • 2.154s Mihir -- multiple passes using AWK

Unfortunately, Carmen's answer could not be tested, because I did not have permissions to install Text::CSV_XS on Ideone.

An earlier version of this answer contained a Python attempt, but I was too lazy to finish debugging it. It's still there in the edit history if you are curious.

tripleee
  • Care for a friendly competition? I don't have Python handy at the moment, but later in the evening I can benchmark it. 15 columns, 500k lines? Winner buys himself a beer ;) – Sorin Sep 15 '19 at 11:42
  • Results will differ depending on the number of columns and available memory (how wide are the columns? How much duplication?). But sure, I'm game. – tripleee Sep 15 '19 at 14:30