
I'd like to process either one row of a CSV file or the whole file. The variables are set by the header row, which may be in any order. There may be up to 12 columns, but only 3 or 4 variables are needed.

The source files might be in either format, and all I want from both is lastname and country. I know of many different ways and tools to do it if the columns were fixed and always in the same order. But they're not.

examplesource.csv:

firstname,lastname,country
Linus,Torvalds,Finland
Linus,van Pelt,USA

examplesource2.csv:

lastname,age,country
Torvalds,66,Finland
van Pelt,7,USA

I have cobbled together something from various Stack Overflow postings which looks a bit voodoo but seems fairly robust. I say "voodoo" because shellcheck complains that, for example, "firstname is referenced but not assigned". And yet it prints it.

#!/bin/bash

#set the field separator to newline
IFS=$'\n'
#split/transpose the first-line column titles to rows
COLUMNAMES=$(head -n1 examplesource.csv | tr ',' '\n')

#set an array and read the columns into it
columns=()
for line in $COLUMNAMES; do
    columns+=("$line")
done

#reset the field separator
IFS=","

#using -p here to debug in output
declare -ap columns

#read from line 2 onwards
sed 1d examplesource.csv | while read "${columns[@]}"; do
    echo "${firstname} ${lastname} is from ${country}" 
done
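
For the record, the loop output for examplesource.csv is:

Linus Torvalds is from Finland
Linus van Pelt is from USA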

In the case of looping through everything, it works perfectly for my needs and I can process within the "while read" loop. But to make it cleaner, I'd rather pass the current element(?) to an external function to process (not just echo).

And if I only wanted the array (current row) belonging to "Torvalds", I cannot find how to access that or even get its current index, e.g.: "if $wantedname && $lastname == $wantedname then call function with current row only, otherwise loop all rows and call function".

I know from reading "Multidimensional associative arrays in Bash" that there aren't multidimensional associative arrays in bash, and I've tried to understand arrays from https://opensource.com/article/18/5/you-dont-know-bash-intro-bash-arrays

Is it clear what I'm trying to achieve in a bash-only manner and does the question make sense?

Many thanks.

digitaltoast

4 Answers


GNU Awk has multidimensional arrays. It also has array sorting mechanisms, which I have not used here. Please comment if you are interested in pursuing this solution further. The following depends on consistent key names and line numbers across input files, but can handle an arbitrary number of fields and input files.

$ gawk -V |gawk NR==1

GNU Awk 4.1.4, API: 1.1 (GNU MPFR 3.1.5, GNU MP 6.1.2)

$ gawk -F, '
  FNR == 1 {for(f=1;f<=NF;f++) Key[f]=$f}
  FNR != 1 {for(f=1;f<=NF;f++) People[FNR][Key[f]]=$f}
  END {
    for(Person in People) {
      for(attribute in People[Person])
        output = output FS People[Person][attribute]
      print substr(output,2)
      output=""
    }
  }
' file*

66,Finland,Linus,Torvalds
7,USA,Linus,van Pelt
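
For a deterministic column order, the array sorting mentioned above can be switched on with PROCINFO["sorted_in"]. A tiny standalone sketch (not wired into the command above):

$ gawk 'BEGIN {
    a["lastname"]="Torvalds"; a["firstname"]="Linus"; a["country"]="Finland"
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate indices in string order
    for (k in a) print k, a[k]
  }'
country Finland
firstname Linus
lastname Torvalds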
vintnes
  • Thank you - I can see a use for this at some point. Right now, I cannot fit it into the solution I was after with my current level of understanding, but many thanks! One question - why the * after the filename? I tried it without and it did the same thing. – digitaltoast Apr 29 '19 at 11:54
  • `file*` is a "Glob". `examplesource*.csv` would match `examplesource.csv` and `examplesource2.csv` – vintnes Apr 29 '19 at 15:23

Let's shorten your script. Don't read the source twice (first with head, then with sed) - you can do it in one pass. Also, the whole array reading can be shortened to just IFS=',' COLUMNAMES=($(head -n1 source.csv)). Here's a shorter version:

#!/bin/bash

cat examplesource.csv |
{
    IFS=',' read -r -a columnnames
    while IFS=',' read -r "${columnnames[@]}"; do
        echo "${firstname} ${lastname} is from ${country}" 
    done
}
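
For the example file this prints "Linus Torvalds is from Finland" and "Linus van Pelt is from USA". And since you'd rather hand the current row to a function than just echo it, the loop body can simply call one. A minimal sketch along the same lines (process_row is just an illustrative name, not an existing helper):

#!/bin/bash

process_row() {
    # handler for one row; the fields arrive as positional parameters
    printf '%s %s is from %s\n' "$1" "$2" "$3"
}

cat examplesource.csv |
{
    IFS=',' read -r -a columnnames
    while IFS=',' read -r "${columnnames[@]}"; do
        process_row "$firstname" "$lastname" "$country"
    done
}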

If you want to parse both files at the same time, i.e. join them, nothing simpler ;). First, let's number the lines in the first file using nl -w1 -s,. Then we use join to join the files on the people's last names. Remember that join input needs to be sorted on the join fields. Then we sort the output with sort using the line numbers from the first file. After that we can read all the data just as before:
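
For reference, the numbering step alone produces this (given the example file above):

$ nl -w1 -s, examplesource.csv
1,firstname,lastname,country
2,Linus,Torvalds,Finland
3,Linus,van Pelt,USA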

# join the files, using `,` as the separator
# on the 3rd field from the first file and the first field from the second file
# the output should be first the fields from the first file, then the second file
# the country (field 1.4) is duplicated in 2.3, so just omitting it.
join -t, -13 -21 -o 1.1,1.2,1.3,2.2,2.3 <(
    # number the lines in the first file
    <examplesource.csv  nl -w1 -s, | 
    # there is one field more, sort using the 3rd field
    sort -t, -k3
) <(
    # sort the second file using the first field
    <examplesource2.csv sort -t, -k1
) |
# sort the output using the numbers from the first file
sort -t, -k1 -n |
# well, remove the numbers
cut -d, -f2- |
# just a normal read follows
{
    # read the headers
    IFS=, read -r -a names
    while IFS=, read -r "${names[@]}"; do
        # finally our output!
        echo "${firstname} ${lastname} is from ${country} and is so many ${age} years old!"
    done
}
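
With the two example files this prints:

Linus Torvalds is from Finland and is so many 66 years old!
Linus van Pelt is from USA and is so many 7 years old!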

Tested on tutorialspoint.

KamilCuk
  • Thank you Kamil. Perfect and elegant solution which I've adapted into my code (see my answer) and it's working great! Above and beyond the call of duty - thank you! – digitaltoast Apr 29 '19 at 11:58

A bash solution takes a bit more work than an awk solution, but if this is an exercise in what bash provides, bash gives you all you need: determine which column holds the lastname from the first line of input, then output that field from each of the remaining lines.

An easy approach is simply to read each line into a normal array, then loop over the elements of the first line to locate the column that "lastname" appears in, saving that column index in a variable. You can then read each of the remaining lines the same way and output the lastname field by printing the element at the saved index.

A short example would be:

#!/bin/bash

col=0       ## column count for lastname
cnt=0       ## line count
while IFS=',' read -a arr; do       ## read each line into array
    if [ "$cnt" -eq '0' ]; then     ## test if line-count is zero
        for ((i = 0; i < "${#arr[@]}"; i++)); do    ## loop for lastname
            [ "${arr[i]}" = 'lastname' ] &&         ## test for lastname
            { col=$i; break; }      ## if found, save column index and break loop
        done
    fi
    [ "$cnt" -gt '0' ] &&                   ## if not headder row
    echo "line $cnt lastname: ${arr[col]}"  ## output lastname variable
    ((cnt++))       ## increment linecount
done < "$1"

Example Use/Output

Using your two data files, the output would be:

$ bash readcsv.sh ex1.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt

$ bash readcsv.sh ex2.csv
line 1 lastname: Torvalds
line 2 lastname: van Pelt

A similar implementation using awk would be:

awk -F, -v col=1 '
    NR == 1 { 
        for (i = 1; i <= NF; i++)        # find the lastname column
            if ($i == "lastname") col = i
    }
    NR > 1 {
        print "lastname: ", $col 
    }
' ex1.csv

Example Use/Output

$ awk -F, -v col=1 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "lastname") col = i } NR > 1 { print "lastname: ", $col }' ex1.csv
lastname:  Torvalds
lastname:  van Pelt

(output is the same for either file)

David C. Rankin
  • Thank you David. This is a very good solution to finding the row number - I've adapted into my code (see my answer below) and it's working great! I really wanted to give you the "solution" tick; it was 50/50 between you and Kamil. – digitaltoast Apr 29 '19 at 11:56

Thank you all. I've taken a couple of bits from two answers.

I used the answer from David to find the number of the row, then I used the elegantly simple solution from Kamil to loop through what I need.

The result is exactly what I wanted. Thank you all.

    $ readexample.sh examplesource.csv "Torvalds"
    Everyone
    Linus Torvalds is from Finland
    Linus van Pelt is from USA

    now just Torvalds
    Linus Torvalds is from Finland

And this is the code - now that you know what I want it to do, if anyone can see any dangers or improvements, please let me know as I'm always learning. Thanks.

#!/bin/bash

FILENAME="$1"
WANTED="$2"

printDetails() {
    SINGLEROW="$1"
    [[ ! -z "$SINGLEROW" ]] && opt=("--expression" "1p" "--expression" "${SINGLEROW}p") || opt=("--expression" "1p" "--expression" "2,199p")

    sed -n "${opt[@]}" "$FILENAME" |
        {
            IFS=',' read -r -a columnnames
            while IFS=',' read -r "${columnnames[@]}"; do
                echo "${firstname} ${lastname} is from ${country}"
            done
        }
}

findRow() {
    col=0 ## column count for lastname
    cnt=0 ## line count
    while IFS=',' read -a arr; do ## read each line into array
        if [ "$cnt" -eq '0' ]; then ## test if line-count is zero
            for ((i = 0; i < "${#arr[@]}"; i++)); do ## loop for lastname
                [ "${arr[i]}" = 'lastname' ] && ## test for lastname
                    {
                        col=$i
                        break
                    } ## if found, save column index and break loop
            done
        fi
        [ "$cnt" -gt '0' ] && ## if not headder row
            if [ "${arr[col]}" == "$1" ]; then
                echo "$cnt" ## output lastname variable
            fi
        ((cnt++)) ## increment linecount
    done <"$FILENAME"
}

echo "Everyone"
printDetails

if [ ! -z "${WANTED}" ]; then
    echo -e "\nnow just ${WANTED}"
    row=$(findRow "${WANTED}")
    printDetails "$((row + 1))"
fi
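
One thing I may still tidy up: the "2,199p" range in printDetails caps the whole-file case at 199 rows. sed's $ address means "last line", so that branch could drop the hard limit, for example (untested sketch):

opt=("--expression" "1p" "--expression" '2,$p')    # $ = last line, no row cap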
digitaltoast