0

I have a CSV export that I need to map to new values to in order to then import into a different system. I am using ArangoDB to create this data migration mapping.

Below is the full script used:

#!/bin/bash

execute () {
    filepath=$1
    prefix=$2
    keyField=$3
    filename=`basename "${filename%.csv}"`
    collection="$prefix$filename"
    filepath="/data-migration/$filepath"

    # Check for "_key" column
    if ! xsv headers "$1" | grep -q _key
    # Add "_key" column using the keyfield provided
    then
        xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
        xsv cat columns "$1" "$1._key" > "$1.cat"
        mv "$1.cat" "$1"
        rm "$1._key"
    fi

    # Import CSV into Arango Collection
    docker exec arango arangoimp --collection "$collection" --type csv "$filepath" --server.password ''
}

# This single line runs the execute() above
execute 'myDirectory/myFile.csv' prefix_ OLD_ORG_ID__C

So far I've deduced the $keyField (OLD_ORG_ID__C) parameter passed to the execute() function, is used in the loop of the script. This looks for $keyField column and then migrates the values to a newly created _key column using the XSV toolkit.

OLD_ORG_ID__C | _key
  A123        ->  A123
  B123        ->  B123
              ->  ##    <-auto populate

Unfortunately not every row has a value for the OLD_ORG_ID__C column and as a result the _key for that row is also empty which then causes the import to Arango to fail.

Note: This _key field is necessary for my AQL scripts to work properly

How can I rewrite the loop to auto-index the blank values?

then
    xsv select $keyField "$1" | sed -e "1s/$keyField/_key/" > "$1._key"
    xsv cat columns "$1" "$1._key" > "$1.cat"
    mv "$1.cat" "$1"
    rm "$1._key"
fi

Is there a better way to solve this issue? Perhaps xsv sort by the keyField and then auto populate the from the blank rows to the end?

UPDATE: Per the comments/answer I tried something along these lines but so far still not working

#!/bin/bash

execute () {
    filepath=$1
    prefix=$2
    keyField=$3
    filename=`basename "${filename%.csv}"`
    collection="$prefix$filename"
    filepath="/data-migration/$filepath"

    # Check for "_key" column
    if ! xsv headers "$1" | grep -q _key
    # Add "_key" column using the keyfield provided
    then

        awk -F, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "'$keyField'") field=i; print; next }
    $field == "" { $field = "_generated_" ++n }1' $1 > $1-test.csv

    fi

}
# import a single collection if needed
execute 'agas/Account.csv' agas_ OLD_ORG_ID__C

This creates a Account-test.csv file but unfortunately it does not have the "_key" column or and changes to the OLD_ORG_ID__C values. Preferably I would only want to see the "_key" values populated with auto-numbered values when OLD_ORG_ID__C is blank, otherwise they should copy the provided value.

Cody Gray - on strike
  • 239,200
  • 50
  • 490
  • 574
Xtremefaith
  • 907
  • 1
  • 9
  • 29
  • The [`cat` of a here document is poor style](https://stackoverflow.com/questions/3005963/how-can-i-have-a-newline-in-a-string-in-sh) but other than that, without more information about your input data and the error you are trying to debug, I don't think anybody here can help. Try to make the input data well-formed in the first place. – tripleee Nov 27 '18 at 07:52
  • @tripleee I updated the question to helpfully provide more clarity about my question now that've I've figured out some of it I think I've narrowed it down a bit. – Xtremefaith Nov 27 '18 at 08:14
  • Without seeing the actual offending document or any of your data, we are still in the dark. Could you try to pare this down to a [mcve] using some simple mock data? – tripleee Nov 27 '18 at 08:15
  • I've simplified my question as far as I can understand, I don't typically write in bash and I'm not super familiar with ArangoDB or XSV, these were implemented by a previous developer. – Xtremefaith Nov 27 '18 at 08:39
  • If your question is "how can I put a unique value in the first column if it is empty" try something like `awk -F '\t' '$1 == "" { $1 = "_generated_" ++i }1' file >newfile` (assuming tab-separated TSV). – tripleee Nov 27 '18 at 08:57
  • Awk doesn't have a variable named `OLD_ORG_ID__C` so the script can't use that. You want literally `$1` to refer to the first field – tripleee Nov 27 '18 at 12:36
  • Well the column is going to change depending on the file being run at the time so I need to get the column number by the `$keyField` parameter that I pass. I found this (`awk -F, '{ for (i=1;i<=NF;i++) if ($i == "OLD_ORG_ID__C") print i }' Account.csv`) and it works so I think I need to pipe this into what you have suggested (not sure how to yet) – Xtremefaith Nov 27 '18 at 18:19
  • No need to pipe anything; just add that processing to extract the field index on the first line; i.e. `awk -F, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "OLD_ORG_ID__C") field=i ; next } ...'` and now you can use `$field` to refer to this field when processing subsequent lines. – tripleee Nov 27 '18 at 18:53

1 Answers1

0

If your question is "how can I find from the first header line of a CSV file which field is named OLD_ORG_ID__C, then on subsequent lines put a unique value in this column if it is empty" try something like

awk -F, 'NR==1 { for(i=1; i<=NF;++i) if ($i == "OLD_ORG_ID__C") field=i ; print; next }
    $field == "" { $field = "_generated_" ++n }1' file >newfile

This has no provision for coping with complexities like quoted fields with embedded commas. (I have no idea what xsv is but maybe it would be better equipped for such scenarios?)

If I can guess what this code does

xsv select $keyField "$1" |
sed -e "1s/$keyField/_key/" > "$1._key"

then probably you could replace it with something like

xsv select "$keyField" "$1" |
awk -v field="$keyField" 'NR==1 { $0 = field }
    /^$/ { $0 = NR } 1' >"$1._key"

to replace the first line with the value of $keyField and replace any subsequent empty lines with their line number.

tripleee
  • 175,061
  • 34
  • 275
  • 318
  • You are really close, I modified the header like so: `{ $0 = "_key" }` the only remaining issue is that the subsequent empty lines are not being updated with their line number – Xtremefaith Nov 28 '18 at 20:28
  • I rolled back the "correction" becaue it didn't seem to match the information you have supplied in the question. It's still visible in the [edit history](/posts/53506480/revisions). – tripleee Nov 30 '18 at 02:51
  • Ok?... well what I posted worked, what you now have does not. `{ $0 = field }` is not right, header must be static value that is arango dependent "_key". And I don't understand `/^$/` but considering the empty values are populated with `""` when I reviewed the file I modified it to `/""/` and it worked – Xtremefaith Dec 02 '18 at 03:04
  • The prose part explains what speculations the code is based on. If you change the code, the answer is no longer consistent. You are more than welcome to post, and even accept, a separate answer where you point out where these speculations are wrong. – tripleee Dec 02 '18 at 09:55