
I would like to aggregate data from one CSV file into another using command-line tools such as bash, sed, awk, etc. Each row in each file begins with a unique ID in column 1; if a row in the destination matches this ID, that line should be replaced, otherwise the new line should be appended. The input data is not sorted, but the sort order of the result doesn't matter, so the files can be sorted as part of the script if it helps.

For example, given the current state of aggregate.csv:

1,olddata
3,olddata
2,olddata

And the file new.csv:

4,newdata
2,changeddata
3,changeddata

I would like aggregate.csv to come out as follows (in any sort order):

1,olddata
2,changeddata
3,changeddata
4,newdata

The lines may contain a large number of columns, so replacing one cell at a time is not ideal. The CSVs are guaranteed not to contain quoted newlines, so a line-by-line search that replaces a whole line at a time is a valid approach.

Apologies if this is a duplicate, but I can't find another question that uses exactly this CSV-merging approach.

I have tried adapting the answers to this question, but they required first generating a "template file" of all IDs by parsing both files line by line, sorting, removing duplicates, and saving - I hoped that a simpler method was possible.

This question has sed and awk answers that I also replicated; I managed the regex-replacement part, but could not find a method to append a new line to the file if a match did not exist.

Thanks!

Ian Renton
  • need some idea of the volumes of data being processed to determine if an (easy) in-memory solution will work or if we need to look at (more) complicated solutions to get around potential out-of-memory issues; how many lines in each file? got a ballpark figure on the max width of these lines? re: lines with *'a large number of columns'* ... are you looking to replace the entire line or just certain cells? will the source and target files have the same number of cells/fields? – markp-fuso Jan 31 '21 at 18:11
  • @markp-fuso For a rough idea, the files I'm testing with have ~30000 lines, each of which contains 20 fields and ~300 characters. (I guess that may not be what everyone would consider a "large number of columns" but I wanted to avoid solutions that would have to do a task 20 times over 20 fields.) I'm looking to replace the whole line. Source and target files will have the same number of columns, but differing numbers of rows. – Ian Renton Jan 31 '21 at 18:19
  • so max file sizes are ~9MB (give or take a few MB)? replace all or part of the line? – markp-fuso Jan 31 '21 at 18:20
  • This is a wish, not a question regarding anything you have done already. –  Jan 31 '21 at 18:36
  • The whole line. Yes, ~9MB is about the right order of magnitude. Obviously `aggregate.csv` will grow a bit as several `new.csv`s are merged into it, but the majority of lines should be a replace rather than an append, so I don't think it will grow that much. – Ian Renton Jan 31 '21 at 18:37
  • @Roadowl I have tried various approaches trying to adapt solutions like [1](https://stackoverflow.com/questions/29528607/search-for-a-pattern-in-column-in-a-csv-and-replace-another-pattern-in-the-same) [2](https://stackoverflow.com/questions/8822097/how-to-replace-a-whole-line-with-sed) [3](https://stackoverflow.com/questions/27600967/merge-csv-files-using-join-awk-sed) without much luck - I have reached the "trying combinations of things I don't really understand" stage of command-line awk usage! – Ian Renton Jan 31 '21 at 18:42
  • It wouldn't hurt to update the question with some of your (coding) attempts; we could then make suggestions on (code) corrections – markp-fuso Jan 31 '21 at 18:46

2 Answers


Assumptions:

  • replace entire lines
  • total file size is ~8-10 MB (so likely won't have any out-of-memory issues)
  • field #1 is unique within each file
  • no sorting requirements for output
  • input not guaranteed to be sorted

One awk solution:

awk -F"," '                      # input delimiter is comma ","

FNR==NR { a[$1]=$0 ; next }      # first file: store contents in array a[], using field #1 as index

        { if ( $1 in a )         # second file: if field #1 is an index in array a[] then
             { print a[$1]       # print the contents of said array entry and
               delete a[$1]      # remove entry from array
             }
          else                   # field #1 is not an index in array a[] so
             print $0            # print current line
        }

END     { for ( i in a )         # anything still in array a[] was not in second file so
              print a[i]         # print these entries
        }
' new.csv aggregate.csv          # order of input files is important !!!

This generates:

1,olddata
3,changeddata
2,changeddata
4,newdata

NOTE: The output order here simply mirrors the line order of aggregate.csv, with wholly new entries from new.csv appended at the end; nothing in this awk code explicitly sorts the output.

This will not overwrite the aggregate.csv file. While it's possible to use the GNU awk -i inplace option to overwrite aggregate.csv, this would require a change in processing to keep from overwriting new.csv. The easiest solution will likely be to redirect this output to a new file, and the OP can then decide whether that new file should overwrite aggregate.csv.
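
For example, a minimal wrapper along those lines (a sketch only; merge.awk and aggregate.tmp are assumed names, with merge.awk holding the quoted awk program above) might be:

# Sketch: write the merged output to a temporary file, then replace
# aggregate.csv only if awk succeeded. merge.awk and aggregate.tmp
# are assumed names, not part of the answer above.
awk -F"," -f merge.awk new.csv aggregate.csv > aggregate.tmp &&
    mv aggregate.tmp aggregate.csv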

markp-fuso
  • Brilliant, thanks for a fully commented solution as well - very helpful for me to understand awk better in future! – Ian Renton Jan 31 '21 at 18:46

With any awk (new.csv is listed first so that, for each ID, its line is the first occurrence and therefore the one kept):

$ awk -F, '!seen[$1]++' new.csv aggregate.csv
4,newdata
2,changeddata
3,changeddata
1,olddata

or with GNU sort, whose -s (stable) flag ensures that -u keeps the first input line for each key (again, the line from new.csv):

$ sort -ust, -k1,1 new.csv aggregate.csv
1,olddata
2,changeddata
3,changeddata
4,newdata
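
If the result should overwrite aggregate.csv directly, sort's -o option can safely name one of the input files as the output, since sort reads all of its input before writing; a sketch using the same flags as above:

$ sort -ust, -k1,1 -o aggregate.csv new.csv aggregate.csv
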
Ed Morton
  • Another great answer, thanks! I'm not sure which one to mark as accepted - this one for its simplicity or the other for its explanatory comments, but both do seem to work. – Ian Renton Jan 31 '21 at 19:26
  • Do you understand them both? If not, feel free to ask questions. If so, which one are you going to use in your code? In general you shouldn't just accept the first answer you get, as an answer that produces the expected output from a given sample input set is the starting point for identifying the best (or even a correct) solution, not the end point, and accepting an answer early discourages other people from posting alternative answers, so it's your loss. It's usually best to wait a few hours or a day to see what answers you get and THEN accept one. – Ed Morton Jan 31 '21 at 19:28
  • Fair point! I do understand your answers now I've done some more reading. I hadn't previously encountered the "seen" command for awk, and I also didn't realise that sort with a -k flag would consider only that field when testing for uniqueness rather than just for determining sort order. Obvious with hindsight! – Ian Renton Jan 31 '21 at 20:13
  • @IanRenton `seen` isn't a command; `seen` is the name of an array – markp-fuso Jan 31 '21 at 20:29
  • Ian, @markp-fuso is correct; I could have named that array `fluffybunny[]` and the code would have behaved the same way. We just idiomatically name the array `seen[]` when it's being used to tell the first occurrence of its index from any subsequent occurrence of that same index. There must be thousands of scripts online using that idiom, see https://stackoverflow.com/questions/57040414/how-to-get-the-unique-elements-of-the-first-column-and-store-it-in-an-array#comment100610137_57040901 for more info. – Ed Morton Jan 31 '21 at 22:51
  • Ah OK, so '!seen[$1]++' alone is enough to say "if field 1 doesn't already exist in the array, then add it to the array and carry on processing, but if it does, stop processing this line", without anything special about "seen". Gotcha, thanks! – Ian Renton Feb 01 '21 at 09:07
  • Close enough. It's technically saying "if the value of `seen[$1]` is zero or null (as it would be if field 1 didn't exist in the array) then increment the value of `seen[$1]` and carry on processing, but if it's not zero or null then stop processing this line". The tiny difference is that `seen[$1]` could be zero or null even if `$1` existed in the array - if you set it to zero or null, or added it to the array without setting it (e.g. just `seen[$1]` on its own would do that) - but you can avoid that happening by just not writing that code! – Ed Morton Feb 01 '21 at 16:03
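
For reference, a long-hand expansion of the !seen[$1]++ idiom (my sketch, not code from the answers above; it behaves the same, subject to the caveat just described) would be:

awk -F, '{
    if (seen[$1] == 0)    # true on the first occurrence of this ID (unset = zero/null)
        print             # keep the line; new.csv is listed first, so its lines win
    seen[$1]++            # count the ID so later duplicates are skipped
}' new.csv aggregate.csv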