
I have two huge comma-delimited files.
The 1st file has 280 million lines and the following columns:

first name, last name, city, state, ID, email*, phone

John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen@gmail.com,12041284

The 2nd file has 420 million lines and the following columns:

first name, last name, city, state, email
John,Smith,LA,CA,johnsmith@hotmail.com
Bob,Marble,SF,CA,bobmarble@gmail.com
Tai,Nguyen,SD,CA,tainguyen@gmail.com

* a lot of these fields are empty

I want to merge all the lines from both files whose first 4 columns match, then fill in the missing emails of the first file with the emails from the second file. If an email in the first file is not blank, don't change it. The matching should be case insensitive. If there are multiple lines that have the same 4 key columns, just ignore those instances and do the work on unique instances only.

The result should have the following columns and look like this

first name, last name, city, state, ID, email, phone
John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,bobmarble@gmail.com,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen@gmail.com,12041284    

The output should only contain lines where all 4 columns match, not just 1 or 2 or 3. My boss insists on using a Bash shell script for this and I am a newbie in Bash. Please help me with a clear explanation.

I did my reading and understand that awk requires storing information in memory. However, I could split the big files into smaller files and use awk on those. I copied some code online and changed it to my needs, but whenever it fills in a blank email, it also changes the field delimiter from a comma into a space. I want to stop that but don't know how. Please help me solve this problem. All advice and answers are highly appreciated.

awk -F "," 'NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt
Phat Tran
  • It would be nice to put an example of exactly what you want: the first 3 lines of the two files with the expected output. – Picaud Vincent May 02 '19 at 10:09
  • Which email should be used in case both email fields of matching lines are non-empty and differ from each other? – Socowi May 02 '19 at 10:42
  • Are these four key columns unique in each file? Only one `John,Doe,NYC,New York` can ever appear in a given one? – Shawn May 02 '19 at 12:58
  • It is possible that there is more than one `John,Doe,NYC,New York`; if that is the case then ignore them, I would say. Thanks for the questions, guys. I provided the example as well as the email details in the question. – Phat Tran May 03 '19 at 04:52

1 Answer


The awk approach you showed is not suited for files that big. It stores parts of the files in memory. With the same approach you would need to store one of the following in memory:

  • 280 million entries of the form first name, last name, city, state → ID, phone
  • 420 million entries of the form first name, last name, city, state → email

Assume we go with the first option and each entry takes up only 50 bytes of memory. To store all 280 million entries we need 280M·50B = 14'000 MB = 14 GB. This is the absolute minimum of memory you need to run the awk command. In reality it would be even more due to implementation details of associative arrays.
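
If you want to sanity-check that estimate, the arithmetic is easy to reproduce in the shell (decimal units, same 50-byte-per-entry assumption as above):

# 280 million keys at ~50 bytes each, expressed in GB
echo $(( 280000000 * 50 / 1000000000 ))   # prints 14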

What you can do instead

Use the classical approach to the problem:

  1. sort both files
  2. join the files by their first four columns*
  3. cut the desired columns from the joined result**

* Needs some pre- and post-processing, as join can only join on one column.
** Since we have to re-arrange the email column, cut is not sufficient. We can use awk instead.
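
To make the key trick concrete, this is what the prefixing step of the script below produces for one sample row from the question (a sketch assuming GNU sed, since \L/\E are GNU extensions):

# illustration only, not part of the final script
printf '%s\n' 'John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312' |
  sed -E 's/([^,]*,){4}/\L&\E\t&/'
# output, where <TAB> stands for the tab character:
# john,smith,la,ca,<TAB>John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312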

#!/bin/bash

# prepend a lowercased copy of the first 4 fields (the join key) plus a tab;
# \L ... \E (lowercasing) is a GNU sed extension
prefixWithKey() {
    sed -E 's/([^,]*,){4}/\L&\E\t&/' "$1"
}
# sort a key-prefixed file by its key (everything before the tab), in place
sortByKeyInPlace() {
    sort -t $'\t' -k1,1 -o "$1" "$1"
}
# join the two key-prefixed, sorted files on their keys
joinByKey() {
    join -t $'\t' "$1" "$2"
}
# joined lines look like: key <TAB> line-from-file1 <TAB> line-from-file2;
# after splitting on tabs and commas, $5-$11 are the 1st file's fields and
# $12-$16 are the 2nd file's, so $16 is the 2nd file's email
cutColumns() {
    awk 'BEGIN{FS="\t|,\t*"; OFS=","} {print $5,$6,$7,$8,$9,$16,$11}'
}

file1="your 1st input file.csv"
file2="your 2nd input file.csv"
# build a key-prefixed, sorted temporary copy of each input file
for i in "$file1" "$file2"; do
   prefixWithKey "$i" > "$i.tmp"
   sortByKeyInPlace "$i.tmp"
done
# join on the key, pick the output columns, then remove the temporary files
joinByKey "$file1.tmp" "$file2.tmp" | cutColumns > result.csv
rm "$file1.tmp" "$file2.tmp"

This script assumes that the input files have no headers and contain no tabs. We always take the email field from the 2nd file, no matter whether the email field of the 1st file was defined or not.
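
If you want to honour the rule from the question (keep the 1st file's email when it is non-empty and only fall back to the 2nd file's email), one way is a small change to cutColumns. This is an untested sketch that relies on the field numbering noted in the comments above ($10 is the email from the 1st file, $16 the one from the 2nd):

cutColumns() {
    awk 'BEGIN{FS="\t|,\t*"; OFS=","} {print $5,$6,$7,$8,$9,($10!="" ? $10 : $16),$11}'
}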

I barely tested this script because you didn't provide any example input. If you encounter errors and share a short input that triggers them, I would be happy to fix the script (if it needs fixing).

In theory the script could be written without temporary files. I intentionally used temporary files because of the input size. Programs like sort may run faster on files.

This script could be sped up, for instance by (see the sketch after this list):

  • Executing both calls to prefixWithKey in parallel.
  • Adding LC_ALL=C in front of commands like sort.
  • Adding options to sort, for instance -S 70%.
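
A possible variant of the driver part with those tweaks applied (an untested sketch; if you sort with LC_ALL=C you should also run join with LC_ALL=C so both tools agree on the collation order):

for i in "$file1" "$file2"; do
   # run both key-prefixing passes in parallel
   prefixWithKey "$i" > "$i.tmp" &
done
wait
for i in "$file1" "$file2"; do
   # byte-wise collation and a bigger in-memory sort buffer
   LC_ALL=C sort -S 70% -t $'\t' -k1,1 -o "$i.tmp" "$i.tmp"
done
LC_ALL=C join -t $'\t' "$file1.tmp" "$file2.tmp" | cutColumns > result.csv
rm "$file1.tmp" "$file2.tmp"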

Further Alternatives

For files that big it could be faster to load them into a database and process them there. There is even the tool q for doing things like this in a single command, but from what I experienced it's very slow.
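
For completeness, a rough sketch of the database route with sqlite3 (untested; the table and column names here are made up, and the import itself will take a while at this size):

sqlite3 merge.db <<'EOF'
.mode csv
-- hypothetical schemas matching the two files (the CSVs have no header rows)
CREATE TABLE f1(fn TEXT, ln TEXT, city TEXT, state TEXT, id TEXT, email TEXT, phone TEXT);
CREATE TABLE f2(fn TEXT, ln TEXT, city TEXT, state TEXT, email TEXT);
.import file1.csv f1
.import file2.csv f2
.output result.csv
SELECT f1.fn, f1.ln, f1.city, f1.state, f1.id,
       CASE WHEN f1.email <> '' THEN f1.email ELSE f2.email END,
       f1.phone
FROM f1 JOIN f2
  ON  lower(f1.fn)    = lower(f2.fn)
  AND lower(f1.ln)    = lower(f2.ln)
  AND lower(f1.city)  = lower(f2.city)
  AND lower(f1.state) = lower(f2.state);
EOF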

Socowi
  • I keep wishing join supported multi-column join fields. Hmm... shouldn't be too hard to write a version that does. *Throws it on the idea pile*. – Shawn May 02 '19 at 13:03