I got two huge comma delimited files.
The 1st file has 280 million lines and the following columns
first name, last name, city, state, ID, email*, phone
John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen@gmail.com,12041284
The 2nd file has 420 million lines and the following columns
first name, last name, city, state, email
John,Smith,LA,CA,johnsmith@hotmail.com
Bob,Marble,SF,CA,bobmarble@gmail.com
Tai,Nguyen,SD,CA,tainguyen@gmail.com
* a lot of these fields are empty
I want to merge all the lines from both files that has the first 4 columns match. Then fill in the missing emails of the first file by the emails from the second files if the email is not blank then don't change it. The process should be case insensitive. In case there are many instances that have the same 4 information then just ignore these instance and do the work on unique instances only.
The result should have the following columns and look like this
first name, last name, city, state, ID, email, phone
John,Smith,LA,CA,123123123123,johnsmith@yahoo.com,12312312
Bob,Marble,SF,CA,120947810924,bobmarble@gmail.com,48595920
Tai,Nguyen,SD,CA,134124124124,tainguyen@gmail.com,12041284
They should only print out things that has 4 columns matched not 1 or 2 or 3. My boss insist on using Bash shell script for this and I am a newbie in Bash. Please help me with a clear explanation as I am so newbie.
I do my reading and understand that awk require storing information into cpu memory. However, I can split the big files into small files and use awk in that case. I copy some code online and change it to my need but whenever it fills in the blank email, it also reformats the line delimiter from comma into space. I want to stop that but don't know how. Please help me to solve this problem. All advises and answers are highly appreciated.
awk -F "," 'NR==FNR{a[$1,$2,$3,$4]=$5;next}{if ($6 =="") $6=a[$1,$2,$3,$4];print}' file2.txt file1.txt > file3.txt