5

I have two tabular files. One, lookup_file.txt, contains a mapping of only 50 key/value pairs. The other, data.txt, holds the actual tabular data: 30 columns and millions of rows. I would like to replace the id column of the second file with the values from lookup_file.txt.

How can I do this? I would prefer to use awk in a bash script. Also, is there a hashmap data structure I can use in bash to store the 50 key/value pairs, rather than another file?

user836087
  • You can use declare -A dict as a dictionary in bash, and assign values like dict=( ["key"]="value" ) (a short sketch follows these comments). – Saddam Abu Ghaida Nov 26 '12 at 17:26
  • It might be easier to import into sqlite, join, and export rather than writing a whole lot of bash, grep, sed and awk. – lynks Nov 26 '12 at 17:27
  • Trivial to use awk to load lookup_file.txt as an assoc array, then read through data.txt replacing as needed. Don't have time now, others will show the way. Good luck to all. – shellter Nov 26 '12 at 17:33
  • The millions of rows makes this an interesting problem. Maybe you could give us some sample data? – ddoxey Nov 27 '12 at 06:48
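As a side note on the hashmap question (a sketch, not from the original thread, assuming lookup_file.txt holds one comma-separated key,value pair per line and bash 4+ for declare -A):

#!/usr/bin/env bash
# Load the 50 key/value pairs into a bash associative array.
declare -A dict
while IFS=, read -r key value; do
    dict["$key"]="$value"
done < lookup_file.txt

echo "${dict[3]}"      # look up a single id (assuming 3 is one of the keys)
echo "${#dict[@]}"     # number of pairs stored

Even with the pairs in a bash array, you would still have to hand them to awk (or loop over the millions of rows in pure bash, which would be slow), so the answers below simply load the lookup file inside awk instead.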

4 Answers

7

Assuming your files have comma-separated fields and the "id column" is field 3:

awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }
{ $3 = map[$3]; print }
' lookup_file.txt data.txt

If any of those assumptions are wrong, clue us in if the fix isn't obvious...
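For example, if the files were tab-separated instead (purely an illustration of adjusting that assumption), only the separator assignment would change:

awk '
BEGIN{ FS=OFS="\t" }
NR==FNR { map[$1] = $2; next }
{ $3 = map[$3]; print }
' lookup_file.txt data.txt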

EDIT: and if you want to avoid the (IMHO negligible) performance impact of the NR==FNR test, this would be one of those very rare cases where use of getline is appropriate:

awk '
BEGIN{
   FS=OFS=","
   # read lookup_file.txt once up front, building map[id] = replacement
   while ( (getline line < "lookup_file.txt") > 0 ) {
      split(line,f)
      map[f[1]] = f[2]
   }
}
{ $3 = map[$3]; print }
' data.txt
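One caveat (an assumption about the data, not something stated in the question): with $3 = map[$3], any id that is missing from the lookup becomes empty. If that can happen and you want to keep the original id, guard the assignment:

awk '
BEGIN{ FS=OFS="," }
NR==FNR { map[$1] = $2; next }
{ if ($3 in map) $3 = map[$3]; print }
' lookup_file.txt data.txt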
Ed Morton
5

You could use a mix of sort and join in bash instead of writing it in awk/sed, and it is likely to be even faster:

key.csv (id, name)

1,homer
2,marge
3,bart
4,lisa
5,maggie

data.csv (name,animal,owner,age)

snowball,dog,3,1
frosty,yeti,1,245
cujo,dog,5,4

Now, you need to sort both files on their id columns first:

sort -t, -k1,1 key.csv > sorted_keys.csv
sort -t, -k3,3 data.csv > sorted_data.csv

Now join the 2 files:

join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t, sorted_keys.csv sorted_data.csv > replaced_data.csv

This should produce (rows come out in id order, since join emits matches in the order of the sorted join field):

frosty,yeti,homer,245
snowball,dog,bart,1
cujo,dog,maggie,4

This:

-o "2.1 2.2 1.2 2.4"

tells join which columns from the two files you want in the final output, in file.field notation (so 2.1 is the first column of the second file).

It is pretty fast for finding and replacing multiple gigabytes of data compared to other scripting approaches. I haven't done a direct comparison to sed/awk, but it is much easier (for me at least) to write a bash script wrapping this than to write it in sed/awk.

Also, you can speed up the sort by using a recent version of GNU coreutils, which lets you run the sort in parallel:

sort --parallel=4 -t, -k3,3 data.csv > sorted_data.csv

Here 4 is how many threads you want to run. I was told that two threads per machine core will usually max out the machine, but if it is dedicated to just this task, that is fine.
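Putting the steps together, a minimal wrapper script might look like this (a sketch, using the hypothetical key.csv/data.csv layouts above):

#!/usr/bin/env bash
set -euo pipefail

# sort the lookup on its id (field 1) and the data on its owner id (field 3)
sort -t, -k1,1 key.csv > sorted_keys.csv
sort --parallel=4 -t, -k3,3 data.csv > sorted_data.csv

# 2.1=name, 2.2=animal, 1.2=owner name, 2.4=age
join -1 1 -2 3 -o "2.1 2.2 1.2 2.4" -t, sorted_keys.csv sorted_data.csv > replaced_data.csv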

tunagami
1

There are several ways to do this, but if you want an easy one-liner without much in the way of validation, I would go with an awk/sed solution.

Assume the following:

  1. the files are tab delimited

  2. you are using bash shell

  3. the id in the data file is in the first column

  4. your files look like this:

lookup

1   one
2   two
3   three
4   four
5   five

data

1   col2    col3    col4    col5
2   col2    col3    col4    col5
3   col2    col3    col4    col5
4   col2    col3    col4    col5
5   col2    col3    col4    col5

I would use awk and sed to accomplish this task like this:

awk '{print "sed -i s/^"$1"/"$2"/ data"}' lookup | bash

What this does is go through each line of lookup and write the following to stdout:

sed -i s/^1/one/ data

sed -i s/^2/two/ data

and so on.

It then pipes each line to the shell (| bash), which executes the sed expression. -i means edit in place; you may want -i.bak to create a backup file (you can change the extension to whatever you like). The sed is looking for the id at the start of the line, as indicated by the ^, so you don't end up replacing an 'id' in a column that might not contain an id.

Your output would look like the following:

one     col2    col3    col4    col5
two     col2    col3    col4    col5
three   col2    col3    col4    col5
four    col2    col3    col4    col5
five    col2    col3    col4    col5

Of course, your ids are probably not simply 1 to one, 2 to two, etc., but this might get you started in the right direction. And I use the term right very loosely.
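As a side note (a sketch, not part of the original answer, and assuming GNU sed): rather than invoking sed once per key as above, the lookup can be turned into a single sed script, hypothetically named fix.sed here, and applied in one pass, which matters for the millions of rows mentioned in the question:

awk '{print "s/^"$1"\t/"$2"\t/"}' lookup > fix.sed
sed -i -f fix.sed data

The trailing tab in the pattern also keeps id 1 from matching the start of id 10, 11, and so on.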

matchew
    This would invoke sed once for every row in the "millions of rows" the OP mentioned, creating a temp file each time, editing it and copying that back to the original. It would take a very, very long time to run. – Ed Morton Nov 27 '12 at 13:31
0

The way I'd do this is to use awk to write an awk program to process the larger file:

awk -f <(awk '
   BEGIN{print " BEGIN{"}
        {printf "      a[\"%s\"]=\"%s\";",$1,$2}
   END  {print "      }";
         print "      {$1=a[$1];print $0}"}
   ' lookup_file.txt
) data.txt

That assumes that the id column is column 1; if not, you need to change both instances of $1 in $1=a[$1]
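For illustration (using the five-row lookup sample from the earlier answer), the generated program that the outer awk -f runs against data.txt would look roughly like this:

BEGIN{
      a["1"]="one"; a["2"]="two"; a["3"]="three"; a["4"]="four"; a["5"]="five";
      }
      {$1=a[$1];print $0}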

rici
  • why would you do it that way instead of just doing it in one awk script? If you want to avoid the NR==FNR test performance impact (not worth avoiding IMHO) you could just read lookup_file.txt into an array with a getline loop in the BEGIN section. – Ed Morton Nov 27 '12 at 13:33
  • @EdMorton If it's one-shot, maybe it's not worth it, but most of the time it will prove useful to have the translation script kicking around without having to regenerate it. That makes it much easier to, for example, work on segments of the big file. In practice, I would probably make the last line of the generated script `{$1=a[$1]}` so that I can append processing with `--source` on the command line. – rici Nov 27 '12 at 15:27