
I have two files as follows:

file1:

3 1
2 4
2 1

file2:

23
9
7
45

The second field of file1 is used to specify the line of file2 that contains the number to be retrieved and printed. In the desired output, the first field of file1 is printed and then the retrieved field is printed.

Desired output file:

3 23
2 45
2 23

Here is my attempt to solve this problem:

IFS=$'\r\n' baf2=($(cat file2));echo;awk -v av="${baf2[*]}"  'BEGIN {split(av, aaf2, / /)}{print $1, aaf2[$2]}' file1;echo;echo ${baf2[*]}

However, this script cannot use the Bash array baf2.

The solution must be efficient since file1 has billions of lines and file2 has millions of lines in the real case.

Kadir
  • This may point you in the right direction: http://stackoverflow.com/questions/6022384/bash-tool-to-get-nth-line-from-a-file – Ákos Feb 19 '14 at 08:13

3 Answers

You can use this awk command:

awk 'FNR==NR {a[NR]=$1;next} {print $1,a[$2]}' file2 file1
3 23
2 45
2 23

Store file2 in array a.
Then print field 1 from file1 and use field 2 to look up the value in the array.
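On the cost of this approach: awk arrays are associative, and common implementations back them with hash tables, so each lookup is expected to take roughly constant time. Under that assumption the command makes one pass over file2 and one pass over file1 rather than doing O(mn) work. A rough way to check this on your own machine is sketched below. The sizes `n` and `m`, the temporary directory, and the synthetic data are illustrative assumptions, not taken from the question.

```shell
#!/bin/sh
# Synthetic inputs; the sizes are arbitrary assumptions for a quick check.
tmp=$(mktemp -d)
n=100000      # number of lines in file2
m=200000      # number of lines in file1

# file2: line k simply contains the number k.
seq 1 "$n" > "$tmp/file2"

# file1: a first field plus a pseudo-random line number in 1..n.
awk -v n="$n" -v m="$m" 'BEGIN {
    srand(1)
    for (i = 1; i <= m; i++) print i % 10, int(rand() * n) + 1
}' > "$tmp/file1"

start=$(date +%s)
# One pass over file2 to fill the array, one pass over file1 for the lookups.
awk 'FNR==NR {a[NR]=$1; next} {print $1, a[$2]}' "$tmp/file2" "$tmp/file1" > "$tmp/out"
end=$(date +%s)

out_lines=$(awk 'END {print NR}' "$tmp/out")
echo "output lines: $out_lines, elapsed seconds: $((end - start))"
```

Only file2 is held in memory; file1 is streamed line by line, so memory use should depend on the size of file2 rather than on the size of file1.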

Jotne
  • Dear @Jotne, what is the run-time complexity of this solution where the number of lines in `file1` is `m` and the number of lines in `file2` is `n`? It should not be `O(mn)` since `m` and `n` are very large. – Kadir Feb 19 '14 at 08:15
  • I have no idea how long it takes, but you can type `time awk 'FNR...` and see how long it takes. – Jotne Feb 19 '14 at 08:23
  • Dear @Jotne, after trying your solution, I have realized that the MWE that I gave did not show all properties of my real data set. The first fields of `file1` may be the same. – Kadir Feb 19 '14 at 08:42
  • @Kadir I do not know what MWE is, and if the data is different, then update your post, or if it is very different, create a new one. – Jotne Feb 19 '14 at 09:32

This has a similar basis to Jotne's solution, but loads file2 into memory first (since it is smaller than file1):

awk 'FNR==NR{x[FNR]=$0;next}{print $1 FS x[$2]}' file2 file1

Explanation

The FNR==NR part means that the part that follows in curly braces is only executed when reading file2, not file1. As each line of file2 is read, it is saved in array x[] as indexed by the current line number. The part in the second set of curly braces is executed for every line of file1 and it prints the first field on the line followed by the field separator (space) followed by the entry in x[] as indexed by the second field on the line.
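The same program can be spread over several lines with one comment per step, which may make the explanation above easier to follow. This is a sketch using the sample data from the question; the temporary directory and the `result` variable are assumptions added for illustration.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf '3 1\n2 4\n2 1\n' > "$tmp/file1"
printf '23\n9\n7\n45\n'  > "$tmp/file2"

result=$(awk '
    # First file on the command line (file2): FNR and NR are still equal.
    FNR == NR {
        x[FNR] = $0      # remember the whole line under its line number
        next             # skip the second block while reading file2
    }
    # Second file (file1): NR keeps counting, FNR has restarted at 1.
    {
        print $1 FS x[$2]
    }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
```

With the sample data this prints `3 23`, `2 45` and `2 23`. One caveat: if file2 were empty, NR would still be 0 when file1 starts, so the `FNR == NR` test would match the lines of file1 instead.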

Mark Setchell

Using awk

1) Print all lines of file1, whether or not there is a match

awk 'NR==FNR{a[NR]=$1;next}{print $1,a[$2]}' file2 file1

2) Print matching lines only

awk 'NR==FNR{a[NR]=$1;next}$2=a[$2]' file2 file1
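The difference between the two variants only shows up when file1 refers to a line that file2 does not have. The sketch below uses a modified file1 whose second line points at line 9 of a four-line file2. That input, the temporary directory, and the variable names `all` and `matched` are assumptions for illustration.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '3 1\n5 9\n2 4\n' > "$tmp/file1"   # "5 9" refers to a missing line
printf '23\n9\n7\n45\n'  > "$tmp/file2"

# Variant 1: every line of file1 is printed; a missing lookup prints as empty.
all=$(awk 'NR==FNR{a[NR]=$1;next}{print $1,a[$2]}' "$tmp/file2" "$tmp/file1")

# Variant 2: the assignment doubles as the pattern, so a line is printed
# only when the retrieved value is considered true by awk.
matched=$(awk 'NR==FNR{a[NR]=$1;next}$2=a[$2]' "$tmp/file2" "$tmp/file1")

printf 'all:\n%s\nmatched:\n%s\n' "$all" "$matched"
```

Here variant 1 prints three lines, the middle one being `5` followed by a trailing space, while variant 2 prints only `3 23` and `2 45`. Because variant 2 relies on the truth value of the retrieved entry, a line of file2 that is empty or numerically zero may also be dropped.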
BMW