
I have a very large file (~5 million lines) containing numbers.

numbers.txt:

1
5
1
4
2
20
1
...

I have another file containing data (~1 million lines).

data.txt:

1.000000 -1.072000 -1.000000
2.000000 -1.213000 1.009900
-1.210000 -1.043000 1.000000
-1.000000 -1.000000 -1.000000
1.000000 1.000000 -0.999999
...

numbers.txt contains line numbers that refer to lines of data.txt. I need to output a file in which each line number from numbers.txt is replaced by the corresponding line from data.txt. So for the above example the output would look like:

1.000000 -1.072000 -1.000000
1.000000 1.000000 -0.999999
1.000000 -1.072000 -1.000000
-1.000000 -1.000000 -1.000000
2.000000 -1.213000 1.009900
...

I think awk would be the right way to go, but I'm unable to figure out how to do it.

There are two caveats:

  • Files are very large, so reading everything into memory is not an option.
  • The output has to retain the order of numbers.txt. Sorting is not an option.

I did find this question, but it doesn't satisfy the caveats.

Capstone
  • Would you accept an answer using Python's `linecache` module? It would seem wise to cache lines given the line numbers can repeat. – kojiro Feb 18 '14 at 18:55
  • @kojiro Yes. As long as I can run it from the shell and have the output file created in a reasonable amount of time. Although, I have 0 experience with Python. – Capstone Feb 18 '14 at 19:02

2 Answers


This is pretty much what Python's linecache module was built for:

#!/usr/bin/env python
# Python 2; linecache reads data.txt once and caches its lines in memory.

import sys
from linecache import getline

with open('numbers.txt') as lines:
  for line in lines: # Read each line number from numbers.txt
    try:
      # getline returns the requested line with its trailing newline,
      # so write it as-is rather than print it, which would add a blank line.
      sys.stdout.write(getline('data.txt', int(line)))
    except ValueError:
      pass # line did not contain a numeral, so ignore it.

You can do this as a one-liner as well:

python -c 'import linecache; print "".join(linecache.getline("data.txt", int(line)) for line in open("numbers.txt"))'
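
If the result should go straight to a file rather than to stdout, the same lookup can write to it directly; this is only a minimal sketch, and the output filename output.txt is just an example:

#!/usr/bin/env python
# Sketch (Python 2.7+): same linecache lookup, but written to a file instead of stdout.
from linecache import getline

with open('numbers.txt') as lines, open('output.txt', 'w') as out:
  for line in lines:
    try:
      out.write(getline('data.txt', int(line))) # getline keeps the trailing newline
    except ValueError:
      pass # skip blank or non-numeric lines
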
kojiro
  • This works really well. Thanks! However kuroi's answer does a slightly better job. I ran both solutions multiple times and awk is marginally faster. Memory consumed by both peaks at around 150M. Some stats: Your solution: real 1m48.343s user 2m8.374s sys 0m43.307s Kuroi's solution: real 1m40.651s user 2m1.133s sys 0m42.438s For a file around 525MB in size. Upvoted though. Thanks again. – Capstone Feb 18 '14 at 21:12
  • Marginal, indeed. Is it a race? Try preloading the line numbers for a little boost: `python -c 'import linecache;print "\n".join(linecache.getline("data.txt", line) for line in map(int, open("numbers.txt").read().split()))'` :) – kojiro Feb 18 '14 at 22:14

Only the data file has to be retained in memory, so the index file can be of an arbitrary size.

If your data file is 1 million lines of about 40 characters each, it should fit in about 40 MB, which is a breeze for your average PC.

Re-opening the data file to fetch one line at a time would be way slower, even with disk caching.

So I think you could safely go for a solution that would fetch the entire data file into memory.

Here is how I would do it in awk:

gawk '{ if (NR==FNR) l[NR]=$0; else print l[$1] }' data.txt numbers.txt

While NR==FNR, awk is still reading the first file (data.txt), so each line is stored in the array l under its line number; once awk moves on to numbers.txt, each index is simply looked up and printed. Note the single quotes around the program: with double quotes, most Unix shells would expand $0 and $1 before awk ever sees them.

With this input

data.txt

1 1.000000 -1.072000 -1.000000
2 2.000000 -1.213000 1.009900
3 -1.210000 -1.043000 1.000000
4 -1.000000 -1.000000 -1.000000
5 1.000000 1.000000 -0.999999
6 1.000000 -1.072000 -1.000000
7 2.000000 -1.213000 1.009900
8 -1.210000 -1.043000 1.000000
9 -1.000000 -1.000000 -1.000000
10 1.000000 1.000000 -0.999999
11 1.000000 -1.072000 -1.000000
12 2.000000 -1.213000 1.009900
13 -1.210000 -1.043000 1.000000
14 -1.000000 -1.000000 -1.000000
15 1.000000 1.000000 -0.999999
16 1.000000 -1.072000 -1.000000
17 2.000000 -1.213000 1.009900
18 -1.210000 -1.043000 1.000000
19 -1.000000 -1.000000 -1.000000
20 1.000000 1.000000 -0.999999

(I added an index in front of your sample data for testing).

numbers.txt

1
5
1
4
2
20
1

it produces

1 1.000000 -1.072000 -1.000000
5 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000
4 -1.000000 -1.000000 -1.000000
2 2.000000 -1.213000 1.009900
20 1.000000 1.000000 -0.999999
1 1.000000 -1.072000 -1.000000

Performance test

I used this PHP script to generate a test case:

<?php
$MAX_DATA  = 1000000;
$MAX_INDEX = 5000000;

$contents = "";
for ($i = 0 ; $i != $MAX_DATA ; $i++) $contents .= ($i+1) . " " . str_shuffle("01234567890123456789012345678901234567890123456789") . "\n";
file_put_contents ('data.txt', $contents);

$contents = "";
for ($i = 0 ; $i != $MAX_INDEX ; $i++) $contents .= rand(1, $MAX_DATA) . "\n";
file_put_contents ('numbers.txt', $contents);

echo "done.";
?>

With a random input of 1M data lines and 5M indexes, the awk script above took about 20 seconds to produce a result on my PC.
The data file was about 56 MB and the awk process consumed about 197 MB.

As one could have expected, the processing time is roughly proportional to the size of the index file for a given set of data.
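
For comparison, the same keep-the-data-file-in-memory approach can be sketched in Python; this is a rough sketch rather than the script benchmarked above, and the output filename output.txt is just an example:

#!/usr/bin/env python
# Sketch: read data.txt into a list once, then stream numbers.txt and
# look each requested line up by its 1-based index.
with open('data.txt') as f:
  data = f.read().splitlines() # whole data file held in memory

with open('numbers.txt') as numbers, open('output.txt', 'w') as out:
  for n in numbers:
    n = n.strip()
    if n: # skip blank lines; assumes every index is within range
      out.write(data[int(n) - 1] + '\n')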

kuroi neko
  • Much thanks for this. See comment on kojiro's answer for stats if you're interested. Thanks again. – Capstone Feb 18 '14 at 21:15
  • LOL that's what I call a benchmark. I would like to have a PC as fast as yours! I guess python and awk are doing basically the same thing, only awk is a more compact and specialized tool, that is probably why it gets a slight bit ahead. – kuroi neko Feb 18 '14 at 23:13