
I'm working with files that have ~400 million rows and 3 columns. The first two columns are alpha-numeric strings while the last is numeric. Something like this:

NM_001011874.1,NM_011441.4,-0.131672299779153

I have multiple such files with approximately the same number of rows and 3 columns. They aren't necessarily sorted on any of the 3 columns. I'm trying to combine these files based on the combination of the first two columns. For instance:

File 1

NM_001011874.1,XR_104537.1,-0.929524370261122
NM_001011874.1,XM_003084433.1,-0.603098764428879

File 2

NM_001011874.1,XR_104537.1,-0.11254525414
NM_001011874.1,NM_005040.1,-0.20509876488

File 3

NM_001011874.1,XR_104537.1,-0.41254525414
NM_001011874.1,NM_005040.1,-0.60509876488

What I'm trying to do is build a key from the combination of the first two columns and then pull the corresponding third-column value from each file for that pair. The final output I get is:

Output2 

NM_001011874.1,XR_104537.1,-0.11254525414,-0.929524370261122,-0.41254525414
NM_001011874.1,NM_005040.1,-0.20509876488,,-0.60509876488

I'm using awk to do the above:

awk -F',' 'NR==FNR{a[$1,$2]=$3;next}{$4=a[$1,$2];print}' OFS=',' file1.txt file2.txt

I'm allocating 256GB for the task. The above command takes about 90 minutes to produce the output from two files, where each file has ~400 million rows and 3 columns. The output file again has ~400 million rows but 4 columns. The time taken to produce the output file increases with every added column.

I'm doing it sequentially, i.e. merge file1 and file2 to produce output1 with 4 columns, then merge file3 and output1 to produce output2 with 5 columns, then file4 and output2 to produce output3 with 6 columns, and so on until I have the final output with 22 columns.

I wonder if it would be more efficient, in terms of speed and automation, to do this in Python or Perl? I have about 20 such files with 3 columns each, although the row counts vary from ~100 million to ~400 million. If you think I would be better off doing this in Python or Perl, can you please share an example to illustrate how the awk script translates to Python or Perl?
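
For reference, this is roughly how I imagine the awk one-liner would translate to Python (an untested sketch, using the same file names as in the command above):

import sys

# Rough Python equivalent of:
# awk -F',' 'NR==FNR{a[$1,$2]=$3;next}{$4=a[$1,$2];print}' OFS=',' file1.txt file2.txt
lookup = {}
with open('file1.txt') as f1:
    # First file: build the (col1, col2) -> col3 map, like awk's a[$1,$2]=$3
    for line in f1:
        col1, col2, value = line.rstrip('\n').split(',')
        lookup[(col1, col2)] = value

with open('file2.txt') as f2:
    # Second file: append the looked-up value as a 4th column (empty if absent)
    for line in f2:
        col1, col2, value = line.rstrip('\n').split(',')
        extra = lookup.get((col1, col2), '')
        sys.stdout.write(f'{col1},{col2},{value},{extra}\n')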

Edit: Added File 3 and the final output after that based on the comments.

  • Seems like something for Pandas or Numpy. – TigerhawkT3 Jan 26 '17 at 00:16
  • Of course it's better to have a script, for every purpose. Why don't you write one and time it? I'd expect it to be two orders of magnitude faster in a straight-up Perl script. – zdim Jan 26 '17 at 00:20
  • @zdim Why would you think it'd be faster in perl than awk? gkandoi what's the point of doing `$4=a[$1,$2];print` and forcing awk to recompile $0 instead of simply `print $0, a[$1,$2]`? – Ed Morton Jan 26 '17 at 00:35
  • @EdMorton No particular reason for perl vs awk -- I just don't see why simple processing on that amount of data would go for an hour and a half. Perhaps I'm misjudging something. I'd scale the estimate down to one order of magnitude faster. – zdim Jan 26 '17 at 00:38
  • @zdim I agree that seems like a ridiculous duration. I can't imagine why it'd take that long but I suspect the OP is doing this in a shell loop and THAT is where the problem is coming from, nothing to do with the awk script he's calling repeatedly in that loop. gkandoi if you post some concise, testable sample input (you say you have more than 2 files so don't just post 2 files, post at least 3) and expected output then we can start trying to help you. – Ed Morton Jan 26 '17 at 00:39
  • 1
    I would throw it at a db. SQL or NoSQL, doesn't matter. – Tanktalus Jan 26 '17 at 00:45
  • @EdMorton If it involves shell that can easily explain it. I just wrote a 100-million-line file with a similar structure (2 mins, involving `rand`) and then processed it with a one-liner (Perl) of a comparable complexity -- just under 3 mins. – zdim Jan 26 '17 at 00:47
  • How fast is your disk? And what's the sum of the output file sizes? – Stefan Pochmann Jan 26 '17 at 00:52
  • @EdMorton I've added File 3 and the final output after using the awk one-liner. I picked up the script from a blog and used it as it was without thinking about the speed. It was only after I ran it through a few files that I realized it was taking very long. – gkandoi Jan 26 '17 at 00:59
  • @StefanPochmann I don't think it is the best of the clusters out there, but here is the detailed information on the cluster: http://hpcgroup.public.iastate.edu/HPC/lightning3/description.html – gkandoi Jan 26 '17 at 01:02

1 Answer


When you have huge data files and you want to work with them efficiently, you're probably best off throwing them into a SQLite database, indexing them, and then querying them. See my answer about CSV vs SQLite performance for more on that.

Create a table for the data (stuff is a terrible name, but I don't know what this data is, so it's "stuff").

create table stuff (
    key1 text,
    key2 text,
    value real
);

Import your CSVs into the table using the SQLite shell.

sqlite> .mode csv
sqlite> .import file1 stuff
sqlite> .import file2 stuff
sqlite> .import file3 stuff

Create an index for the keys.

create index stuff_key on stuff (key1, key2);

Query to your heart's content.

select value
from stuff
where key1 = 'NM_001011874.1' and
      key2 = 'XR_104537.1';

-0.929524370261122
-0.11254525414
-0.41254525414
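
Since the question asks about Python: the same query can also be run from Python's built-in sqlite3 module. This is only a minimal sketch, and it assumes the database was saved to a file called stuff.db (use whatever name you gave it):

import sqlite3

# Assumes the CSVs were already imported into stuff.db as shown above.
con = sqlite3.connect('stuff.db')
cur = con.execute(
    "select value from stuff where key1 = ? and key2 = ?",
    ('NM_001011874.1', 'XR_104537.1'))
for (value,) in cur:
    print(value)
con.close()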

Once the import and indexing are done, it won't matter how large the data gets. Instead of updating the CSVs and reimporting all of them, you can import small CSV files containing just the new data, or you can skip the CSVs and insert the rows directly.

insert into stuff (key1, key2, value)
values ('NM_204958293.2', 'XR_29238498.3', -239.2),
       ('NM_2904892.3', 'XR_3093.0', 9482.39);

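The same direct insert can be driven from Python, streaming a whole CSV into the table with executemany. Again just a sketch; new_data.csv is a placeholder name and stuff.db is the assumed database file:

import csv
import sqlite3

con = sqlite3.connect('stuff.db')
with open('new_data.csv', newline='') as f:
    # csv.reader yields (key1, key2, value) rows straight from the file.
    con.executemany(
        "insert into stuff (key1, key2, value) values (?, ?, ?)",
        csv.reader(f))
con.commit()
con.close()
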
I tested the performance of this, because I've advocated it a bunch of times but had never actually tested it.

First I cleared a bunch of disk space, because these files are going to get big. I'm doing this on a MacBook Pro i7 that was state of the art in 2011. Fortunately it has an aftermarket SSD, so I/O performance is excellent. It's no slouch, but it's not a top-of-the-line server either. The point is, you don't need fancy hardware to get good performance.

Then I wrote a Perl program to generate 400 million rows of data, and while that was running I wrote a C program to do it faster. In a rare instance of program time being more important than programmer time for a one-off script, the C program finished first, winding up with two roughly identical 14G files. They're a bit different, but it won't matter for our purposes.

Then I created the table and started the import. The initial import time isn't terribly important because I don't have to sit here and stare at it or baby it. I know it's going to work, and I know I only have to do it once, so I can work on any number of things in parallel (like editing this post). Unfortunately, SQLite isn't working in parallel; it seems to be using only one core. OTOH, it isn't using more than about 3 megs of memory.

Importing one 400 million row file took 20 minutes. The resulting SQLite database is about 17 gigs, so no great expansion of the data. I'm not going to do the rest because it rapidly becomes redundant.

Now I'm creating the index. Again, it's a one-time thing that I don't have to sit around and watch... except I do, because it's using about a gig of virtual memory and the SQLite file is now 30 gigs. So... more deleting of files. Building the index took about 30 minutes.

50 minutes to import and index, using 30 gigs of disk, about twice the size of the original data. No programming necessary.

  • 1
    Thank you, it is helpful to be aware/reminded of this option. I am thinking, it's not just about a lot of data -- it's also good for its relational capabilities, that it makes available. – zdim Jan 27 '17 at 22:03