The question "How to get the pivot lines from two tab-separated files?" shows a quick way to pivot lines from two files using Unix commands.
Suppose we have two pairs of files:
- f1a and f1b
- f2a and f2b
The goal is to produce a 3-column tab-separated file that comprises:
- f1a / f2a
- f1b
- f2b
Here f1a / f2a are the lines that occur in both f1a and f2a, and the other two columns hold the corresponding lines from f1b and f2b.
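To make the goal concrete, here is a minimal sketch on made-up data. The list contents and the `pivot` helper are hypothetical and only illustrate the expected shape of the output; they assume each file pair is line-aligned (line i of f1a goes with line i of f1b).

```python
# Toy stand-ins for the four files (hypothetical contents).
f1a_lines = ["apple", "banana", "cherry"]
f1b_lines = ["A1", "B1", "C1"]
f2a_lines = ["banana", "cherry", "date"]
f2b_lines = ["B2", "C2", "D2"]


def pivot(a1, b1, a2, b2):
    """Return (key, b1 value, b2 value) rows for keys present in both a1 and a2."""
    first = dict(zip(a1, b1))
    second = dict(zip(a2, b2))
    # Sort the shared keys only to make the toy output deterministic.
    return [(k, first[k], second[k]) for k in sorted(first.keys() & second.keys())]


rows = pivot(f1a_lines, f1b_lines, f2a_lines, f2b_lines)
for row in rows:
    print("\t".join(row))
```

Only "banana" and "cherry" appear in both key files, so only those two rows are written, each with its f1b and f2b partner.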
I've tried the following, which works, but if the files are extremely large (e.g. billions of lines), it takes a significant amount of memory to hold the f1 and f2 dictionaries.
import sys
from tqdm import tqdm

f1a, f1b, f2a, f2b = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]

# Read the first pair of files into memory, mapping each f1a line to its f1b line.
with open(f1a) as fin_f1a, open(f1b) as fin_f1b:
    f1 = {s.strip().replace('\t', ' '): t.strip().replace('\t', ' ')
          for s, t in tqdm(zip(fin_f1a, fin_f1b))}

# Read the second pair of files the same way.
with open(f2a) as fin_f2a, open(f2b) as fin_f2b:
    f2 = {s.strip().replace('\t', ' '): t.strip().replace('\t', ' ')
          for s, t in tqdm(zip(fin_f2a, fin_f2b))}

# Write one row for every key that occurs in both f1a and f2a.
with open('pivoted.tsv', 'w') as fout:
    for s in tqdm(f1.keys() & f2.keys()):
        print('\t'.join([s, f1[s], f2[s]]), end='\n', file=fout)
Is there a faster/better/easier way to produce the same 3-column tab-separated file in Python? Are there libraries that can do such operations efficiently for huge files?
Using turicreate.SFrame, I could also do:
import sys
from turicreate import SFrame

f1a, f1b, f2a, f2b = sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4]

# Read each file as a single-column SFrame; with header=False the column is named X1.
sf1a = SFrame.read_csv(f1a, delimiter='\0', header=False)
sf1b = SFrame.read_csv(f1b, delimiter='\0', header=False)
sf2a = SFrame.read_csv(f2a, delimiter='\0', header=False)
sf2b = SFrame.read_csv(f2b, delimiter='\0', header=False)
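# Pair the line-aligned files column-wise; a join on X1 would match values, not positions.
sf1 = sf1a.add_column(sf1b['X1'], column_name='f1b')
sf2 = sf2a.add_column(sf2b['X1'], column_name='f2b')
# Keep only the rows whose X1 value occurs in both pairs.
sf = sf1.join(sf2, on='X1', how='inner')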
sf.save('pivoted')
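One more direction that might keep memory bounded is to push the join to disk with the standard-library `sqlite3` module. The sketch below is an untested-at-scale illustration, not a benchmarked solution: the function names, table names, and the `pivot.db` path are all hypothetical, and it assumes each pair of files is line-aligned. Unlike the dictionary version, which keeps only the last value for a repeated key, a SQL join emits every matching combination when keys repeat.

```python
import sqlite3


def clean(line):
    # Same normalisation as the dictionary version: trim and replace tabs.
    return line.strip().replace("\t", " ")


def load_pair(conn, table, path_a, path_b):
    """Stream two line-aligned files into an on-disk table of (key, val) rows."""
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} (key TEXT, val TEXT)")
    with open(path_a) as fa, open(path_b) as fb:
        # A generator keeps only one pair of lines in Python memory at a time.
        rows = ((clean(s), clean(t)) for s, t in zip(fa, fb))
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    # Index the key column so the join does not scan the whole table per row.
    conn.execute(f"CREATE INDEX idx_{table}_key ON {table} (key)")
    conn.commit()


def pivot_to_tsv(f1a, f1b, f2a, f2b, out_path, db_path="pivot.db"):
    """Write key, f1b value, f2b value for keys that occur in both f1a and f2a."""
    conn = sqlite3.connect(db_path)
    try:
        load_pair(conn, "f1", f1a, f1b)
        load_pair(conn, "f2", f2a, f2b)
        query = "SELECT f1.key, f1.val, f2.val FROM f1 JOIN f2 ON f1.key = f2.key"
        with open(out_path, "w") as fout:
            # Iterating the cursor streams result rows instead of materialising them.
            for row in conn.execute(query):
                fout.write("\t".join(row) + "\n")
    finally:
        conn.close()
```

A call such as `pivot_to_tsv(f1a, f1b, f2a, f2b, 'pivoted.tsv')` would write the same three columns as the dictionary version. The trade-off is disk space and insert time in exchange for not holding both dictionaries in RAM.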