I need to read, parse, and merge two huge text files and write the result to a new file.
A third file is also used as input for this step.
Briefly, each of the two text files has about 100 million rows and three columns.
The script reads both files and, for each key, writes the matching values from both files to the new file.
If a key has no match in one of the input files, 0.0 is inserted for that file's two value columns in that row.
To make the parsing more efficient, I created another input file, a union of the first column (the key) of the two text files, as shown below.
I tested this code with small input files (10,000 rows) and it worked well. I started running it on the full datasets two days ago; unfortunately, it is still running.
How can I reduce the running time and parse these files efficiently?
1st_infile.txt
MARCH2_MARCH2 2.3 0.1
MARCH2_MARC2 -0.2 0
MARCH2_MARCH5 -0.3 0.3
MARCH2_MARCH6 -1.4 0
MARCH2_MARCH7 0.1 0
MARCH2_SEPT2 -1.0 0
MARCH2_SEPT4 0.8 0
2nd_infile.txt
MARCH2_MARCH2 2.2 0
MARCH2_MARCH2.1 0.2 0
MARCH2_MARCH3 -0.4 0
MARCH2_MARCH5 -0.3 0
MARCH2_MARCH6 -0.6 0
MARCH2_MARCH7 1.2 0
MARCH2_SEPT2 0.2 0
union_file.txt
MARCH2_MARCH2
MARCH2_MARCH2.1
MARCH2_MARC2
MARCH2_MARCH5
MARCH2_MARCH6
MARCH2_MARCH7
MARCH2_SEPT2
MARCH2_SEPT4
MARCH2_MARCH3
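For reference, here is a minimal sketch of how a union of the keys could be built. It is not necessarily how the file above was produced: the helper name `union_keys` is hypothetical, and it assumes that keeping keys in order of first appearance is acceptable, which may not match the ordering in the listing above.

```python
def union_keys(*key_lists):
    """Return every key from all inputs once, in order of first appearance."""
    seen = set()      # keys already emitted
    ordered = []      # result, preserving first-seen order
    for keys in key_lists:
        for key in keys:
            if key not in seen:
                seen.add(key)
                ordered.append(key)
    return ordered


# Tiny illustration using a few keys from the sample files.
first = ["MARCH2_MARCH2", "MARCH2_MARC2", "MARCH2_SEPT4"]
second = ["MARCH2_MARCH2", "MARCH2_MARCH3"]
print(union_keys(first, second))
```

Keys that appear in both inputs are emitted only once, so each key produces exactly one row in the output.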
Outfile.txt
MARCH2_MARCH2 2.3 0.1 2.2 0
MARCH2_MARCH2.1 0.0 0.0 0.2 0
MARCH2_MARC2 -0.2 0 0.0 0.0
MARCH2_MARCH5 -0.3 0.3 -0.3 0
MARCH2_MARCH6 -1.4 0 -0.6 0
MARCH2_MARCH7 0.1 0 1.2 0
MARCH2_SEPT2 -1.0 0 0.2 0
MARCH2_SEPT4 0.8 0 0.0 0.0
MARCH2_MARCH3 0.0 0.0 -0.4 0
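The matching rule behind the expected output can be sketched on a few in-memory rows. This is only an illustration under stated assumptions: `merge_row` and `MISSING` are hypothetical names, values are kept as strings exactly as read, and columns are joined with single spaces rather than padded.

```python
MISSING = ("0.0", "0.0")  # placeholder pair for a key absent from one file


def merge_row(name, first, second):
    """Build one output row: key, two values from file 1, two from file 2."""
    v1 = first.get(name, MISSING)
    v2 = second.get(name, MISSING)
    return " ".join((name,) + tuple(v1) + tuple(v2))


# A few rows taken from the sample files above.
first = {"MARCH2_MARCH2": ("2.3", "0.1"), "MARCH2_SEPT4": ("0.8", "0")}
second = {"MARCH2_MARCH2": ("2.2", "0"), "MARCH2_MARCH3": ("-0.4", "0")}

for name in ["MARCH2_MARCH2", "MARCH2_SEPT4", "MARCH2_MARCH3"]:
    print(merge_row(name, first, second))
```

A key found in both files keeps all four values; a key found in only one file gets 0.0 0.0 in the other file's columns.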
Python.py
def load(filename):
    """Read a 'name value1 value2' file into a dict keyed by name."""
    ret = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            try:
                name, value1, value2 = line.split()
            except ValueError:
                # Wrong number of fields: report the line and move on.
                print('Skip invalid line {}:{} {!r}'.format(filename, lineno, line))
                continue
            ret[name] = value1, value2
    return ret

a, b = load('1st_infile.txt'), load('2nd_infile.txt')
with open('union_file.txt') as f:
    with open('Outfile.txt', 'w') as fout:
        for line in f:
            name = line.strip()
            # Keys missing from one file get 0.0 for both of its columns.
            fout.write('{0:<20} {1[0]:>5} {1[1]:>5} {2[0]:>5} {2[1]:>5}\n'.format(
                name,
                a.get(name, ('0.0', '0.0')),
                b.get(name, ('0.0', '0.0'))
            ))