
I need to read, parse, and merge two huge text files and then create a new file.
There is also one extra file that is used for this parsing.
Briefly, each of the two text files has about 100 million rows and three columns.
First, I read the two files and write the two matched values from each into the new file.
If a key has no match in one of the input files, 0.0 is inserted into that row instead.
To make this parsing more efficient, I made another input file that is the union of the first column (the key) of the two text files, as shown below.
I tested this code with small input files (10,000 rows) and it worked well. I started running it on the real, huge datasets two days ago, and unfortunately it is still running.
How can I reduce the running time and parse the files efficiently?

1st_infile.txt

MARCH2_MARCH2   2.3 0.1
MARCH2_MARC2    -0.2     0
MARCH2_MARCH5   -0.3    0.3
MARCH2_MARCH6   -1.4    0
MARCH2_MARCH7   0.1 0
MARCH2_SEPT2    -1.0    0
MARCH2_SEPT4    0.8 0

2nd_infile.txt

MARCH2_MARCH2    2.2    0
MARCH2_MARCH2.1  0.2    0
MARCH2_MARCH3   -0.4    0
MARCH2_MARCH5   -0.3    0
MARCH2_MARCH6   -0.6    0
MARCH2_MARCH7    1.2    0
MARCH2_SEPT2     0.2    0

union_file.txt

MARCH2_MARCH2   
MARCH2_MARCH2.1
MARCH2_MARC2
MARCH2_MARCH5   
MARCH2_MARCH6   
MARCH2_MARCH7
MARCH2_SEPT2    
MARCH2_SEPT4
MARCH2_MARCH3

Outfile.txt

MARCH2_MARCH2   2.3   0.1   2.2     0
MARCH2_MARCH2.1     0.0   0.0   0.2     0
MARCH2_MARC2       -0.2   0     0.0     0.0
MARCH2_MARCH5      -0.3   0.3   -0.3    0
MARCH2_MARCH6      -1.4   0     -0.6    0
MARCH2_MARCH7       0.1   0     1.2     0
MARCH2_SEPT2       -1.0   0     0.2     0
MARCH2_SEPT4        0.8   0     0.0     0.0
MARCH2_MARCH3       0.0   0.0  -0.4     0

Python.py

def load(filename):
    ret = {}
    with open(filename) as f:
        for lineno, line in enumerate(f, 1):
            try:
                name, value1, value2 = line.split()
            except ValueError:
                print('Skipping invalid line {}:{}: {!r}'.format(filename, lineno, line))
                continue
            ret[name] = value1, value2
    return ret

a, b = load('1st_infile.txt'), load('2nd_infile.txt')

with open('union_file.txt') as f, open('Outfile.txt', 'w') as fout:
    for line in f:
        name = line.strip()
        fout.write('{0:<20} {1[0]:>5} {1[1]:>5} {2[0]:>5} {2[1]:>5}\n'.format(
            name,
            a.get(name, ('0.0', '0.0')),
            b.get(name, ('0.0', '0.0'))
        ))

2 Answers


You should use a streaming (lazy) read instead of reading the entire file into memory at once.

You can find a working example of the Lazy reader here: Lazy Method for Reading Big File in Python?
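For reference, a minimal sketch of the chunk-reading generator pattern that question describes might look like this (the chunk size and file name are placeholders, not part of this answer):

def read_in_chunks(file_object, chunk_size=1024 * 1024):
    # yield successive fixed-size blocks instead of loading the whole file
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

with open('1st_infile.txt') as f:
    for chunk in read_in_chunks(f):
        pass  # process each chunk here

Note that for line-oriented input like yours, simply iterating with for line in f: already streams the file line by line without loading it all into memory.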


Since 1st_infile.txt and 2nd_infile.txt are highly related, why not parse the two files together and use a single result_dict to store all the info instead of two dicts? The script would look something like this:

result_dict = {}
with open('1st_infile.txt') as f1, open('2nd_infile.txt') as f2:
    line1 = f1.readline()
    line2 = f2.readline()
    while line1 or line2:
        if line1:
            # values from the first file fill columns 0 and 1
            name1, val11, val12 = line1.split()
            entry = result_dict.setdefault(name1, [0.0] * 4)
            entry[0], entry[1] = float(val11), float(val12)
            line1 = f1.readline()
        if line2:
            # values from the second file fill columns 2 and 3
            name2, val21, val22 = line2.split()
            entry = result_dict.setdefault(name2, [0.0] * 4)
            entry[2], entry[3] = float(val21), float(val22)
            line2 = f2.readline()

note: this is just a brief illustration of the idea; error handling and writing the output file are omitted.
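To complete the picture, here is a hedged sketch of how the output file could then be written from result_dict; the column widths and the use of the union file for ordering are assumptions based on the question, not part of the original answer:

# walk the union file so the output keeps its order, defaulting to four zeros
with open('union_file.txt') as f, open('Outfile.txt', 'w') as fout:
    for line in f:
        name = line.strip()
        vals = result_dict.get(name, [0.0] * 4)
        fout.write('{:<20} {:>5} {:>5} {:>5} {:>5}\n'.format(name, *vals))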

There are several ways to read lines from a large file:

1. for line in file: do_something(line)
2. file.next(), using the file's iterator directly (not recommended)
3. file.readline(), or file.readlines(sizehint), which reads roughly sizehint bytes' worth of whole lines at a time

You can also read raw bytes from the file, just as in the Lazy Method mentioned above.
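As a concrete example of option 1, with the file name taken from the question and the per-line processing left as a placeholder:

# iterating the file object streams it line by line; nothing is loaded up front
with open('1st_infile.txt') as f:
    for line in f:
        fields = line.split()  # process each row as it is read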
