I have a number of text files, say 50, that I need to read into a massive dataframe. At the moment, I am using the following steps.
- Read every file and check what the labels are. The information I need is often contained in the first few lines. The same labels just repeat for the rest of the file, with different types of data listed against them each time.
- Create a dataframe with those labels.
- Read the file again and fill the dataframe with values.
- Concatenate that dataframe with a master dataframe.
This works reasonably well for files around 100 KB in size (a few minutes), but at 50 MB it takes hours, which is not practical.
How can I optimise my code? In particular -
- How can I identify which functions are taking the most time, so I know what to optimise? Is it the reading of the file? Is it the writing to the dataframe? Where is my program spending its time? (See the profiling sketch after this list.)
- Should I consider multithreading or multiprocessing? (A rough multiprocessing sketch follows the list as well.)
- Can I improve the algorithm?
- Perhaps read the entire file into memory in one go, rather than line by line,
- Parse the data in chunks or as a whole file, rather than line by line,
- Assign data to the dataframe in chunks or in one go, rather than row by row (a sketch of this whole-file approach is at the end of the question).
- Is there anything else that I can do to make my code execute faster?
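For the profiling question, I know of cProfile but I'm not sure whether it (or something like line_profiler) is the right tool here. This is roughly what I was planning to try, assuming I wrap the whole parsing routine in a main() function (main() is just a placeholder name):

import cProfile
import pstats

cProfile.run('main()', 'parse_stats')           # profile one full run of the parser
stats = pstats.Stats('parse_stats')
stats.sort_stats('cumulative').print_stats(20)  # show the 20 most expensive calls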
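And for the multithreading/multiprocessing question, this is the rough shape I had in mind if it turns out to be worth it: one process per file, with the per-file work pulled out into a function (parse_file here is just a placeholder for the per-file logic in the MWE below):

import multiprocessing
import pandas as pd

def parse_file(path):
    # placeholder: would contain the per-file parsing from the MWE below
    # and return that file's dataframe
    return pd.DataFrame()

if __name__ == '__main__':
    paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
    pool = multiprocessing.Pool(processes=4)  # roughly one worker per core
    frames = pool.map(parse_file, paths)      # parse the files in parallel
    pool.close()
    pool.join()
    df = pd.concat(frames).sort_index()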
Here is some example code. My own code is a little more complex, because the real text files are messier: I have to use about 10 regular expressions and multiple while loops to read the data in and allocate it to the right location in the right array. To keep the MWE simple, I haven't used repeating labels in the input files either, so it looks like I'm reading each file twice for no reason. I hope that makes sense!
import re
import pandas as pd

df = pd.DataFrame()
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]
reg_ex = re.compile('^(.+) (.+)\n')

# show what the raw files look like
for path in paths:
    with open(path, 'r') as file_obj:
        print file_obj.readlines()
# ['a 1\n', 'b 2\n', 'end']
# ['c 3\n', 'd 4\n', 'end']

# read all files once to determine what indices are available
indices = []
for path in paths:
    index = []
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                index.append(match.group(1))  # collect the label from each matching line
            except AttributeError:
                # non-matching lines (e.g. 'end') produce no match object
                pass
    indices.append(index)
# read the files again and put the data into a master dataframe
for path, index in zip(paths, indices):
    subset_df = pd.DataFrame(index=index, columns=["Number"])
    with open(path, 'r') as file_obj:
        line = True
        while line:
            try:
                line = file_obj.readline()
                match = reg_ex.match(line)
                # write the value against its label, one row at a time
                subset_df.loc[[match.group(1)]] = match.group(2)
            except AttributeError:
                pass
    # concatenate this file's dataframe onto the master dataframe
    df = pd.concat([df, subset_df]).sort_index()
print df
# Number
# a 1
# b 2
# c 3
# d 4
My input files:
test1.txt
a 1
b 2
end
test2.txt
c 3
d 4
end
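For what it's worth, this is the kind of whole-file version I was imagining for the MWE above: read each file in one go, pull out all (label, value) pairs with a single findall, and build each dataframe in one step instead of assigning row by row. I've had to change the regex slightly (MULTILINE with $ instead of a trailing \n) so it works against the whole file at once, and I'm not sure this generalises to my real files with their repeating labels:

import re
import pandas as pd

reg_ex = re.compile('^(.+) (.+)$', re.MULTILINE)
paths = ["../gitignore/test1.txt", "../gitignore/test2.txt"]

frames = []
for path in paths:
    with open(path, 'r') as file_obj:
        text = file_obj.read()        # whole file in one go
    pairs = reg_ex.findall(text)      # list of (label, value) tuples
    labels = [label for label, value in pairs]
    values = [value for label, value in pairs]
    frames.append(pd.DataFrame({"Number": values}, index=labels))

df = pd.concat(frames).sort_index()
print df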