I am parsing a massive text file (~10M lines) line by line with a regex to filter and clean up what I need.
Each matched.groupdict()
returns {'col1': '...', 'col2': '...', 'col3': '...'}
which I would like to collect into a DataFrame. Just like a database table, each entry would have its own index.
Over the past few days, I have done a lot of research on SO, in the pandas.DataFrame docs, and in a Coursera course on DataFrames, and nothing has worked. Most solutions suggest collecting each groupdict()
into a list and then creating the DataFrame in one go, but that takes too much memory and I need something more incremental.
What should I do?
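For context, the list-then-construct approach most answers suggest looks roughly like this (the pattern and sample lines here are illustrative stand-ins, since the real groups are elided in my code below):

```python
import re
import pandas as pd

# Illustrative pattern; the real capture groups are elided in the question.
pattern = re.compile(r"(?P<col1>\w+),(?P<col2>\w+),(?P<col3>\w+)")

lines = ["a,b,c", "not a match", "d,e,f"]

rows = []                      # collect plain dicts first...
for line in lines:
    matched = pattern.search(line)
    if matched:
        rows.append(matched.groupdict())

data = pd.DataFrame(rows)      # ...then build the DataFrame once
```

This builds the frame in a single pass at the end, but the intermediate list of dicts is what blows up memory on a 10M-line file.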
import re
import pandas as pd

pattern = re.compile("(?P<col1>...)(?P<col2>...)(?P<col3>...)")
data = pd.DataFrame()

with open("massive.txt", 'r') as massive:
    for line in massive:
        matched = pattern.search(line)
        if matched:
            data.append(matched.groupdict(), ignore_index=True)
data
Empty DataFrame
Columns: []
Index: []
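The empty result above is because DataFrame.append never mutated data in place: it returned a new DataFrame, which the loop discards (and the method was deprecated in pandas 1.4 and removed in 2.0). A memory-conscious sketch, assuming an illustrative pattern and an in-memory stand-in for the real file, is to accumulate dicts in small batches and concatenate the per-batch frames:

```python
import re
from io import StringIO

import pandas as pd

# Illustrative pattern; the real capture groups are elided in the question.
pattern = re.compile(r"(?P<col1>\w+),(?P<col2>\w+),(?P<col3>\w+)")

# Stand-in for open("massive.txt") so the sketch is self-contained.
massive = StringIO("a,b,c\nnoise\nd,e,f\ng,h,i\n")

CHUNK = 2  # rows per batch; something like 100_000 suits a real 10M-line file
frames, rows = [], []
for line in massive:
    matched = pattern.search(line)
    if matched:
        rows.append(matched.groupdict())
    if len(rows) >= CHUNK:
        frames.append(pd.DataFrame(rows))  # flush the batch to a compact frame
        rows = []
if rows:                                   # flush any leftover partial batch
    frames.append(pd.DataFrame(rows))

data = pd.concat(frames, ignore_index=True)
```

The peak cost of the heavy Python-object list is bounded by CHUNK, since each batch is converted to a compact columnar frame before the next one starts; ignore_index=True gives every row its own sequential index, as with a database table.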