I am parsing a massive text file (~10M lines) line by line with a regex to filter and clean up what I need.

Each matched.groupdict() returns {'col1': '...', 'col2': '...', 'col3': '...'}, which I would like to collect into a DataFrame. Just like in a database, each entry would have its own index.

Over the past few days, I did tons of research on SO, the pandas.DataFrame docs, and Coursera material on DataFrames, and nothing worked. Most solutions suggest building a list of my groupdict() results and then creating a DataFrame from it, but that takes too much memory and I need it to be more dynamic.

What should I do?

pattern = re.compile("(?P<col1>...)(?P<col2>...)(?P<col3>...)")
data = pd.DataFrame()
with open("massive.txt", 'r') as massive:
    for line in massive:
        matched = pattern.search(line)
        if(matched):
            data.append(matched.groupdict(), ignore_index=True)

data
Empty DataFrame
Columns: []
Index: []
Ken
  • `append` is not an inplace operation for DataFrames, so you need to reassign, i.e. `data = data.append(...)`. – root Mar 30 '17 at 17:58
  • So, did you look at the [documentation for `DataFrame.append`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html) ? Because it quite clearly states "Append rows of other to the end of this frame, **returning a new object**." As a rule of thumb, though, you can pretty much assume no `pandas` methods act (by default) in-place. – juanpa.arrivillaga Mar 30 '17 at 17:59
  • oh awkward. :D silly of me, I totally forgot to reassign. Thanks root and juanpa-arrivillaga :D – Ken Mar 30 '17 at 18:04

1 Answer

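Silly me: `append` returns a new DataFrame instead of modifying the existing one, so the result has to be assigned back.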

...
data = data.append(matched.groupdict(), ignore_index=True)
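
For anyone reading this later: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, and calling it once per row copies the whole frame every time. Below is a rough sketch of one alternative, not a definitive implementation: collect matches in bounded chunks, build one DataFrame per chunk, and join them with `pd.concat`. The pattern, the helper names `frames_from_file` and `load`, and the `chunk_size` default are all made up for illustration, since the real sub-patterns are elided in the question.

```python
import re

import pandas as pd

# Placeholder pattern (an assumption): the question elides the real
# sub-patterns, so this one just matches three comma-separated words.
PATTERN = re.compile(r"(?P<col1>\w+),(?P<col2>\w+),(?P<col3>\w+)")


def frames_from_file(path, pattern=PATTERN, chunk_size=100_000):
    """Yield DataFrames holding at most chunk_size matched rows each."""
    rows = []
    with open(path, "r") as massive:
        for line in massive:
            matched = pattern.search(line)
            if matched:
                rows.append(matched.groupdict())
                if len(rows) >= chunk_size:
                    yield pd.DataFrame(rows)
                    rows = []  # start a fresh chunk
    if rows:
        # Emit whatever is left over after the last full chunk.
        yield pd.DataFrame(rows)


def load(path, pattern=PATTERN, chunk_size=100_000):
    """Concatenate all chunks into a single DataFrame with a fresh index."""
    frames = list(frames_from_file(path, pattern, chunk_size))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

If the combined result still does not fit in memory, iterating over `frames_from_file(...)` directly and processing or writing out each chunk keeps only one chunk's worth of rows alive at a time.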
Ken