
For my graduate thesis I need to build a dataset of poker actions to test models with. I wrote a function that reads a text file with information about a hand and returns a list, which I append to a pandas DataFrame.

I have about 1500 files, and each of them contains 1500–3000 hands that need to be passed to this function, so my main script looks something like this:

import os
os.chdir("C:/Users/jctda/OneDrive/Documentos/TCC/Programa")

import pandas as pd
from datagen import DataGenerator, EmptyLine
from poker.room.pokerstars import PokerStarsHandHistory
from functions import FindFold, GetFiles, GetShowers
#IMPORT DATAGEN HERE

database = pd.DataFrame()

files = GetFiles('hand_texts')
for hand_text in files:
    text=open('hand_texts/' + hand_text)
    b=text.read()
    hands=b.split("\n\n\n\n\n")
    text.close()

    for i in range(1,len(hands)):

        try:

            hh = PokerStarsHandHistory(unicode(hands[i]))
            hh.parse()
            fold = FindFold(hh)

            if fold == 'showdown':
                for shower in GetShowers(hh):
                    database = database.append(DataGenerator(hh,shower,hand_text,i))
                    print('Success in parsing iteration ' + str(i) + ' from file ' + hand_text)

        except:

            print('PARSER ERROR ON ITERATION [[' + str(i) + ']] FROM FILE [[' + hand_text + ']]')
            database = database.append(EmptyLine(hand_text,i))




database.to_csv('database2.csv') 

The problem is that after a few hours of running it becomes very slow. The first file takes about 20 seconds, but each file takes longer than the last, and after 8 hours of running they start to take more than an hour each. I just started learning Python for this project, so I'm probably making a big mistake somewhere that causes it to take much longer than needed, but I can't find it.

Another thing that's been bugging me is that it consumes less than 1 GB of RAM while running on a machine with 16 GB. I thought about trying to force it to use more memory, but apparently there isn't a memory limit in Python, so I guess it's just bad code.

Can someone help me figure out what to do?

Fino

2 Answers


As described here, do not append to a DataFrame inside a loop, as it is very inefficient. Rather, do something like this:

files = GetFiles('hand_texts')

database = []
for hand_text in files:
    # as a side note, a with block closes the file for you:
    with open('hand_texts/' + hand_text) as text:
        b=text.read()

    hands=b.split("\n\n\n\n\n")

    for i in range(1,len(hands)):
        try:
            hh = PokerStarsHandHistory(unicode(hands[i]))
            hh.parse()
            fold = FindFold(hh)

            if fold == 'showdown':
                for shower in GetShowers(hh): 
                    database.append(DataGenerator(hh,shower,hand_text,i))
                    print('Success in parsing iteration ' + str(i) + ' from file ' + hand_text)

        except:
            print('PARSER ERROR ON ITERATION [[' + str(i) + ']] FROM FILE [[' + hand_text + ']]')
            database.append(EmptyLine(hand_text,i))

pd.DataFrame(database).to_csv('database2.csv')
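
To see why this matters in practice, here is a minimal, self-contained timing sketch (toy rows, nothing from the poker parser) comparing row-by-row appends against building the frame once:

import time
import pandas as pd

rows = [{'hand': i, 'action': 'call'} for i in range(2000)]

# quadratic: every .append() copies the whole frame built so far
start = time.time()
df = pd.DataFrame()
for row in rows:
    df = df.append(row, ignore_index=True)  # note: removed in pandas 2.0; newer code uses pd.concat
print('row-by-row append: %.2fs' % (time.time() - start))

# linear: collect plain rows and build the frame once at the end
start = time.time()
df = pd.DataFrame(rows)
print('build once:        %.2fs' % (time.time() - start))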
    Thanks a lot! But apparently with lists you just need to use `database.append()` instead of `database = database.append()` as the second one returns `None` – Fino Mar 03 '18 at 00:42
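
To make the distinction in that comment concrete, a minimal illustration:

import pandas as pd

df = pd.DataFrame({'a': [1]})
df = df.append({'a': 2}, ignore_index=True)  # DataFrame.append returns a new frame, so rebind

rows = [1]
rows.append(2)           # list.append mutates in place and returns None
# rows = rows.append(3)  # wrong: this would rebind rows to None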

Someone correct me if I'm wrong, but I believe appending to a dataframe copies the entire dataframe into a new object, which is why each append takes longer as the dataframe grows. Appending to a file, on the other hand, does not involve re-reading the whole file each time. Try this:

with open('database2.csv', 'a') as file: # 'a' opens the file in append mode
    file.write(relevant_data)

This will also automatically close the file at the end of the indented block.
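
As a sketch of what that could look like here, assuming each parsed hand reduces to a flat list of fields (I haven't seen DataGenerator, so parse_rows below is a hypothetical stand-in for your parsing loop), the csv module can flush each row to disk as it is produced:

import csv

def parse_rows():
    # hypothetical stand-in for the parsing loop; yields one flat row at a time
    yield ['hand_1', 'call']
    yield ['hand_2', 'fold']

with open('database2.csv', 'a', newline='') as f:  # 'a' opens in append mode
    writer = csv.writer(f)
    for row in parse_rows():
        writer.writerow(row)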

Also, it sounds like you think using more RAM automatically makes your program faster. That's not true. You can often trade higher RAM usage for a faster run time, but the same block of code on the same machine will take roughly the same amount of time and RAM on every run.

ChootsMagoots