
I am extracting data to a CSV file using Python; the dataset is over 1 million records. There appear to be memory issues with my script, because after a painstaking 5 hours and roughly 190k records written, the script's process gets killed.

Here is my terminal output:

(.venv)[cv1@mdecv01 maidea]$ python common_scripts/script_tests/ben-test-extract.py BEN
Generating CSV file. Please wait ...
Preparing to write file: BEN-data-20170731.csv
Killed
(.venv)[cv1@mdecv01 maidea]$

Is there a way I can extract this data with proper memory management?

Here is my script:

Edward Okech

2 Answers


You are not taking advantage of select_related or prefetch_related. If you do not use these two methods, you will end up performing an extra database query every time you access a related field (ForeignKey, ManyToManyField):

for beneficiary in Beneficiary.objects.all():
    if beneficiary.is_active:
        household = beneficiary.household  # one extra query per beneficiary
        if len(beneficiary.enrolments) > 0 and len(beneficiary.interventions) > 1:
            ...

It should be something like this:

for beneficiary in Beneficiary.objects.select_related(
    'household'  # joined into the main query
).prefetch_related(
    'enrolments',  # each relation fetched once for all rows
    'interventions'
):
    if beneficiary.is_active:
        household = beneficiary.household  # no extra query, already joined
        # .all() reads from the prefetch cache instead of hitting the database
        if len(beneficiary.enrolments.all()) > 0 and len(beneficiary.interventions.all()) > 1:
            ...
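
As a quick sanity check, you can count the queries each version issues. A sketch using Django's test utility CaptureQueriesContext, which also works outside the test runner:

from django.db import connection
from django.test.utils import CaptureQueriesContext

# Without select_related/prefetch_related the count grows with the number
# of beneficiaries; with them it should stay small and constant.
with CaptureQueriesContext(connection) as ctx:
    for beneficiary in Beneficiary.objects.select_related(
        'household'
    ).prefetch_related('enrolments', 'interventions'):
        household = beneficiary.household
        enrolment_count = len(beneficiary.enrolments.all())
print(len(ctx.captured_queries))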
Iain Shelvington
  • Filter in the queryset instead of pulling all the data, for example .filter(is_active=True); you can also filter by a count, for example .annotate(interventions_count=Count('interventions')).filter(interventions_count__gt=1)
  • Pull the data in iterations with offset and limit, for example [0:100], rather than pulling it all at once (smaller memory consumption); see the sketch after this list
  • Make use of select_related and prefetch_related to pre-select the related tables you need
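
Putting those points together, a minimal sketch of a batched extraction (the batch size, file name, and row fields are assumptions, not taken from the original script):

import csv

from django.db.models import Count

BATCH_SIZE = 1000  # assumed chunk size; tune to the available memory

# Push the is_active and count checks into the database instead of Python.
# distinct=True avoids inflated counts when combining two Count() joins.
queryset = (
    Beneficiary.objects
    .filter(is_active=True)
    .annotate(
        enrolments_count=Count('enrolments', distinct=True),
        interventions_count=Count('interventions', distinct=True),
    )
    .filter(enrolments_count__gt=0, interventions_count__gt=1)
    .select_related('household')
    .order_by('pk')  # a stable order is required for LIMIT/OFFSET paging
)

with open('BEN-data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    start = 0
    while True:
        # Slicing translates to LIMIT/OFFSET, so only one batch of rows
        # is held in memory at a time
        batch = list(queryset[start:start + BATCH_SIZE])
        if not batch:
            break
        for beneficiary in batch:
            # Hypothetical columns; replace with the real fields
            writer.writerow([beneficiary.pk, str(beneficiary.household)])
        start += BATCH_SIZE

On databases where large OFFSET values get slow, keyset pagination (filtering with pk__gt=last_seen_pk and taking the first BATCH_SIZE rows) is a common alternative.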
iklinac