
I am trying to work on an Excel sheet with fewer than 50k rows. What I am trying to do is: using a particular column, get all unique values, and then, for each unique value, collect all rows that contain that value and put them in this format:

[{
"unique_field_value": [Array containing row data that match the unique value as dictionaries]
},]

The thing is, when I test with fewer rows (around 1000) all goes well. As the number of rows grows, memory usage also increases until it can't grow any more and my PC freezes. So, is there something I am not doing right with pandas? Here are the details of my platform:

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS"
NAME="Ubuntu"
VERSION="16.04.3 LTS (Xenial Xerus)"
ID_LIKE=debian
VERSION_ID="16.04"

Here is the code I am running in a Jupyter Notebook:

import pandas as pd
import simplejson
import datetime

def datetime_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Type not Known")

path = "/home/misachi/Downloads/new members/my_file.xls"
df = pd.read_excel(path, index_col=None, skiprows=[0])
df = df.dropna(thresh=5)
df2 = df.drop_duplicates(subset=['corporate'])

schemes = df2['corporate'].values

result_list = []
result_dict = {}

for count, name in enumerate(schemes):
    inner_dict = {}
    col_val = schemes[count]
    foo = df['corporate'] == col_val
    data = df[foo].to_json(orient='records', date_format='iso')
    result_dict[name] = simplejson.loads(data)
    result_list.append(result_dict)
#     print(result_list)
#     if count == 3:
#         break

dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)

with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
    json_f.write(dumped)

EDIT

Here is the expected sample output:

[{
    "TABBY MEMORIAL CATHEDRAL": [{
        "corp_id": 8494,
        "smart": null,
        "copay": null,
        "corporate": "TABBY MEMORIAL CATHEDRAL",
        "category": "CAT A",
        "member_names": "Brian Maombi",
        "member_no": "84984",
        "start_date": "2017-03-01T00:00:00.000Z",
        "end_date": "2018-02-28T00:00:00.000Z",
        "outpatient": "OUTPATIENT"
    }, {
        "corp_id": 8494,
        "smart": null,
        "copay": null,
        "corporate": "TABBY MEMORIAL CATHEDRAL",
        "category": "CAT A",
        "member_names": "Omula Peter",
        "member_no": "4784984",
        "start_date": "2017-03-01T00:00:00.000Z",
        "end_date": "2018-02-28T00:00:00.000Z",
        "outpatient": "OUTPATIENT"
    }],
    "CHECKIFY KENYA LTD": [{
        "corp_id": 7489,
        "smart": "SMART",
        "copay": null,
        "corporate": "CHECKIFY KENYA LTD",
        "category": "CAT A",
        "member_names": "BENARD KONYI",
        "member_no": "ABB/8439",
        "start_date": "2017-08-01T00:00:00.000Z",
        "end_date": "2018-07-31T00:00:00.000Z",
        "outpatient": "OUTPATIENT"
    }, {
        "corp_id": 7489,
        "smart": "SMART",
        "copay": null,
        "corporate": "CHECKIFY KENYA LTD",
        "category": "CAT A",
        "member_names": "KEVIN WACHAI",
        "member_no": "ABB/67484",
        "start_date": "2017-08-01T00:00:00.000Z",
        "end_date": "2018-07-31T00:00:00.000Z",
        "outpatient": "OUTPATIENT"
    }]
}]

The complete, cleaner code is:

import os
import pandas as pd
import simplejson
import datetime


def datetime_handler(x):
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Unknown type")


def work_on_data(filename):
    if not os.path.isfile(filename):
        raise IOError
    df = pd.read_excel(filename, index_col=None, skiprows=[0])
    df = df.dropna(thresh=5)

    result_list = [{n: g.to_dict('records')} for n, g in df.groupby('corporate')]

    dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)
    return dumped
dumped = work_on_data('/home/misachi/Downloads/new members/my_file.xls')
with open('/home/misachi/Downloads/new members/members/folder/insurance.json', 'w') as json_f:
    json_f.write(dumped)
Misachi

2 Answers


Get the dictionary with

result_dict = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}]
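
If the goal is the JSON file from the question, here is a minimal follow-up sketch (assuming the same df and the datetime_handler defined in the question; the output filename is a placeholder):

import simplejson

result_list = [{n: g.to_dict('records') for n, g in df.groupby('corporate')}]
# ignore_nan=True turns NaN into null; datetime_handler renders Timestamps as ISO strings
dumped = simplejson.dumps(result_list, ignore_nan=True, default=datetime_handler)
with open('insurance.json', 'w') as json_f:  # placeholder path
    json_f.write(dumped)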
piRSquared
  • This works; it's faster, more efficient, and even cleaner. But it does not return the data in the specified format, i.e. [{key: val}] where key is the unique field name and val is a list of dicts containing the data for all rows that have the same unique field value. – Misachi Sep 06 '17 at 21:16
  • I suggest you produce an actual example and show the desired output such that I don't have to guess what you are attempting to do. – piRSquared Sep 06 '17 at 21:19
  • According to your edit, you have a length one list with its first and only element equal to the same dictionary I gave you. The only modification I made was to put my previous solution inside square brackets. – piRSquared Sep 06 '17 at 21:51
  • Do you know the reason why my initial code was so slow? – Misachi Sep 06 '17 at 22:34
  • You were looping through each unique value and splitting the dataframe each time. Also, in every iteration you dumped to JSON only to read it again. – piRSquared Sep 06 '17 at 22:36
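
To make the comment above concrete, here is a hedged sketch (assuming the df and imports from the question) contrasting the two access patterns; the loop rescans the corporate column and round-trips through JSON for every unique value, while groupby splits the frame in a single pass:

# per-value filtering: one full boolean scan of df per unique value,
# plus a JSON dump/parse round trip on every iteration
for name in df['corporate'].drop_duplicates():
    subset = df[df['corporate'] == name]
    records = simplejson.loads(subset.to_json(orient='records', date_format='iso'))

# single groupby pass: the frame is split once and rows stay as Python objects
records_by_corporate = {name: group.to_dict('records') for name, group in df.groupby('corporate')}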

Specify the chunksize=10000 parameter with read_excel() and loop through the file until you reach the end of the data. This will help you manage memory when working with large files. If you have multiple sheets to manage, follow this example:

    for chunk in pd.read_excel(path, index_col=None, skiprows=[0], chunksize=10000):
        df = chunk.dropna(thresh=5)
        df2 = df.drop_duplicates(subset=['corporate'])
        # rest of your code
Shijo