
I have a large dataset that I have to convert to .csv format; it consists of 29 columns and 1M+ lines. I noticed that as the dataframe grows, appending rows to it gets more and more time-consuming. I wonder if there is a faster way to do this, so I'm sharing the relevant snippet from my code.

Any recommendations are welcome.


import time

import requests
from pandas import DataFrame, concat, json_normalize  # json_normalize is top-level in pandas >= 1.0

df = DataFrame()

for startID in range(0, 100000, 1000):
    s1 = time.time()
    tempdf = DataFrame()
    url = f'https://******/products?startId={startID}&size=1000'

    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})
    jsonList = r.json()  # a list of dicts, one per product

    normalized = json_normalize(jsonList)  # flattens the list of dicts into a DataFrame
    print(startID / 1000)  # status indicator
    for _, series in normalized.iterrows():  # iterrows yields (index, Series) tuples
        offers = series['offers']
        # drop by label: the columns= keyword belongs to DataFrame.drop and does nothing on a Series
        series = series.drop('offers')

        for offer in offers:
            n = json_normalize(offer).squeeze()  # squeeze() casts the one-row DataFrame into a Series
            concatenated = concat([series, n]).to_frame().transpose()
            tempdf = tempdf.append(concatenated, ignore_index=True)

    del normalized
    df = df.append(tempdf)
    f1 = time.time()
    print(f1 - s1, ' seconds')

df.to_csv('out.csv')
Erdal Dogan
  • Welcome to Stack Overflow :) What is your code trying to append? Show us some data and the expected output; it's time-consuming to try to understand your entire code, and it's easier to help when you provide good, comprehensive examples. – Mohit Motwani Jul 12 '19 at 05:56
  • They're kind of confidential, so I cannot share any info. Let's say they contain information about prices of certain products from various e-commerce websites; it's basically a comparison tool. Each line in the .csv file (row in the DataFrame) contains info such as product id, merchant, seller, price, delivery info, etc. – Erdal Dogan Jul 12 '19 at 05:59
  • You don't have to show the exact data, just a few examples similar to your data set. Anything dummy that represents your problem. – Mohit Motwani Jul 12 '19 at 06:00
  • Anyway, instead of making a dataframe over the entire loop, I would suggest building a list of dictionaries, with each dictionary representing a row and the keys being the features. Then build the dataframe from this list. – Mohit Motwani Jul 12 '19 at 06:05
  • Also look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html – Mohit Motwani Jul 12 '19 at 06:05
  • @MohitMotwani thanks, that's some solid advice, I'll definitely try it. This screenshot is the most I can share, but maybe it gives an idea: https://ibb.co/2PcFtqD – Erdal Dogan Jul 12 '19 at 06:14

1 Answer


As Mohit Motwani suggested, the fastest way is to collect the data into a list of dictionaries and then load it all into a data frame at once. Below are some speed measurements comparing the approaches:

import pandas as pd
import numpy as np
import time
import random

end_value = 10000

Measurement for creating a list of dictionaries and loading it all into a data frame at the end:

start_time = time.time()
dictionary_list = []
for _ in range(end_value):
    dictionary_data = {k: random.random() for k in range(30)}  # one row: 30 keys become 30 columns
    dictionary_list.append(dictionary_data)

df_final = pd.DataFrame.from_dict(dictionary_list)  # single DataFrame construction at the end

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 0.090153 seconds
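
(pd.DataFrame(dictionary_list) builds the same frame here; the DataFrame constructor accepts a list of dicts directly, with the keys becoming the columns.)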

Measurement for appending data frames to a list and concatenating them into one data frame at the end:

start_time = time.time()
appended_data = []
for _ in range(end_value):
    data = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))  # 30 columns, all named 'A'
    appended_data.append(data)  # appending to a plain list is cheap

appended_data = pd.concat(appended_data, axis=0)  # one concat at the end

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 4.183921 seconds
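
Even though this also defers the combining to a single concat, it is still far slower than the dictionary version above: constructing 10000 one-row DataFrames carries a fixed per-frame overhead that plain dicts avoid.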

Measurement for appending data frames one by one:

start_time = time.time()
df_final = pd.DataFrame()
for _ in range(end_value):
    df = pd.DataFrame(np.random.randint(0, 100, size=(1, 30)), columns=list('A'*30))
    df_final = df_final.append(df)  # returns a copy of the whole accumulated frame each time

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 11.085888 seconds
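
The gap widens as end_value grows: DataFrame.append returns a copy of the entire accumulated frame on every call, so the loop does quadratic work overall. (append was later deprecated in pandas 1.4 and removed in 2.0, so collecting into a list and calling pd.concat once is also the forward-compatible choice.)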

Measurement for inserting rows one at a time with loc:

start_time = time.time()
df = pd.DataFrame(columns=list('A'*30))  # note: all 30 columns share the name 'A'
for i in range(end_value):
    df.loc[i] = list(np.random.randint(0, 100, size=30))  # adding a new label grows the frame row by row

end_time = time.time()
print('Execution time = %.6f seconds' % (end_time-start_time))

Execution time = 21.029176 seconds
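
Applied to the loop in the question, the fastest approach means building one plain dict per offer and constructing the frame once at the end. A minimal sketch under the question's setup (same endpoint, headers, and offers field; merging the product and offer dicts this way assumes their keys don't clash, and pd.json_normalize, available since pandas 1.0, flattens any nesting left in them):

import requests
import pandas as pd

rows = []  # plain dicts; appending to a list is cheap

for startID in range(0, 100000, 1000):
    url = f'https://******/products?startId={startID}&size=1000'
    r = requests.get(url, headers={'****-Token': 'xxxxxx', 'Merchant-Id': '****'})

    for product in r.json():            # one dict per product
        offers = product.pop('offers')  # detach the nested offers list
        for offer in offers:
            rows.append({**product, **offer})  # one merged dict per output row

df = pd.json_normalize(rows)  # single frame construction at the very end
df.to_csv('out.csv')

This replaces one DataFrame.append per offer (a million-plus whole-frame copies over the full dataset) with a single construction, which is exactly the difference between the first and third measurements above.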

Zaraki Kenpachi