
I am trying to build a dataframe with two columns from 50 csv files, each having about 5000 rows and around 15 columns. When I tried to run it without using the concat function, it used up a lot of memory and I got a kill error. Now I am chunking the files and then concatenating the chunks. The only problem is that when I concat the chunks, the header is kept for each chunk, and when I print head() for the df, it gives me only the head rows of the last chunk. Also, is there any other way to make my code run even faster? I have read that using the concat function in a for loop makes it slower. My code goes like this:

import os
import csv
import urllib.request as urllib
import datetime as dt
import pandas as pd
import pandas_datareader.data as web
import nsepy as nse

def saveNiftySymbols():
    url = "https://www.nseindia.com/content/indices/ind_nifty50list.csv"
    # pretend to be a chrome 47 browser on a windows 10 machine
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
    req = urllib.Request(url, headers=headers)
    # open the url
    x = urllib.urlopen(req)
    sourceCode = x.read().decode('utf-8') 

    cr = csv.DictReader(sourceCode.splitlines())
    l = [row['Symbol'] for row in cr]
    return l

def symbolToPath(symbol, path='/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/'):
    return os.path.join(path,"{}.csv".format(str(symbol)))

def combinedNifty(l):
    mainDf=pd.DataFrame()

    for symbol in l:
        chunks=pd.read_csv(symbolToPath(symbol),chunksize=10,usecols=['Date','Close'],index_col='Date',parse_dates=True)
        df=pd.DataFrame()
        for chunk in chunks:
            df=pd.concat([chunk])

            df.rename(columns={'Close':symbol}, inplace=True)


        if mainDf.empty:
            mainDf = df
        else:
            mainDf = mainDf.join(df, how='outer')

    print(mainDf.head())
    mainDf.to_csv('combinedNifty.csv')


combinedNifty(saveNiftySymbols())

1 Answer


The only problem is that when I concat the chunks, the header is kept for each chunk, and when I print head() for the df, it gives me only the head rows of the last chunk

This is because what is actually happening is that you end up with only the last chunk in your df. When you run this line:

df=pd.concat([chunk])

You are redefining df on every iteration, concatenating only the current chunk with nothing else. It is effectively as if you were doing:

for chunk in chunks:
    df = chunk

That is why, when you call head(), you only see that last chunk. You don't need a for loop to concatenate the chunks at all: pd.concat takes a list (or any iterable) of dataframes as its argument and concatenates them together, so you only need to do:

df = pd.concat(chunks)

This should also improve performance, because calling concat once on a list of many dataframes, instead of repeatedly doing something like df = pd.concat([df, chunk]) inside a for loop (which is probably what you intended in your original code), avoids copying the growing dataframe on every iteration.
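For example, applied to the loop from the question (same paths and columns as above), a minimal sketch of that one change could look like this:

def combinedNifty(l):
    mainDf = pd.DataFrame()

    for symbol in l:
        chunks = pd.read_csv(symbolToPath(symbol), chunksize=10,
                             usecols=['Date', 'Close'],
                             index_col='Date', parse_dates=True)
        # stitch all chunks of this file back together in a single call
        df = pd.concat(chunks)
        df.rename(columns={'Close': symbol}, inplace=True)

        if mainDf.empty:
            mainDf = df
        else:
            mainDf = mainDf.join(df, how='outer')

    print(mainDf.head())
    mainDf.to_csv('combinedNifty.csv')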

Steven Walton
  • Hi, I used your method, and it helped me get the full dataframe for each individual symbol in the list, instead of chunks. But when I use it with the above code to create a dataframe from 50 csv files and then join them, I am still getting the error Killed: 9. – Udit Hari Vashisht Oct 10 '17 at 16:29
  • Try the approach given [here](https://stackoverflow.com/questions/38089010/merge-a-list-of-pandas-dataframes) to use `pd.merge` instead of join, in order to merge all your dataframes in one go outside the for loop instead of calling df.join repeatedly. I'm not sure if join has the same issue as concat, but it might improve things. Make an empty list and append your `df` to that list on each `for symbol in l:` loop, then merge at the end. Also check [this method](https://stackoverflow.com/questions/17557074/memory-error-when-using-pandas-read-csv) for reducing memory errors when reading csv's. – Steven Walton Oct 10 '17 at 17:42
  • Also, I am not sure why you are using such small chunk sizes to process the csv. See [this post](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) on processing large csv's with chunks. The comments also describe the advantage of calling concat on a list instead of with a for loop. – Steven Walton Oct 10 '17 at 17:47
  • Actually, my original code gave me the error "Killed: 9" after running for hours, so I decided to chunk the csv files. But apparently the code is eating up memory at the time of joining the 50 dataframes. I have edited my code as below, but I am not able to concat the dataframes horizontally across the Date column: – Udit Hari Vashisht Oct 11 '17 at 02:54
  • def combinedNifty(l):
        mainDf=pd.DataFrame()
        for symbol in l:
            dflist=[]
            chunks=pd.read_csv(symbolToPath(symbol),chunksize=10,usecols=['Date','Close'],index_col='Date',parse_dates=True)
            df=pd.DataFrame()
            df=pd.concat(chunks)
            df.rename(columns={'Close':symbol}, inplace=True)
            del chunks
            dflist.append(df)
            #print(dflist)
        mainDf=pd.concat(dflist,axis=1,join='outer')
        print(mainDf.head())
        mainDf.to_csv('combinedNifty.csv')
    – Udit Hari Vashisht Oct 11 '17 at 02:54
  • Here are some things to try: use infer_datetime_format=True in read_csv; it can significantly improve the time to read csv's for large files. – Steven Walton Oct 11 '17 at 04:30
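Putting the suggestions from these comments together (one list of per-symbol dataframes, a single pd.concat per file, and one column-wise pd.concat at the end), a possible sketch, assuming the same paths and file layout as above, is:

def combinedNifty(l):
    dflist = []

    for symbol in l:
        # chunksize=1000 is just an illustrative value; the comments above note
        # that 10 is unnecessarily small for 5000-row files
        chunks = pd.read_csv(symbolToPath(symbol), chunksize=1000,
                             usecols=['Date', 'Close'],
                             index_col='Date', parse_dates=True,
                             infer_datetime_format=True)
        df = pd.concat(chunks)
        df.rename(columns={'Close': symbol}, inplace=True)
        dflist.append(df)

    # one column-wise concat of all 50 frames, aligned on the Date index
    mainDf = pd.concat(dflist, axis=1, join='outer')
    print(mainDf.head())
    mainDf.to_csv('combinedNifty.csv')

Appending to a plain Python list and concatenating once at the end should also avoid the repeated mainDf.join calls inside the loop, so the combined frame is only allocated once.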