Creation of formatted DataFrame and then adding data line by line

Question

I have a continuous stream of data coming in, so I want to define the DataFrame beforehand so that I don't have to tell pandas to format the data or set the index each time.

So I want to create a DataFrame like

df = pd.DataFrame(columns=["timestamp","stockname","price","volume"])

but I want to tell pandas that the index of the data should be timestamp and that its format would be

"%Y-%m-%d %H:%M:%S:%f"

and once this is set, I would read through the file and pass the data to the initialized DataFrame.

On every iteration of the loop I get the data in variables like these:

for line in filehandle:
    timestamp, stockname, price, volume = fetch(line)
    # here I want to update the "df"

This update would go on while I keep using a copy of df, let us say tempdf, to do re-sampling or any other task at any given point in time, because the original DataFrame df is getting updated continuously.

I'm playing devil's advocate here. *Why* do you want to do this? If this is a fast-paced production tool, do you want to keep appending to a pandas DataFrame in the first place? Second (almost contradicting my first point), pandas is fairly speedy. Changing a date format for records isn't too expensive from a machine standpoint, especially if it is one record at a time. – MattR, Dec 08 '17 at 18:11

Evan · Accepted Answer · 2017-12-08T19:32:47.697

1

import numpy as np
import pandas as pd
import datetime as dt
import time

# create an empty df, then make timestamp a datetime index
df = pd.DataFrame(columns=["timestamp","stockname","price","volume"], dtype = float)
df['timestamp'] = pd.to_datetime(df['timestamp'], format = "%Y-%m-%d %H:%M:%S:%f") # assign the result; to_datetime does not convert in place
df.set_index('timestamp', inplace = True)

for i in range(10): # for the purposes of functioning demo code
    time.sleep(0.01) # give jupyter notebook a moment
    timestamp = dt.datetime.now() # to be used as index
    df.loc[timestamp] = ['AAPL', np.random.randint(1000), np.random.randint(10)] # replace with your database read

tempdf = df.copy()

If you are reading a file or database continuously, you can replace the for: loop with what you described in your question. @MattR's questions should also be addressed; if you need to continuously log or update data, I am not sure if pandas is the best solution.
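One way to act on that concern, as a sketch rather than a definitive recipe: collect the parsed records in a plain Python list and build the DataFrame in one call whenever a snapshot is needed, so the timestamp format and the index are applied once per snapshot instead of once per row. The `fetch` function below is a hypothetical stand-in for the parser in the question (it assumes comma-separated fields), `snapshot` is a helper name invented here, and the sample lines are made up for illustration.

```python
import pandas as pd

TS_FORMAT = "%Y-%m-%d %H:%M:%S:%f"  # format given in the question
COLUMNS = ["timestamp", "stockname", "price", "volume"]


def fetch(line):
    # Hypothetical parser: assumes "timestamp,stockname,price,volume" per line.
    timestamp, stockname, price, volume = line.strip().split(",")
    return timestamp, stockname, float(price), float(volume)


def snapshot(rows):
    # Build a DataFrame from the rows collected so far and index it by timestamp.
    frame = pd.DataFrame(rows, columns=COLUMNS)
    frame["timestamp"] = pd.to_datetime(frame["timestamp"], format=TS_FORMAT)
    return frame.set_index("timestamp")


# Made-up sample lines standing in for the file handle in the question.
lines = [
    "2017-12-08 18:11:00:000001,AAPL,169.5,100",
    "2017-12-08 18:11:00:500000,MSFT,84.2,250",
    "2017-12-08 18:11:01:000000,AAPL,169.7,50",
]

rows = []
for line in lines:
    rows.append(fetch(line))  # appending to a list is cheap

tempdf = snapshot(rows)  # build a snapshot whenever one is needed
print(tempdf.dtypes)
```

Appending to a list avoids enlarging a DataFrame row by row, and because the numbers are cast inside `fetch`, price and volume arrive as floats without any dtype set at initialization.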

edited Dec 08 '17 at 19:32

answered Dec 08 '17 at 18:22

Evan


How do I ensure that the DataFrame treats price and volume as float and not object? How do I take care of that during initialization? – Tahseen Dec 08 '17 at 18:33
pandas is generally decent at inferring data types. You can set specific columns to be numeric with `pd.to_numeric`, if you want to be explicit. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html https://stackoverflow.com/questions/15891038/change-data-type-of-columns-in-pandas – Evan Dec 08 '17 at 18:51
Can you edit the answer to show how you would initialize price and volume as float? I will mark it as the answer then – Tahseen Dec 08 '17 at 19:12
I specified each column as `dtype = float`. The index is changed to datetime one line later, and `stockname` is coerced to `object` when a string (e.g., `'AAPL'`) is assigned to it. – Evan Dec 08 '17 at 19:34
Well, that didn't work. The issue was that the data passed in the loop was strings, and pandas somehow didn't convert it. So I cast the values in the loop itself and it started working fine. So there is no need for dtype in the initialization either – Tahseen Dec 08 '17 at 19:38
One issue I am observing is that df.loc overwrites rows because the timestamp is not unique: there are multiple stocks, and different stocks can share the same timestamp – Tahseen Dec 10 '17 at 05:44
I think MattR's question should be addressed. In Pandas, generally, you should not modify something you are iterating over. – Evan Dec 11 '17 at 18:47
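For the two follow-up issues raised in these comments (values arriving as strings, and several stocks sharing one timestamp), a possible approach is sketched below. It assumes the records have already been collected into a list of tuples; the sample records are made up. `pd.to_numeric` does the explicit conversion, and indexing on the pair (timestamp, stockname) keeps rows for different stocks from colliding.

```python
import pandas as pd

# Made-up records: two stocks share the first timestamp, and numbers arrive as strings.
rows = [
    ("2017-12-08 18:11:00:000001", "AAPL", "169.5", "100"),
    ("2017-12-08 18:11:00:000001", "MSFT", "84.2", "250"),
    ("2017-12-08 18:11:01:000000", "AAPL", "169.7", "50"),
]

df = pd.DataFrame(rows, columns=["timestamp", "stockname", "price", "volume"])
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S:%f")

# Convert the string columns to numeric dtypes explicitly.
df["price"] = pd.to_numeric(df["price"])
df["volume"] = pd.to_numeric(df["volume"])

# Index on (timestamp, stockname) so equal timestamps for different stocks coexist.
df = df.set_index(["timestamp", "stockname"]).sort_index()

# Work on a copy: select one stock, then resample its prices per second.
tempdf = df.copy()
aapl_per_second = tempdf.xs("AAPL", level="stockname")["price"].resample("1s").last()
print(aapl_per_second)
```

Selecting a single stock with `xs` leaves a plain datetime index behind, which is what `resample` needs.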
