I have a 20 GB trades.csv
file. It has two columns (trade_time and price). And the csv file contains 650 million rows.
Sample Data
https://gist.github.com/dsstex/bc885ed04a6de98afc7102ed08b78608
Pandas DataFrame
df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
I would like to check whether price is up or down based on percentage. If the price hits up_value (e.g. +1%) first, the result is 1. If the price hits down_value (e.g. -0.5%) first, then the result is 0. I need to do this for all 650 million rows.
At the moment, the dataframe has only two columns. trade_time(index), price
. I would like to have a new column named "result".
import pandas as pd
df = pd.read_csv("trades.csv", index_col=0, parse_dates=True)
df["result"] = None
print(df)
up_percentage = 0.2
down_percentage = 0.1
def calc_value_from_percentage(percentage, whole):
return (percentage / 100) * whole
def set_result(index):
up_value = 0
down_value = 0
for _, current_row_price, _ in df.loc[index:].itertuples():
if up_value == 0 or down_value == 0:
up_delta = calc_value_from_percentage(up_percentage, current_row_price)
down_delta = calc_value_from_percentage(down_percentage, current_row_price)
up_value = current_row_price + up_delta
down_value = current_row_price - down_delta
if current_row_price > up_value:
df.loc[index, "result"] = 1
return
if current_row_price < down_value:
df.loc[index, "result"] = 0
return
for ind, _, _ in df.itertuples():
set_result(ind)
df.to_csv("results.csv", index=True, header=True)
print(df)
Results
https://gist.github.com/dsstex/fe3759beedbf9c46ace382a7eef3d12c
Note: Due to insufficient data, most of the bottom rows in the above file has the value "None" for "result". So the value is blank.
At the moment, I'm using pandas itertuples()
to process the file. I would like to have a vectorized solution since I have a huge file.
Note: Last month I asked this question. This is a follow-up question. And it is related to this answer. In that answer, the author is using a fixed size up_value/down_value
of 200
. But I'm after a percentage based vectorized solution.
Any help is greatly appreciated.
Thanks