I want to read in different CSV files and, while doing that, convert the time column to seconds since the epoch. However, the `date_parser` seems to get applied to more than the specified column, and my data is butchered.

Here is my code and some example data:
```python
import pandas as pd

TIME_STG = "Datum (UTC)"
PRICE_STG = "Day Ahead Auktion (DE-LU)"
PRICE_FILE = "booking_algorythm/data/energy-charts_Stromproduktion_und_Börsenstrompreise_in_Deutschland_2021.csv"


def get_data(file, *columns):
    types_dict = {}
    parse_dates_list = []
    for column in columns:
        if column == TIME_STG:
            types_dict.update({column: str})
            parse_dates_list.append(column)
        else:
            types_dict.update({column: float})
    data = pd.read_csv(file,
                       sep=",",
                       usecols=columns,
                       dtype=types_dict,
                       parse_dates=parse_dates_list,
                       date_parser=lambda col: pd.to_datetime(col, utc=True)).astype(int) // 10**9
    data_np = data.to_numpy()
    return data_np


def get_price_vector():
    data = get_data(PRICE_FILE, PRICE_STG, TIME_STG)
    return data


def main():
    vector = get_price_vector()
    print(vector)


if __name__ == "__main__":
    main()
```
Example data:

```
"Datum (UTC)","Kernenergie","Nicht Erneuerbar","Erneuerbar","Last","Day Ahead Auktion (DE-LU)"
2021-01-01T00:00:00.000Z,8151.12,35141.305,11491.71,43516.88,48.19
2021-01-01T00:15:00.000Z,8147.209,34875.902,11331.25,42998.01,48.19
2021-01-01T00:30:00.000Z,8154.02,34825.553,11179.375,42494.2,48.19
2021-01-01T00:45:00.000Z,8152.82,34889.11,11072.377,42320.32,48.19
2021-01-01T01:00:00.000Z,8156.53,34922.123,10955.356,41598.39,44.68
2021-01-01T01:15:00.000Z,8161.601,34856.2,10867.771,41214.32,44.68
2021-01-01T01:30:00.000Z,8158.36,35073.1,10789.049,40966.95,44.68
2021-01-01T01:45:00.000Z,8151.3,34972.501,10657.209,40664.63,44.68
2021-01-01T02:00:00.000Z,8145.589,34911.037,10637.605,40502.78,42.92
```
This produces unexpected output; I had expected the price column to contain actual data like 44.68. The `.astype(int) // 10**9` is a fast conversion to seconds since the epoch that I found here on Stack Overflow.
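
For reference, here is a minimal sketch of that conversion in isolation, applied to a single tz-aware datetime Series parsed the same way my `date_parser` lambda does (the sample timestamp is taken from the data above); on its own it behaves as expected:

```python
import pandas as pd

# Minimal sketch: the epoch conversion from the question, applied to one
# tz-aware datetime Series instead of a whole DataFrame.
s = pd.to_datetime(pd.Series(["2021-01-01T00:00:00.000Z"]), utc=True)

# astype("int64") yields nanoseconds since the epoch ("int64" is spelled
# out here; the code above uses plain int); integer division by 10**9
# reduces that to whole seconds.
seconds = s.astype("int64") // 10**9

print(seconds.iloc[0])  # 1609459200, i.e. 2021-01-01T00:00:00Z
```

The butchered values only show up in the full pipeline above.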