Pandas read_csv() parse_dates does not limit itself to the specied column - how to fix?

Question

I want to read in different CSV files and while doing that, convert the time column to seconds since the epoch. However, the date_parser gets applied to more then the specified column, and my data is butchered.

here is my code and some example data:

import pandas as pd


TIME_STG = "Datum (UTC)"
PRICE_STG = "Day Ahead Auktion (DE-LU)"
PRICE_FILE = "booking_algorythm/data/energy-charts_Stromproduktion_und_Börsenstrompreise_in_Deutschland_2021.csv"

def get_data(file, *columns):
    types_dict = {}
    parse_dates_list = []
    for column in columns:
        if column == TIME_STG:
            types_dict.update({column: str})
            parse_dates_list.append(column)
        else:
            types_dict.update({column: float})
    data = pd.read_csv(file,
                       sep=",",
                       usecols=columns,
                       dtype=types_dict,
                       parse_dates=parse_dates_list,
                       date_parser=lambda col: pd.to_datetime(col, utc=True)).astype(int) // 10**9
    data_np = data.to_numpy()
    return data_np

def get_price_vector():
    data = get_data(PRICE_FILE, PRICE_STG, TIME_STG)
    return data

def main():
    vector = get_price_vector()
    print(vector)

if __name__ == "__main__":
    main()

example data

"Datum (UTC)","Kernenergie","Nicht Erneuerbar","Erneuerbar","Last","Day Ahead Auktion (DE-LU)"
2021-01-01T00:00:00.000Z,8151.12,35141.305,11491.71,43516.88,48.19
2021-01-01T00:15:00.000Z,8147.209,34875.902,11331.25,42998.01,48.19
2021-01-01T00:30:00.000Z,8154.02,34825.553,11179.375,42494.2,48.19
2021-01-01T00:45:00.000Z,8152.82,34889.11,11072.377,42320.32,48.19
2021-01-01T01:00:00.000Z,8156.53,34922.123,10955.356,41598.39,44.68
2021-01-01T01:15:00.000Z,8161.601,34856.2,10867.771,41214.32,44.68
2021-01-01T01:30:00.000Z,8158.36,35073.1,10789.049,40966.95,44.68
2021-01-01T01:45:00.000Z,8151.3,34972.501,10657.209,40664.63,44.68
2021-01-01T02:00:00.000Z,8145.589,34911.037,10637.605,40502.78,42.92

and this is the unexpected output - I had expected the price column to be actual data like 44.68.

.astype(int) // 10**9 is a fast conversion to seconds since the epoch that I found here on StackOverflow.

FObersteiner · Answer 1 · 2023-03-12T10:01:37.110

Your example works fine for me if I convert to Unix time after import, i.e.

import pandas as pd
from io import StringIO

csvdata = """Datum (UTC),Kernenergie,Nicht Erneuerbar,Erneuerbar,Last,Day Ahead Auktion (DE-LU)
2021-01-01T00:00:00.000Z,8151.12,35141.305,11491.71,43516.88,48.19
2021-01-01T00:15:00.000Z,8147.209,34875.902,11331.25,42998.01,48.19
2021-01-01T00:30:00.000Z,8154.02,34825.553,11179.375,42494.2,48.19
2021-01-01T00:45:00.000Z,8152.82,34889.11,11072.377,42320.32,48.19
2021-01-01T01:00:00.000Z,8156.53,34922.123,10955.356,41598.39,44.68
2021-01-01T01:15:00.000Z,8161.601,34856.2,10867.771,41214.32,44.68
2021-01-01T01:30:00.000Z,8158.36,35073.1,10789.049,40966.95,44.68
2021-01-01T01:45:00.000Z,8151.3,34972.501,10657.209,40664.63,44.68
2021-01-01T02:00:00.000Z,8145.589,34911.037,10637.605,40502.78,42.92
"""

df = pd.read_csv(StringIO(csvdata), parse_dates=["Datum (UTC)"])
df["Timestamp[s]"] = df["Datum (UTC)"].astype(int) // 10**9

print(df.dtypes)
# Datum (UTC)                  datetime64[ns, UTC]
# Kernenergie                              float64
# Nicht Erneuerbar                         float64
# Erneuerbar                               float64
# Last                                     float64
# Day Ahead Auktion (DE-LU)                float64
# Timestamp[s]                               int64
# dtype: object

In general, you might even want to do the conversion to datetime after import, to get more control easier. Using vectorized methods instead of lambdas (= iteration on the Python side, unless covered by special optimizations) is also more efficient in my experience.

if you want to keep the fractional seconds, you can divide the Unix time nanoseconds with / 10**9, which results in a float. Besides, 1e9 is a float, so the result would always be a float, not matter if you use integer division //. See also Why does integer division yield a float instead of another integer.
if you'd like to re-arrange the order of the columns, you can do so as described here. Besides, if you want to select only certain columns when creating an output e.g. to csv, you can do so as described here.

How comes your column displays as type float64, if you cast it to int just before? — Andreas Schuldei, Mar 12 '23 at 07:28
With your stepwise parsing and conversion, I get the problem of the time becoming the last column. How can I retain the order of columns? — Andreas Schuldei, Mar 12 '23 at 07:30
@AndreasSchuldei I've added some notes, hope that clarifies. — FObersteiner, Mar 12 '23 at 09:36

Pandas read_csv() parse_dates does not limit itself to the specied column - how to fix?

1 Answers1