
I have a folder on my computer that contains ~8500 .csv files, each named after a stock ticker. Within each .csv file there is a 'timestamp' and a 'users_holding' column. I have the 'timestamp' column set up as a datetime index; its entries are hourly readings for each day, e.g. 2019-12-01 01:50, 2020-01-01 02:55... 2020-01-01 01:45, etc. Each of those timestamps has a corresponding integer representing the number of users holding at that time. I want to create a for loop that iterates through all of the .csv files and tallies up the total users holding across all .csv files at the latest time of each day, starting on February 1st, 2020 (2020-02-01) and running through the last day in each .csv file. The folder updates daily, so I can't hard-code an end date.

This is the for loop I have set up to load each ticker as a dataframe:

import glob

import pandas as pd

path = 'C:\\Users\\N****\\Desktop\\r******\\t**\\p*********\\'
all_files = glob.glob(path + '*.csv')  # path already ends with a separator

for filename in all_files:
    # note: df is overwritten on each iteration
    df = pd.read_csv(filename, header=0, parse_dates=['timestamp'],
                     index_col='timestamp')

If anyone could show me how to write the for loop that finds the latest entry for each date and tallies up that number for each day, that would be amazing.

Thank you!

nwhite1130

1 Answer


First, create a DataFrame with a datetime index (in one-hour steps):

import numpy as np
import pandas as pd

# hourly timestamps for January 2020, with three dummy ticker columns
idx = pd.date_range(start='2020-01-01', end='2020-01-31', freq='H')
data = np.arange(len(idx) * 3).reshape(len(idx), 3)
columns = ['ticker-1', 'ticker-2', 'ticker-3']
df = pd.DataFrame(data=data, index=idx, columns=columns)

print(df.head())

                     ticker-1  ticker-2  ticker-3
2020-01-01 00:00:00         0         1         2
2020-01-01 01:00:00         3         4         5
2020-01-01 02:00:00         6         7         8
2020-01-01 03:00:00         9        10        11
2020-01-01 04:00:00        12        13        14

Then, group by the index, keeping only the year-month-day and dropping hours-minutes-seconds. The aggregation function is .last():

result = (df.groupby(by=df.index.strftime('%Y-%m-%d'))
          [['ticker-1', 'ticker-2', 'ticker-3']]
          .last()
         )

print(result.head())

            ticker-1  ticker-2  ticker-3
2020-01-01        69        70        71
2020-01-02       141       142       143
2020-01-03       213       214       215
2020-01-04       285       286       287
2020-01-05       357       358       359
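
Since the question ultimately wants one total per day across all tickers, the per-ticker daily values can then be summed across columns; on the real data you would first slice the rows of interest, e.g. result.loc['2020-02-01':]:

daily_total = result.sum(axis=1)

print(daily_total.head())

2020-01-01     210
2020-01-02     426
2020-01-03     642
2020-01-04     858
2020-01-05    1074
dtype: int64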
jsmart
  • How would I go about creating one dataframe with all the tickers from my folder containing all the .csv files? – nwhite1130 Jul 29 '20 at 21:21
  • Here is a good SO post for converting multiple CSV files to one data frame: https://stackoverflow.com/a/36416258/13608599 – jsmart Jul 29 '20 at 22:00
  • That article is very helpful and would work, except I don't think I have enough RAM, and I've tried using chunksize and that doesn't work (I have 8500 csv files with 19000 rows each). I think I have to use a for loop to iterate through each csv file and add up the users_holding for each day from 02-01-2020 until the last day in the spreadsheet (df.index.max()). Do you have any idea how I could set up that for loop? I'm pretty new to coding – nwhite1130 Jul 30 '20 at 15:05
  • Current: concatenate all csv files into one data frame, then find the daily last value for each date. Proposed: read one csv file; find its daily last values; append that summary to a list; repeat for each csv file; and concatenate the summaries after processing all csv files. In other words: reduce as you go (see the sketch after this thread). If this is helpful, please consider an up-vote. – jsmart Jul 30 '20 at 15:27
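
A minimal sketch of that reduce-as-you-go approach, assuming each file has the 'timestamp'/'users_holding' layout described in the question (the folder path below is a placeholder for your own):

import glob

import pandas as pd

path = 'C:\\path\\to\\ticker\\csvs\\'  # placeholder: substitute your folder
all_files = glob.glob(path + '*.csv')

daily_values = []  # one per-day summary Series per csv file
for filename in all_files:
    df = (pd.read_csv(filename, header=0, parse_dates=['timestamp'],
                      index_col='timestamp')
            .sort_index())
    # keep 2020-02-01 onward, then take the last reading of each calendar day
    s = df.loc['2020-02-01':, 'users_holding']
    daily_values.append(s.groupby(s.index.date).last())

# align the per-file daily series on date and sum across files;
# a day missing from one file simply contributes nothing on that day
total = pd.concat(daily_values, axis=1).sum(axis=1)
print(total.tail())

Because each file is reduced to at most one row per day before anything is accumulated, memory use scales with the number of days rather than with 8500 files × 19000 raw rows.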