1

I have a large CSV file that has a number of columns, along with a Timestampcolumn. The file is sorted in ascending order of timestamps.

I want to read these values into a pandas dataframe, but only within a specific timestamp range, say from 4:00pm to 5:00pm on a particular date.

What would be the most efficient way to do this? The file is very large, and I want to avoid going through it entirely, since it is sorted I want it to cut off when the upper timestamp is reached, and also want to avoid manual splitting by ,, is there a way to do this using pandas read_csv function?

import pandas as pd
df = pd.read_csv('csvfile', header='infer', ???)

3 Answers3

2

There is not a built in way to use read_csv() to stop at a timestamp. Find the docs here:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

There is a way to process a very large file by breaking it up into chunks:

http://pandas.pydata.org/pandas-docs/stable/io.html#io-chunking

One strategy would be to find the value closest to a timestamp would be argsort(). Get the location of the item closest to your start time and get the location of the item closest to your end time and then get the values between them. I’m not going to code it for you, but I think that approach should work.

Check this out: How do I find the closest values in a Pandas series to an input number?

Aaron Paul
  • 117
  • 3
  • 14
2

There isn't a way to do this right from read_csv() as it will try to load the entire dataset into memory. I'd suggest you iterate over the rows of your input data using standard context manager for csv and .readline() method (be sure to tell csv the proper delimiters..it can do the splitting for you (see example in python doc))

OR...You can apply pd.read_csv() to to each row (tell it that your data doesn't contain headers by setting header=None) or just bite the bullet and use .split(',').

Either way, you can use .to_datetime() to parse the date entry from the appropriate column then use .append() to add it to a data frame as you go.

This can all be in a while loop that tests for when the currently loaded date exceeds your range.

0

The simplest way to do this

df = pd.read_csv(f, parse_dates=['datecolumn', 'datecolumn1'], infer_datetime_format=True)
Flair
  • 2,609
  • 1
  • 29
  • 41
Vetri
  • 51
  • 1
  • 1
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Nov 19 '21 at 10:49