I am using Python to load a CSV file for processing.

The directory contains many files and is constantly updated. When I run the script, I want it to select only the most recently updated csv file in the directory for processing.

I have code that seems to do this, but it doesn't do it reliably. Often it takes the last csv file, as intended, but sometimes it takes an older file and skips the most recent. I think it's probably sorting alpha-numerically instead of by created/updated time.

Can someone please suggest a change to the code that would make it work more reliably?

Current code:

# Import python modules
import pandas as pd
import os

#Identify last csv file in directory
last_csv = sorted(list(filter(lambda x: '.csv' in x, os.listdir())))[-1]

#load csv into a pandas dataframe
df = pd.read_csv(last_csv, skip_blank_lines=False, header=[8], engine='python') 

I've seen Bash and Java versions in other threads, but is there a way to do it with Python?

This thread describes how to sort a list of files, but it works by parsing the date and time from the filename. I want to be able to find the latest file even if the updated time is not part of the filename.

Python combining all csv files in a directory and order by date time

Thanks all

  • https://stackoverflow.com/questions/237079/how-to-get-file-creation-modification-date-times-in-python – Daniel Lee Nov 15 '19 at 16:52
  • Thanks all. Both ggorlen and SpghttCd offered solutions that worked perfectly. ggorlen with one less import. – Pipercollins Nov 18 '19 at 17:06
  • Glad it worked out. I should also note that if you need the top `n` most recent, `sort` is great, but if you just want the top, it's overkill and less semantic to sort the entire thing just to take the last element; prefer `max`. – ggorlen Nov 18 '19 at 17:42
  • Just for the sake of interest - why do you think you need more imports using pathlib? Afaics it's only one import and it belongs to the standard library. – SpghttCd Nov 19 '19 at 00:33
  • OK, obviously I'm a noob and it's easy to get over my head, but one more import line takes that much more time, right? (For ggorlen's version, I'd already imported os.) I mean, not to overstate things, we're talking about things that really aren't important at my scale, and you've really got equally good approaches in practical terms to me. Many thanks. – Pipercollins Nov 19 '19 at 16:15

3 Answers

Look at your sorting attempt:

last_csv = sorted(list(filter(lambda x: '.csv' in x, os.listdir())))[-1]

This is quite clear (well, if you know ...): list the files in the directory, select only .csv files, make a list, and sort it. The .csv is a loud clue: you are working with the file names; your sort is by collating sequence ("alphabetical" order with more characters).

There is nothing in your code that obtains the file creation/update time. You need to fetch that with the directory listing, or iterate through your list of files and fetch the desired time stamp for each. Once you've done that, simply sort with the time stamp as the sort key.

These steps are documented well; I expect that you can research the file-time call and the sort-with-key materials.
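As a rough sketch of those steps (not necessarily how you'd want to structure it), the following iterates over the directory, keeps the CSV files, and sorts them with the modification time as the key. The function name csvs_oldest_to_newest is made up for illustration; os.path.getmtime is the standard-library call that returns the last-modification time.

```python
import os


def csvs_oldest_to_newest(directory="."):
    """Return paths of .csv files in `directory`, oldest modification first.

    Hypothetical helper for illustration; not part of the original code.
    """
    paths = []
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        # Keep regular files whose name ends in .csv (not merely contains it).
        if name.endswith(".csv") and os.path.isfile(full_path):
            paths.append(full_path)
    # Sort by last-modification time (seconds since the epoch), not by name.
    return sorted(paths, key=os.path.getmtime)


ordered = csvs_oldest_to_newest(".")
last_csv = ordered[-1] if ordered else None  # None when no CSV file is present
print(last_csv)
```

Note that this sorts by modification time. Creation time is platform-dependent and harder to get portably, so the sketch assumes "most recently updated" is what matters.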

Prune

Here's one approach:

import os

path = "."
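# Join with path so the isfile check also works when path is not the current directory
csvs = [x for x in os.listdir(path) if os.path.isfile(os.path.join(path, x)) and x.endswith(".csv")]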
most_recent = max(csvs, key=lambda x: os.stat(os.path.join(path, x)).st_mtime)

print(most_recent)

List the directory and keep only regular files whose names end in .csv. Then take the max of those files keyed on st_mtime, the last-modification time reported by os.stat. The result is the name of the most recently modified CSV file in the target directory.
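Two things to watch for: max raises ValueError if there are no CSV files, and most_recent is a bare file name rather than a full path. Here is a minimal sketch of a variant that handles both, assuming Python 3.6+ for the os.scandir context manager. The function name latest_csv is hypothetical.

```python
import os


def latest_csv(directory="."):
    """Return the full path of the most recently modified .csv file, or None.

    Hypothetical helper for illustration only.
    """
    with os.scandir(directory) as entries:
        # Keep regular files only; a directory named "x.csv" is skipped.
        candidates = [
            entry for entry in entries
            if entry.is_file() and entry.name.endswith(".csv")
        ]
        # default=None avoids a ValueError when there are no CSV files.
        newest = max(candidates, key=lambda e: e.stat().st_mtime, default=None)
    return newest.path if newest is not None else None


print(latest_csv("."))
```

The returned path already includes the directory, so it can be passed to pd.read_csv in place of last_csv from the question, after checking that it is not None.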

ggorlen

I think I'd use pathlib:

from pathlib import Path

fld = '.'
files = Path(fld).glob('*.csv')

latest = max(files, key=lambda f: f.stat().st_mtime)

Edit: changed sorted to max thanks to @ggorlen's completely correct comment on that.
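For completeness, if you ever need the n most recent files rather than just the latest one, sorting is the natural fit again. A minimal sketch, where the function name newest_csvs and the default n=3 are arbitrary choices for illustration:

```python
from pathlib import Path


def newest_csvs(folder=".", n=3):
    """Return up to n .csv paths from folder, most recently modified first.

    Hypothetical helper for illustration only.
    """
    files = [f for f in Path(folder).glob("*.csv") if f.is_file()]
    # reverse=True puts the largest (newest) modification time first.
    return sorted(files, key=lambda f: f.stat().st_mtime, reverse=True)[:n]


for f in newest_csvs(".", n=3):
    print(f)
```

With n=1 this returns a one-element list (or an empty list if nothing matches), so for a single file max is still the simpler choice.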

SpghttCd