
I am using the code below to get the JSON filenames in a directory.

import glob
jsonFiles = glob.glob(folderPath+"*.json")

Many new JSON files get created in the directory every second (say 100/s). Usually it works fine, but it gets stuck when the number of files is large (~150,000) and takes a long time (3-4 minutes) to retrieve the filenames. This might be because of the high incoming rate (not sure).

Is there any alternative approach to get the filenames EFFICIENTLY, using Python or a Linux command? Getting the oldest 1000 filenames would work too; I don't need all the filenames at once.

I came across the following shell command:

ls -Art | head -n 1000

Will it help? Does it list all the filenames first and then retrieve the 1000 oldest records? Thanks in advance.
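If it does help, this is roughly how I would call that pipeline from Python (just a sketch, assuming folderPath is the same directory as in the snippet above):

import subprocess

# Run the pipeline inside the watched directory; shell=True is used only
# so the one-liner pipeline can be passed as-is.
output = subprocess.check_output("ls -Art | head -n 1000",
                                 cwd=folderPath, shell=True)
oldestNames = output.decode().splitlines()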

  • How do you get the oldest N filenames without looking at all the files? – RoadRunner Jan 30 '20 at 13:37
  • Just to be sure: does that code alone take 3-4 minutes? Just the globbing, without also fetching file ages? – Kelly Bundy Jan 30 '20 at 13:40
  • Did you see this other question/answer https://stackoverflow.com/questions/8931099/quicker-to-os-walk-or-glob ? Even the slowest one takes about 3 seconds with about a million files, so I think there's something else going on. – ChatterOne Jan 30 '20 at 13:59
  • To reinforce what @RoadRunner said, it seems like you'd have to look at _all_ the files in order to determine which were the oldest, so that's the most likely area to be causing the slowdown. – martineau Jan 30 '20 at 14:12
  • @martineau I was thinking maybe Linux stores the files in some sorted order by default (maybe by name/size/creation timestamp/modified timestamp, etc.) and there might be a way to retrieve the top N filenames without looking at _all_ of them. – kubera kalyan Jan 31 '20 at 09:13
  • @ChatterOne The link really helped. I also experienced similar results and am planning to use os.listdir. Generally it retrieves 1M filenames in 2-3 seconds. Could the issue be heavy server load or something similar that slows the script down? I have noticed some unresponsive behaviour from the server at those moments. – kubera kalyan Jan 31 '20 at 09:20

1 Answer


I found scandir to be useful.

# Python 2.x (uses the third-party "scandir" backport package)
import scandir

ds = scandir.scandir('./files/')
fileNames = []
count = 0
for entry in ds:
    count += 1
    fileNames.append(entry.name)
    if count == 1000:
        break

# Python 3.x (scandir is built into the os module)
import os

ds = os.scandir('./files/')
...

This gives 1000 filenames from the directory without listing all of the filenames first. The order is arbitrary (whatever order the operating system returns the directory entries in), not sorted. If we don't break out of the loop, it will keep yielding filenames in that arbitrary order, and a filename, once returned, won't be repeated.
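If the oldest 1000 specifically are wanted (as in the question), every entry still has to be examined at least once to compare timestamps, but a heap avoids sorting the whole listing. A minimal sketch, assuming Python 3.6+ and that the entries of interest are regular files:

import heapq
import os

def oldest_names(path, n=1000):
    # Keep only the n entries with the smallest modification time seen so
    # far, instead of building and sorting the full file list.
    with os.scandir(path) as it:
        entries = heapq.nsmallest(n, it, key=lambda e: e.stat().st_mtime)
    return [e.name for e in entries]

print(oldest_names('./files/'))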