Solution
Note: You only need the list of files names. But what you are doing in the code you posted, is reading the contents of the files (which is not what you want)!
You could choose to use any of the following two methods. For the sake of reproducibility, I have created some dummy data and I tested the solution on Google Colab. I found that using pandas (Method-2) was somehow faster.

Common Code
import glob
# import pandas as pd
all_files = glob.glob(path + "/*.csv")
# I am deliberately using this for
# a small number of students to
# test the code.
num_students = 20 # 297444
Method-1: Simple Python Loop
- For
100,000
files, it took around 1min 29s on Google Colab.
- Run the following in a jupyter-notebook cell.
%%time
missing_files = []
for i in range(15):
student_file = f'u{i}.csv'
if f'{path}/{student_file}' not in all_files:
missing_files.append(student_file)
#print(f"Total missing: {len(missing_files)}")
#print(missing_files)
## Runtime
# CPU times: user 1min 29s, sys: 0 ns, total: 1min 29s
# Wall time: 1min 29s
Method-2: Process using Pandas library (faster)
- For
100,000
files, it took around 358 ms on Google Colab.
- Almost
250 times
FASTER than method-1.
- Run the following in a jupyter-notebook cell.
%%time
# import pandas as pd
existing_student_ids = (
pd.DataFrame({'Filename': all_files})
.Filename.str.extract(f'{path}/u(?P<StudentID>\d+)\.csv')
.astype(int)
.sort_values('StudentID')
.StudentID.to_list()
)
missing_student_ids = list(set(range(num_students)) - set(existing_student_ids))
# print(f"Total missing students: {len(missing_student_ids)}")
# print(f'missing_student_ids: {missing_student_ids}')
## Runtime
# CPU times: user 323 ms, sys: 31.1 ms, total: 354 ms
# Wall time: 358 ms
Dummy Data
Here I will define some dummy data for the purpose of making
the solution reproducible and easily testable.
I will skip the following student-ids (skip_student_ids
) and NOT create any .csv
file for them.
import os
NUM_STUDENTS = 20
## CREATE FILE NAMES
num_students = NUM_STUDENTS
skip_student_ids = [3, 8, 10, 13] ## --> we will skip these student-ids
skip_files = [f'u{i}.csv' for i in skip_student_ids]
all_files = [f'u{i}.csv' for i in range(num_students) if i not in skip_student_ids]
if num_students <= 20:
print(f'skip_files: {skip_files}')
print(f'all_files: {all_files}')
## CREATE FILES
path = 'test'
if not os.path.exists(path):
os.makedirs(path)
for filename in all_files:
with open(path + '/' + filename, 'w') as f:
student_id = str(filename).split(".")[0].replace('u', '')
content = f"""
Filename,StudentID
{filename},{student_id}
"""
f.write(content)
References
pandas.Series.str.extract
- Docs
Can I add message to the tqdm progressbar?