
I have a huge dataset of students, and each student has their own csv file. Dataset B has 297,444 csv files, and I want to know which student csv files are missing from that dataset.

For example, there is no u2.csv file present in the dataset. How can I check which of the csv files are missing, using pandas?

Here's the code I tried so far

import pandas as pd
import numpy as np
import glob

path = r'C:/Users/user1/Desktop/EDNET DATA/EdNet-KT4/KT4' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    # derive the user id (e.g. 'u1') from the file name
    df = pd.read_csv(filename, sep=',', index_col=None, header=0)
    df = df.assign(user_iD=filename.split("\\")[-1].split(".")[0])
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)
df = data.copy()

df.isnull().sum()

df.to_feather('KT4.ftr')
data1= pd.read_feather('KT4.ftr')
data1.head()


Haseeb Sultan
  • What have you tried so far? Show us your code! – Klaus D. Feb 15 '21 at 07:43
  • Add the existing code, how are you accessing the files. – MubtadaNaqvi Feb 15 '21 at 07:45
  • I have just concatenated all those csv files into a single feather file, and so far I have only checked which missing values (NaN values) are present inside the dataset. I'm not sure how to check which of the csv files are missing. – Haseeb Sultan Feb 15 '21 at 07:46
  • Try something easier first: get a list of all student names, then a list of all file names. Then loop through all student names and check whether each is in the other list. No need to read the files. – 576i Feb 15 '21 at 07:57
  • I cannot get a list of all student names; this is a public dataset available on GitHub, not my personal dataset. – Haseeb Sultan Feb 15 '21 at 07:59
  • In the loop where you are enumerating the file names, why don't you get the number value using `filename[1:]` and compare it with another variable, say `num`, and while they are not equal, increment `num` until they are equal and store the missing file in a list. Of course you have to sort the list of file names before entering the loop. – Volpe95 Feb 15 '21 at 08:01
  • 2
    I don't think you should use pandas for that. Just create a list of all the files in the directory, remove the 'u' prefix, and then find the missing numbers (you can create a new list of all the numbers from 1 to the max number in your original list), and then remove from the new list the numbers from the original one. – roishik Feb 15 '21 at 08:12

1 Answer


Solution

Note: You only need the list of file names. But the code you posted reads the contents of the files, which is not what you want!

You could use either of the following two methods. For the sake of reproducibility, I created some dummy data and tested the solution on Google Colab. I found that using pandas (Method-2) was considerably faster.


Common Code

import glob
import pandas as pd

path = 'test'  # folder containing the per-student csv files (see Dummy Data below)
all_files = glob.glob(path + "/*.csv")

# Deliberately using a small number
#   of students to test the code.
num_students = 20 # 297444

Method-1: Simple Python Loop

  • For 100,000 files, it took around 1min 29s on Google Colab.
  • Run the following in a jupyter-notebook cell.
%%time
missing_files = []

for i in range(num_students):
    student_file = f'u{i}.csv'
    if f'{path}/{student_file}' not in all_files:
        missing_files.append(student_file)

#print(f"Total missing: {len(missing_files)}")
#print(missing_files)

## Runtime
# CPU times: user 1min 29s, sys: 0 ns, total: 1min 29s
# Wall time: 1min 29s
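Method-1 is slow mainly because `x not in all_files` scans the whole list once per student. As a sketch (with a small hypothetical file list standing in for the `glob` result), converting the list to a set first makes each lookup O(1) on average:

```python
# Hypothetical file list standing in for glob.glob(path + "/*.csv"):
# u2.csv and u4.csv were never created.
path = 'test'
num_students = 6
all_files = [f'{path}/u{i}.csv' for i in (0, 1, 3, 5)]

# A set gives O(1) average-case membership tests, instead of a full
# list scan per student.
existing = set(all_files)

missing_files = [f'u{i}.csv' for i in range(num_students)
                 if f'{path}/u{i}.csv' not in existing]
# missing_files → ['u2.csv', 'u4.csv']
```

With 297,444 files this should bring the loop down from minutes to well under a second, though I have not timed it on the full dataset.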

Method-2: Process using Pandas library (faster)

  • For 100,000 files, it took around 358 ms on Google Colab.
  • Almost 250 times FASTER than method-1.
  • Run the following in a jupyter-notebook cell.
%%time
import pandas as pd

existing_student_ids = (
    pd.DataFrame({'Filename': all_files})
      .Filename.str.extract(rf'{path}/u(?P<StudentID>\d+)\.csv')
      .astype(int)
      .sort_values('StudentID')
      .StudentID.to_list()
)

missing_student_ids = list(set(range(num_students)) - set(existing_student_ids))

# print(f"Total missing students: {len(missing_student_ids)}")
# print(f'missing_student_ids: {missing_student_ids}')

## Runtime
# CPU times: user 323 ms, sys: 31.1 ms, total: 354 ms
# Wall time: 358 ms
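One caveat with Method-2: on Windows, `glob` can return backslash-separated paths, in which case the `{path}/u...` pattern may not match. A plain-Python sketch (hypothetical file names assumed) that uses `os.path.basename` avoids hard-coding the separator into the pattern:

```python
import os
import re

# Hypothetical glob results; os.path.basename strips the directory part,
# so the regex only has to match the bare 'u<number>.csv' file name.
all_files = ['test/u0.csv', 'test/u7.csv', 'test/u12.csv']
num_students = 20

existing_student_ids = sorted(
    int(m.group(1))
    for f in all_files
    if (m := re.fullmatch(r'u(\d+)\.csv', os.path.basename(f)))
)

missing_student_ids = sorted(set(range(num_students)) - set(existing_student_ids))
```

The `re.fullmatch` also silently skips any file in the folder that does not follow the `u<number>.csv` naming scheme, whereas `.astype(int)` in Method-2 would raise on the resulting NaN.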

Dummy Data

Here I define some dummy data to make the solution reproducible and easily testable.

I will skip the following student-ids (skip_student_ids) and NOT create any .csv file for them.

import os

NUM_STUDENTS = 20

## CREATE FILE NAMES
num_students = NUM_STUDENTS
skip_student_ids = [3, 8, 10, 13] ## --> we will skip these student-ids
skip_files = [f'u{i}.csv' for i in skip_student_ids]
all_files = [f'u{i}.csv' for i in range(num_students) if i not in skip_student_ids]

if num_students <= 20:
    print(f'skip_files: {skip_files}')
    print(f'all_files: {all_files}')

## CREATE FILES
path = 'test'
if not os.path.exists(path):
    os.makedirs(path)
for filename in all_files:
    student_id = str(filename).split(".")[0].replace('u', '')
    with open(path + '/' + filename, 'w') as f:
        # Write the header and one data row directly, so the file does not
        # pick up the indentation of a triple-quoted source string.
        f.write(f'Filename,StudentID\n(unknown),{student_id}\n')
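To sanity-check the whole setup end to end, the sketch below writes the same kind of dummy files into a throwaway `tempfile` directory (an assumption; the code above uses a fixed `test` folder) and then applies the Method-1 check:

```python
import glob
import os
import tempfile

# Throwaway folder standing in for the 'test' path used above.
path = tempfile.mkdtemp()
num_students = 20
skip_student_ids = [3, 8, 10, 13]

# Write a tiny csv for every student except the skipped ones.
for i in range(num_students):
    if i not in skip_student_ids:
        with open(os.path.join(path, f'u{i}.csv'), 'w') as f:
            f.write(f'Filename,StudentID\n(unknown),{i}\n')

# Method-1 style check: which expected file names are absent?
all_files = glob.glob(path + "/*.csv")
missing_files = [f'u{i}.csv' for i in range(num_students)
                 if os.path.join(path, f'u{i}.csv') not in all_files]
# missing_files → ['u3.csv', 'u8.csv', 'u10.csv', 'u13.csv']
```

Using `os.path.join` for both the writes and the comparison keeps the path separators consistent on any OS.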

References

  1. pandas.Series.str.extract - Docs

  2. Can I add message to the tqdm progressbar?

CypherX