I am trying to find an efficient way to split a list of file names (examples below) in an x:y ratio based on the file names. The file list was obtained using os.scandir (better performance than os.listdir; source: Python docs for scandir).
Example:
Files (extensions disregarded):
A_1, A_2, ..., A_10 (here A is the filename and 1 is the sample number of the file)
B_1, B_2, ..., B_10
and so on
Let's say the x:y ratio is 7:3. I would then like 70% of the file names (A_1..A_7, B_1..B_7) in one list and 30% (A_8..A_10, B_8..B_10) in another. The order within each list does not matter, meaning the first list could contain A_1, A_9, A_5, etc., as long as each filename is split 7 files in list 1 to 3 files in list 2.
Note that this directory is huge (~150k files) and the number of samples per filename varies: filename A may have 1000 files or only 5. There are also about 400 unique filenames.
My current solution should not be called a solution at all, as it defeats the purpose of an accurate ratio for each filename. It splits the list of fileObjects (basically: a name like A, a number like 1, the data within file A_1, and so on) as a whole in the x:y ratio, relying on the fact that os.scandir yields entries in arbitrary order.
# Split the whole list 70/30, ignoring which filename each entry belongs to
ratio_number = int(len(list_of_fileObjects) * 0.7)
list_70 = list_of_fileObjects[:ratio_number]
list_30 = list_of_fileObjects[ratio_number:]
My second approach, which would at least be a valid solution, was to create a separate list for each filename (which involves sorting the whole list of files) and split each of those lists in the ratio. I am looking for a more Pythonic/elegant solution to this problem. Any suggestions or help would be appreciated, especially considering the size of the data being dealt with.
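For reference, a minimal sketch of that per-filename split might look like the following. It assumes every name has the form `<name>_<sample number>` with the extension already stripped, and it groups names with a dictionary rather than sorting the whole list, which is a swap from the sorting I described. The function `split_by_ratio` and its parameters are names made up for illustration.

```python
from collections import defaultdict


def split_by_ratio(names, x=7, y=3):
    """Split names into two lists so that each filename group is divided x:y."""
    groups = defaultdict(list)
    for name in names:
        # Assumption: the filename is everything before the last underscore,
        # e.g. "A_10" -> "A". Names without an underscore all share the "" group.
        prefix, _, _ = name.rpartition("_")
        groups[prefix].append(name)

    first, second = [], []
    for members in groups.values():
        # Integer arithmetic avoids floating-point rounding at the cut point;
        # the remainder of an uneven group goes to the second list.
        cut = len(members) * x // (x + y)
        first.extend(members[:cut])
        second.extend(members[cut:])
    return first, second


names = [f"A_{i}" for i in range(1, 11)] + [f"B_{i}" for i in range(1, 6)]
list_70, list_30 = split_by_ratio(names, 7, 3)
```

This makes a single pass over the names plus one slice per group, so it should stay manageable at ~150k entries, but small groups cannot hit the ratio exactly (5 files split 3 to 2 here).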