Splitting a list of file names in a predefined ratio

Question

I am trying to find an efficient way to split a list of file names (examples below) in an x:y ratio based on the file names. The list was obtained using os.scandir (better performance than os.listdir; source: Python docs for scandir).

Example -

Files (extension disregarded)-

A_1,A_2,...A_10 (here A is filename and 1 is the sample number of the file)

B_1,B_2,...B_10

and so on

Let's say the x:y ratio is 7:3. I would like 70% of the file names (A_1..A_7, B_1..B_7) in one list and 30% (A_8..A_10, B_8..B_10) in another. The order within each list does not matter, meaning the first list could contain A_1, A_9, A_5 etc., as long as each filename is split 7 files in list 1 to 3 files in list 2.

Note that this directory is huge (~150k files) and the number of samples per filename varies, i.e. filename A may have 1000 files or only 5. There are about 400 unique filenames.

My current solution should not be called a solution at all, as it defeats the purpose of an accurate ratio for each filename. It splits the list of fileObjects (basically: a name like A, a number like 1, the data within file A_1, and so on) as a whole in the x:y ratio, relying on the fact that os.scandir yields entries in arbitrary order.

ratio_number = int(len(list_of_fileObjects) *.7)
list_70 = list_of_fileObjects[:ratio_number]
list_30 = list_of_fileObjects[ratio_number:]

My second approach, which would at least be a valid solution, was to create a separate list for each filename (this involves sorting the whole list of files), split each list in the ratio, and repeat for every filename. I am looking for a more pythonic/elegant solution to this problem. Any suggestions or help would be appreciated, especially considering the size of the data being dealt with.
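For reference, the second approach could be sketched roughly as follows. This is only an illustration under assumptions: the function name split_by_prefix and the sample names are made up, the prefix is taken to be everything before the first underscore, and a dictionary of lists is used for grouping so that no sort is needed.

```python
from collections import defaultdict


def split_by_prefix(filenames, ratio=0.7):
    """Group names by the text before the first underscore, then split each group."""
    groups = defaultdict(list)
    for name in filenames:
        prefix = name.split("_", 1)[0]
        groups[prefix].append(name)

    first, second = [], []
    for names in groups.values():
        # int() truncates, so small groups lean towards the second list.
        cut = int(len(names) * ratio)
        first.extend(names[:cut])
        second.extend(names[cut:])
    return first, second


# Hypothetical sample data: A_1..A_10 and B_1..B_10.
names = ["%s_%d" % (p, i) for p in "AB" for i in range(1, 11)]
list_70, list_30 = split_by_prefix(names)
print(len(list_70), len(list_30))
```

Grouping this way is a single pass over the list plus one slice per filename, but it does build an intermediate list for every filename.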

What would be helpful is to know why the downvotes? I am new to the forum and asking questions especially with some research and explaining approaches should be encouraged. This saddens me and what is worse is people who downvoted left no comments. — Shivansh Singh, Aug 18 '16 at 15:15

score 0 · Answer 1 · answered Aug 17 '16 at 23:49

If I understand the situation correctly, you're trying to partition the same proportion of each filename prefix's files. Your current method selects the correct proportion from the whole set of files, but it doesn't consider the different filename prefixes, so it may not get them in the correct proportion (though it will probably be somewhat close, most of the time).

Your second approach avoids that issue by first separating the filenames by prefix, then partitioning each sublist. But if you want a combined list with all the prefixes together, this approach may end up wasting time copying data around, since you have to separate out and then recombine the separate lists by prefix.

I think you can do what you want with a single loop over the filenames. You'll need to keep track of two data points for each filename prefix: The number of files with that prefix you've selected for the first sample and the total number of files with that prefix that you've seen.

ratio = 0.7
prefix_dict = {} # values are lists: [number_selected_for_first_list, total_number_seen]
first_sample = [] # gets a proportion of the files equal to ratio (for each prefix)
second_sample = [] # gets the rest of the files

for filename in list_of_files:
    prefix = filename.split("_", 1)[0]
    selected_seen = prefix_dict.setdefault(prefix, [0, 0])
    selected_seen[1] += 1

    if selected_seen[0] < round(ratio * selected_seen[1]):
        first_sample.append(filename)
        selected_seen[0] += 1
    else:
        second_sample.append(filename)

The only tricky part of this code is the use of dict.setdefault to fetch the selected_seen list. If the requested prefix doesn't yet exist in the dictionary, a new value ([0, 0]) is added to the dictionary under that key (and returned). The later code modifies the list in place.
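A small standalone illustration of that setdefault behaviour may help. The dictionary name counts and the key "A" are just placeholders for this example:

```python
counts = {}

# First call for "A": the key is missing, so [0, 0] is stored and returned.
pair = counts.setdefault("A", [0, 0])
pair[1] += 1  # mutates the list stored in the dict

# Second call: the key exists, so the stored list is returned
# and the default argument is ignored.
same_pair = counts.setdefault("A", [0, 0])
same_pair[1] += 1

print(counts)             # {'A': [0, 2]}
print(pair is same_pair)  # True
```

Because both calls return the same list object, updating it in place is enough; nothing has to be written back to the dictionary.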

Depending on how exactly you want to handle inexact proportions, you can change the if condition a bit. I put in a round call (which I think will partition most accurately), but the code would work OK without it (biasing the selection towards the second sample) or with selected_seen[0] <= int(ratio * selected_seen[1]) (biasing towards the first sample).

Note that whichever way you choose to round when partitioning each prefix, there's the possibility that the separate prefixes will all end up unbalanced in the same direction, making the overall samples unbalanced by more than you'd normally expect. For instance, if you had ten prefixes with ten files each (100 files total), a ratio of 0.75 would result in final sample lists of 80 and 20 files rather than 75 and 25. That happens because each of the prefixes gets partitioned 8 and 2 (7.5 rounds up). If every file had a unique prefix, you'd end up with everything in the first sample! If it's very important that the overall samples be the right sizes, you might need to fudge the sampling of the items a bit, based on the overall sample sizes.
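To make the imbalance concrete, here is the same loop wrapped in a function and run on made-up data with a ratio of 0.75 (a 75:25 split). The function name partition and the P/Q file names are assumptions for this sketch, and the counts in the comments assume Python 3's round().

```python
def partition(filenames, ratio):
    """Single-pass split that tracks [selected, seen] per prefix, using round()."""
    prefix_dict = {}
    first_sample, second_sample = [], []
    for filename in filenames:
        prefix = filename.split("_", 1)[0]
        selected_seen = prefix_dict.setdefault(prefix, [0, 0])
        selected_seen[1] += 1
        if selected_seen[0] < round(ratio * selected_seen[1]):
            first_sample.append(filename)
            selected_seen[0] += 1
        else:
            second_sample.append(filename)
    return first_sample, second_sample


# Hypothetical data: ten prefixes P0..P9 with ten files each.
files = ["P%d_%d" % (p, i) for p in range(10) for i in range(1, 11)]
first, second = partition(files, 0.75)
print(len(first), len(second))  # 80 20

# Every file has its own prefix, so each one lands in the first sample.
unique = ["Q%d_1" % i for i in range(20)]
first_u, second_u = partition(unique, 0.75)
print(len(first_u), len(second_u))  # 20 0
```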

Thank you @Blckknght I will definitely test out this approach and let you know the results, appreciate your help, still do not have enough points to vote up your answer. — Shivansh Singh, Aug 18 '16 at 17:29

score 0 · Answer 2 · answered Aug 30 '16 at 21:14

I figured out a good solution to this problem.

all_file_names = {}

# ObjList is a list of objects but we only need  
# file_name from that object for our solution

for x in ObjList:
    if x.file_name not in all_file_names:
        all_file_names[x.file_name] = 1
    else:
        all_file_names[x.file_name] += 1

trainingData = []
testData = []
temp_dict = {}

for x in ObjList:
    ratio = int(0.7*all_file_names[x.file_name])+1
    if x.file_name not in temp_dict:
        temp_dict[x.file_name] = 1
        trainingData.append(x)
    else:
        temp_dict[x.file_name] += 1
        if temp_dict[x.file_name] < ratio:
            trainingData.append(x)
        else:
            testData.append(x)
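The two passes above could also be written with collections.Counter. This is only a sketch: FileObj is a hypothetical stand-in for the real objects in ObjList (only file_name is used), split_objects is a made-up name, and the quota expression is meant to reproduce the loop above, where the first file of each name always goes to training and the rest go there while the running count stays at or below int(0.7 * total).

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects in ObjList; only file_name matters here.
FileObj = namedtuple("FileObj", ["file_name", "number"])


def split_objects(obj_list, ratio=0.7):
    totals = Counter(obj.file_name for obj in obj_list)  # files per name
    seen = Counter()                                     # running count per name
    training, test = [], []
    for obj in obj_list:
        seen[obj.file_name] += 1
        # At least one file per name goes to training, as in the loop above.
        quota = max(1, int(ratio * totals[obj.file_name]))
        if seen[obj.file_name] <= quota:
            training.append(obj)
        else:
            test.append(obj)
    return training, test


objs = [FileObj(name, i) for name in ("A", "B") for i in range(1, 11)]
objs.append(FileObj("C", 1))  # a name with a single file
trainingData, testData = split_objects(objs)
print(len(trainingData), len(testData))
```

With ten files each for A and B this gives 7 training and 3 test objects per name, and the single C file ends up in training.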
