Sorting a list based on another shorter list with incomplete values

Question

I have a list of file paths which I need to order in a specific way prior to reading and processing the files. The specific way is defined by a smaller list which contains only some file names, but not all of them. All other file paths which are not listed in presorted_list need to stay in the order they had previously.

Examples:

some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']

presorted_list = ['foo_baz', 'foo']

expected_list = ['path/to/foo_baz.csv',
                 'other/path/to/foo_baz.csv',
                 'path/to/foo(ignore_this).csv',
                 'path/to/bar_foo.csv',
                 'path/to/foo_bar(ignore_this).csv']

I've found some related posts:

But as far as I can tell, the questions and answers there always rely either on two lists of the same length, which I don't have (this results in errors like ValueError: 'bar_foo' is not in list), or on a presorted list that contains all possible values, which I can't provide.

My Idea:

I've come up with a solution that seems to work, but I'm unsure whether this is a good way to approach the problem:

import os
import re

EXCPECTED_LIST = ['path/to/foo_baz.csv',
                  'other/path/to/foo_baz.csv',
                  'path/to/foo(ignore_this).csv',
                  'path/to/bar_foo.csv',
                  'path/to/foo_bar(ignore_this).csv']

PRESORTED_LIST = ["foo_baz", "foo"]


def sort_function(item, len_list):
    # strip path and unwanted parts
    filename = re.sub(r"[\(\[].*?[\)\]]", "", os.path.basename(item)).split('.')[0]

    if filename in PRESORTED_LIST:
        return PRESORTED_LIST.index(filename)
    return len_list


def main():
    some_list = ['path/to/bar_foo.csv',
                 'path/to/foo_baz.csv',
                 'path/to/foo_bar(ignore_this).csv',
                 'path/to/foo(ignore_this).csv',
                 'other/path/to/foo_baz.csv',]
    list_length = len(some_list)
    sorted_list = sorted(some_list, key=lambda x: sort_function(x, list_length))

    assert sorted_list == EXCPECTED_LIST


if __name__ == "__main__":
    main()

Are there other (shorter, more pythonic) ways of solving this problem?

Is it possible that two paths in `some_list` have the same file name included in `presorted_list`? In that case, should they maintain their relative positions? — jdehesa, Apr 18 '18 at 10:08
Good point! I totally forgot about that situation. Yes, it's possible that two paths have the same file name and they should maintain their positions. I'll adjust my question. — coreuter, Apr 18 '18 at 10:13

jdehesa · Answer 1 · 2018-04-18T10:26:36.807

Here is how I think I would do it:

import re
from collections import OrderedDict
from itertools import chain

some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']
expected_list = ['path/to/foo_baz.csv',
                 'other/path/to/foo_baz.csv',
                 'path/to/foo(ignore_this).csv',
                 'path/to/bar_foo.csv',
                 'path/to/foo_bar(ignore_this).csv']

def my_sort(lst, presorted_list):
    rgx = re.compile(r"^(.*/)?([^/(.]*)(\(.*\))?(\.[^.]*)?$")
    d = OrderedDict((n, []) for n in presorted_list)
    d[None] = []
    for p in lst:  # iterate over the argument, not the global some_list
        m = rgx.match(p)
        n = m.group(2) if m else None
        if n not in d:
            n = None
        d[n].append(p)
    return list(chain.from_iterable(d.values()))

print(my_sort(some_list, presorted_list) == expected_list)
# True

score 1 · Answer 2 · answered Apr 18 '18 at 10:39

An easy implementation is to prefix the lines with a sortable sentinel before sorting, so that a plain sort() produces the required order and no custom sort key is needed. The regex can also be avoided if all file names follow the pattern you gave:

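for m, file2 in enumerate(some_list):
    # Unmatched paths get rank len(presorted_list), so they sort last.
    rank = len(presorted_list)
    for n, file1 in enumerate(presorted_list):
        if '/' + file1 + '.' in file2 or '/' + file1 + '(' in file2:
            rank = n
            break
    # The original index m keeps ties in their original order.
    some_list[m] = "%03d%03d:%s" % (rank, m, file2)
some_list.sort()
some_list = [file.split(':', 1)[-1] for file in some_list]
print(some_list)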

Result:

['path/to/foo_baz.csv',
 'other/path/to/foo_baz.csv',
 'path/to/foo(ignore_this).csv',
 'path/to/bar_foo.csv',
 'path/to/foo_bar(ignore_this).csv']
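
The same decorate-and-sort idea can also be sketched with tuples instead of string prefixes, which avoids the fixed three-digit width. This is only an illustration: the helper names `rank_of` and `sort_with_sentinels` are made up for this example, and it assumes the same `/name.` and `/name(` path patterns as above.

```python
def rank_of(path, presorted_list):
    # Hypothetical helper: position of the first presorted name found in the
    # path, or len(presorted_list) when no name matches.
    for n, name in enumerate(presorted_list):
        if '/' + name + '.' in path or '/' + name + '(' in path:
            return n
    return len(presorted_list)


def sort_with_sentinels(paths, presorted_list):
    # Decorate each path with (rank, original index), sort, then undecorate.
    decorated = [(rank_of(p, presorted_list), i, p) for i, p in enumerate(paths)]
    decorated.sort()
    return [p for _, _, p in decorated]


some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']

print(sort_with_sentinels(some_list, presorted_list))
```

Unlike the in-place version, this leaves `some_list` itself unchanged and returns a new list.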

Kenstars · Answer 3 · 2018-04-18T10:45:15.667

This is an unusual problem; here is a suggested solution:

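def file_name(path):
    # File name without directory and extension.
    return path.rpartition("/")[-1].partition(".")[0]

# A list comprehension (not filter) so the result can be sorted in Python 3.
only_sorted_elements = [x for x in some_list if file_name(x) in presorted_list]
only_sorted_elements.sort(key=lambda x: presorted_list.index(file_name(x)))
expected_list = []
count = 0
for each_element in some_list:
    if file_name(each_element) not in presorted_list:
        # Unlisted files stay at their original position.
        expected_list.append(each_element)
    else:
        # Listed files are filled in according to their presorted order.
        expected_list.append(only_sorted_elements[count])
        count += 1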

Hope this solves your problem. I first filter for only those elements whose file names are in presorted_list, then sort those elements according to their order in presorted_list.

Then I iterate over the original list and append accordingly.

Edited:

Changed the index lookup from the full path to the bare file name. This retains the original positions of the files that are not in the presorted list.

Edited again: the code below also strips the parenthesized part of the file name and returns the sorted results first, followed by the unsorted ones.

some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']
presorted_list = ['foo_baz', 'foo']

def base_name(path):
    # File name without directory, "(...)" suffix, or extension.
    return path.rpartition("/")[-1].partition("(")[0].partition(".")[0]

only_sorted_elements = [x for x in some_list if base_name(x) in presorted_list]
unsorted_all = [x for x in some_list if base_name(x) not in presorted_list]
only_sorted_elements.sort(key=lambda x: presorted_list.index(base_name(x)))
expected_list = only_sorted_elements + unsorted_all
print(expected_list)

Result :

['path/to/foo_baz.csv', 
'other/path/to/foo_baz.csv', 
'path/to/foo(ignore_this).csv', 
'path/to/bar_foo.csv', 
'path/to/foo_bar(ignore_this).csv']

Thank you for your suggestion! Unfortunately I can neither get your example to work with python 3.6.5 `AttributeError: 'filter' object has no attribute 'sort'` nor python 2.7.14 `ValueError: 'path/to/foo_baz.csv' is not in list` — coreuter, Apr 18 '18 at 10:32

Alain T. · Answer 4 · 2018-04-18T21:34:27.143

Since Python's sort is already stable, you only need to provide it with a coarse grouping for the sort key.
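
A small sketch of what stability means here, using made-up data: items with equal keys keep their original relative order.

```python
# Made-up example data: (group, label) pairs.
items = [(2, 'a'), (0, 'b'), (2, 'c'), (0, 'd'), (1, 'e')]

# Sort on the group only; ties keep their input order because sorted() is stable.
by_group = sorted(items, key=lambda pair: pair[0])
print(by_group)
# [(0, 'b'), (0, 'd'), (1, 'e'), (2, 'a'), (2, 'c')]
```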

Given the specifics of your sorting requirements this is better done using a function. For example:

def presort(presorted):
    def sortkey(name):
        filename = name.split("/")[-1].split(".")[0].split("(")[0]
        if filename in presorted:
            return presorted.index(filename)
        return len(presorted)
    return sortkey

sorted_list = sorted(some_list,key=presort(['foo_baz', 'foo']))

In order to keep the process generic and simple to use, the presorted_list should be provided as a parameter and the sort key function should use it to produce the grouping keys. This is achieved by returning a function (sortkey) that captures the presorted list parameter.

This sortkey() function returns the index of the file name in the presorted_list, or a number beyond the last index for unmatched file names. So, if you have two names in the presorted_list, the corresponding files are grouped under sort key values 0 and 1. All other files end up in group 2.
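
To make the grouping concrete, the sketch below prints the key each path from the question would receive. It uses a dictionary lookup instead of `list.index`, which is a swapped-in variation rather than the code above; `make_sortkey` and `rank` are names invented for this example.

```python
def make_sortkey(presorted):
    # Map each presorted name to its first position in the list.
    rank = {}
    for i, name in enumerate(presorted):
        rank.setdefault(name, i)

    def sortkey(path):
        filename = path.split("/")[-1].split(".")[0].split("(")[0]
        # Unmatched names fall into one extra group after all presorted names.
        return rank.get(filename, len(presorted))

    return sortkey


some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']

key = make_sortkey(['foo_baz', 'foo'])
for path in some_list:
    print(key(path), path)
# 2 path/to/bar_foo.csv
# 0 path/to/foo_baz.csv
# 2 path/to/foo_bar(ignore_this).csv
# 1 path/to/foo(ignore_this).csv
# 0 other/path/to/foo_baz.csv
```

Passing `key` to `sorted()` then orders the groups 0, 1, 2 while the stable sort keeps the original order inside each group.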

The conditions that you use to determine which part of the file name should be found in presorted_list are somewhat unclear, so I only covered the specific case of the opening parenthesis. Within the sortkey() function, you can add more sophisticated parsing to meet your needs.
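
As one possible direction, assuming that anything in parentheses or square brackets should be ignored (as in the regular expression from the question), the name extraction could be sketched with `os.path`. The helper names `clean_name` and `presort_key` are hypothetical.

```python
import os
import re


def clean_name(path):
    # Hypothetical helper: drop the directory and the extension, then remove
    # any "(...)" or "[...]" parts. This parsing rule is an assumption.
    base = os.path.basename(path)
    stem = os.path.splitext(base)[0]
    return re.sub(r"[(\[].*?[)\]]", "", stem)


def presort_key(presorted):
    def sortkey(path):
        name = clean_name(path)
        # Index in presorted, or one shared group after it for unmatched names.
        return presorted.index(name) if name in presorted else len(presorted)
    return sortkey


some_list = ['path/to/bar_foo.csv',
             'path/to/foo_baz.csv',
             'path/to/foo_bar(ignore_this).csv',
             'path/to/foo(ignore_this).csv',
             'other/path/to/foo_baz.csv']

sorted_list = sorted(some_list, key=presort_key(['foo_baz', 'foo']))
print(sorted_list)
```

Because `os.path.splitext` only removes the last extension, a dot inside the parentheses does not cut the name short, unlike `split(".")[0]`.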
