1

I have a list of strings in python consisting of various filenames, like this (but much longer):

all_templates = ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme', 'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

I'd like to create multiple smaller lists for elements that have the same object name (the substring starting with OBJ- and ending right before .feedme). So I'd have a list like this:

obj1 = ['fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme'],

and so on for the other matching 'objects'. In reality I have over 900 unique objects, and the original list all_templates has over 4000 elements, because each object has 3 or more separate template files (which all appear in random order to begin with). So in the end I want over 900 lists, one per object. How can I do this?

Edit: Here is what I tried, but every sublist ends up containing ALL the original template filenames, when each sublist is supposed to hold only the files for one object name.

import re
# Break up list into multiple lists according to substring (object name)
obj_list = [re.search(r'.*(OBJ.+)\.feedme', filename)[1] for filename in all_templates]
obj_list = list(set(obj_list)) # create list of unique objects (remove duplicates)

templates_objs_sorted = [[]]*len(obj_list)
for i in range(len(obj_list)):
    for template in all_templates:
        if obj_list[i] in template:
            templates_objs_sorted[i].append(template)
curious_cosmo

3 Answers

1
from collections import defaultdict
from pprint import pprint

all_templates = ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme', 'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

# simple helper function to extract the common object name
# you could probably use Regex... but then you'd have 2 problems
def objectName(path):
    start = path.index('-OBJ-')
    stop = path.index('.feedme')
    return path[(start + 5):stop]

# I really wanted to use a one line reduce here, but... 
grouped = defaultdict(list)
for each in all_templates:
    grouped[objectName(each)].append(each)
pprint(grouped)
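
As for why the attempt in the question puts every filename into every sublist: `[[]] * n` repeats a reference to one single inner list instead of creating n separate lists, so every append lands in the same list. A minimal sketch of the effect and the usual fix (the names `aliased` and `independent` are illustrative only):

```python
# [[]] * 3 creates three references to the SAME inner list
aliased = [[]] * 3
aliased[0].append('a')
print(aliased)       # [['a'], ['a'], ['a']]

# a list comprehension evaluates [] once per iteration, giving distinct lists
independent = [[] for _ in range(3)]
independent[0].append('a')
print(independent)   # [['a'], [], []]
```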

ASIDE/TANGENT

OK, it really bugged me that I couldn't do a simple one-liner using reduce there. Ultimately, I wish Python had a good groupby function. It has a function by that name (itertools.groupby), but it only groups consecutive items that share a key. Smalltalk, Objective-C, and Swift all have groupby mechanisms which basically allow you to bucketize an iterable by an arbitrary transform function.

My initial attempt looked like:

grouped = reduce(
    lambda accum, each: accum[objectName(each)].append(each),
    all_templates,
    defaultdict(list))

The problem is the lambda. A lambda is limited to a single expression, and for it to work in reduce, it must return a modified version of the accumulated argument. But Python's mutating methods, such as list.append, return None rather than the modified object. Even if we replaced the append with <accessTheCurrentList> + [each], we'd need a dictionary-modifying method that updated the value at a key and returned the modified dictionary. I could not find such a thing.

However, what we can do is load more information into our accumulator, for example, a tuple. We can use one slot of the tuple to keep passing the defaultdict reference along, and the other to catch the unhelpful None return of the modifying operation. It ends up pretty ugly, but it is a one-liner:

from functools import reduce
grouped = reduce(
    lambda accum, each: (accum[0], accum[0][objectName(each)].append(each)),
    all_templates,
    (defaultdict(list), None))[0]
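
A shorter variant, offered only as a sketch and not part of the attempts above: because `list.append` returns `None`, the expression `... or accum` evaluates to the accumulator itself, so the lambda can hand the dictionary back to reduce without the tuple. The three-item `sample` list is made up for illustration.

```python
from collections import defaultdict
from functools import reduce

def objectName(path):
    # same helper as above: the text between '-OBJ-' and '.feedme'
    start = path.index('-OBJ-')
    stop = path.index('.feedme')
    return path[(start + 5):stop]

# hypothetical, shortened filenames
sample = ['a-OBJ-x.feedme', 'b-OBJ-y.feedme', 'c-OBJ-x.feedme']

grouped = reduce(
    # append() returns None, so "None or accum" yields accum
    lambda accum, each: accum[objectName(each)].append(each) or accum,
    sample,
    defaultdict(list))
print(dict(grouped))
```

This relies on a side effect inside an expression, so the plain for loop is still the more readable choice.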
Travis Griggs
0

You can group a sorted list:

from itertools import groupby
import re

all_templates = ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme', 'fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme', 'fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme', 'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']

pattern = re.compile(r'OBJ-.*?\.feedme$')
objs = {name: pattern.search(name)[0] for name in all_templates}
result = [list(g) for k, g in groupby(sorted(all_templates, key=objs.get), key=objs.get)]

print(result)

Output:

[['fitting_file_devauc_cutout-IMG-HSC-I-18114-0,3-OBJ-NEP175616.1+660601.5.feedme'],
 ['fitting_file_expdisk_cutout-IMG-HSC-I-18115-6,3-OBJ-NEP175857.9+655841.2.feedme'],
 ['fitting_file_sersic_cutout-IMG-HSC-I-18115-6,7-OBJ-NEP175927.6+664230.2.feedme'],
 ['fitting_file_sersic_cutout-IMG-HSC-I-18115-3,3-OBJ-NEP180508.6+655617.3.feedme'],
 ['fitting_file_sersic_cutout-IMG-HSC-I-18115-1,8-OBJ-NEP180840.8+665226.2.feedme'],
 ['fitting_file_expdisk_cutout-IMG-HSC-I-18114-0,5-OBJ-zsel56238.feedme',
  'fitting_file_sersic_cutout-IMG-HSC-I-18115-6,4-OBJ-zsel56238.feedme']]
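
The sort matters because `itertools.groupby` only merges adjacent items with equal keys. A small sketch with made-up data (the `items` list and the `first_char` key function are hypothetical) shows the difference:

```python
from itertools import groupby

def first_char(s):
    # key function: group by the first character
    return s[0]

items = ['x1', 'y1', 'x2']  # the two 'x' items are not adjacent

# without sorting, the non-adjacent 'x' items end up in separate groups
unsorted_groups = [list(g) for _, g in groupby(items, key=first_char)]

# sorting by the same key first makes equal keys adjacent
sorted_groups = [list(g) for _, g in
                 groupby(sorted(items, key=first_char), key=first_char)]

print(unsorted_groups)  # [['x1'], ['y1'], ['x2']]
print(sorted_groups)    # [['x1', 'x2'], ['y1']]
```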
iz_
  • I was going to use `groupby` until I remembered that python's variant is only sequential. Wish there was an alternative that wasn't sequential. – Travis Griggs Jan 24 '19 at 00:36
0

This approach uses regular expressions, so it requires:

import re

I adapted the list of filenames to show how the result handles matching and non-matching names:

all_templates = ['aaa-OBJ-NEP175857.9+655841.2.feedme',
                 'bbb-OBJ-NEP175857.9+655841.2.feedme',
                 'ccc-OBJ-NEP175857.9+655841.2.feedme',
                 'ddd-OBJ-whathever.feedme',
                 'eee-OBJ-whathever.feedme',
                 'fff-SUBJ-whathever.feedme',
                 'fff-OBJ.feedme'
                ]

This can be an option:

result = {}
for filename in all_templates:
    # raw string, with the dot escaped so it matches a literal '.'
    match = re.search(r'OBJ-(.+?)\.feedme', filename)
    if match:
        result.setdefault(match.group(1), []).append(filename)
    else:
        # filenames without an OBJ-<name> part are collected under 'no-match'
        result.setdefault('no-match', []).append(filename)

It uses the substring between OBJ- and .feedme as the key of a dict and appends each filename that has the same substring to that key's list. If a filename does not match the search, it is appended to the list under the 'no-match' key.

So, it returns:

print(result)
# {'NEP175857.9+655841.2': ['aaa-OBJ-NEP175857.9+655841.2.feedme', 'bbb-OBJ-NEP175857.9+655841.2.feedme', 'ccc-OBJ-NEP175857.9+655841.2.feedme'],
#  'whathever': ['ddd-OBJ-whathever.feedme', 'eee-OBJ-whathever.feedme'],
#  'no-match': ['fff-SUBJ-whathever.feedme', 'fff-OBJ.feedme']}

If you require just the list of groups:

list(result.values())
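
With 900+ objects, looking a group up by its key is usually more practical than creating one variable per object. A sketch of the same idea wrapped in a function, assuming the object name always ends right before a trailing `.feedme`; the names `group_by_object` and `OBJ_PATTERN` are hypothetical:

```python
import re
from collections import defaultdict

# compiled once; the dot is escaped and '$' anchors the extension at the end
OBJ_PATTERN = re.compile(r'OBJ-(.+?)\.feedme$')

def group_by_object(filenames):
    """Return a dict mapping each object name to its list of filenames."""
    groups = defaultdict(list)
    for filename in filenames:
        match = OBJ_PATTERN.search(filename)
        key = match.group(1) if match else 'no-match'
        groups[key].append(filename)
    return dict(groups)

groups = group_by_object(['ddd-OBJ-whathever.feedme',
                          'eee-OBJ-whathever.feedme',
                          'fff-OBJ.feedme'])
print(groups['whathever'])  # files for one object
print(groups['no-match'])   # files that did not match the pattern
```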
iGian