Change list into list of lists based on location of subset

Question

I am working with a list of filenames. There are no duplicates and the list is sorted.

The list can be grouped into subsets. Files with a _0001 in the name indicate the start of a new subset. Then _0002 is the 2nd item in the subset, and so on. I would like to transform this flat list into a hierarchical list of lists.

Here is an example of the original, flat list:

['Log_03-22-2016_12-06-18_GMT_0001.log',
 'Log_03-22-2016_12-10-41_GMT_0002.log',
 'Log_03-22-2016_12-11-56_GMT_0003.log',
 'Log_03-22-2016_12-13-12_GMT_0004.log',
 'Log_03-22-2016_12-14-27_GMT_0005.log',
 'Log_03-22-2016_12-15-43_GMT_0006.log',
 'Log_03-22-2016_12-16-58_GMT_0007.log',
 'Log_03-23-2016_09-08-57_GMT_0001.log',
 'Log_03-23-2016_09-13-24_GMT_0002.log',
 'Log_03-23-2016_09-14-26_GMT_0003.log',
 'Log_03-23-2016_09-15-27_GMT_0004.log',
 'Log_03-23-2016_11-17-57_GMT_0001.log',
 'Log_03-23-2016_11-19-21_GMT_0002.log']

I would like to slice this into lists of subsets, using the presence of the _0001 to detect the beginning of a new subset. Then return a list of all the lists of subsets. Here is an example output, using the above input:

[['Log_03-22-2016_12-06-18_GMT_0001.log',
  'Log_03-22-2016_12-10-41_GMT_0002.log',
  'Log_03-22-2016_12-11-56_GMT_0003.log',
  'Log_03-22-2016_12-13-12_GMT_0004.log',
  'Log_03-22-2016_12-14-27_GMT_0005.log',
  'Log_03-22-2016_12-15-43_GMT_0006.log',
  'Log_03-22-2016_12-16-58_GMT_0007.log'],
 ['Log_03-23-2016_09-08-57_GMT_0001.log',
  'Log_03-23-2016_09-13-24_GMT_0002.log',
  'Log_03-23-2016_09-14-26_GMT_0003.log',
  'Log_03-23-2016_09-15-27_GMT_0004.log'],
 ['Log_03-23-2016_11-17-57_GMT_0001.log',
  'Log_03-23-2016_11-19-21_GMT_0002.log']]

Here is the current solution I have. It seems like there ought to be a more elegant and Pythonic way of doing this:

import glob

first_log_indicator = '_0001'

log_files = sorted(glob.glob('Log_*_GMT_*.log')) 

first_logs = [s for s in log_files if first_log_indicator in s]

LofL = []

if len(first_logs) > 1:
    for fl_idx, fl_name in enumerate(first_logs):
        start_slice = log_files.index(fl_name)
        if fl_idx + 1 < len(first_logs):
            stop_slice = log_files.index(first_logs[fl_idx+1])
            LofL.append(log_files[start_slice:stop_slice])
        else:
            LofL.append(log_files[start_slice:])
else:
    LofL.append(log_files)

I looked into itertools, and while I am admittedly unfamiliar with that module, I didn't see anything that quite did this.

The closest questions I could find on SO all had sublists of fixed length. Here, the sublists are of arbitrary length. Others used a "separator" item to delimit the sublists in the original (flat) list, with the separators thrown out when making the list of lists. I do not have a separator in that sense, since I do not want to throw away any items in the original list.

Can anyone please suggest a better approach than what I have above?

I think this is what you are looking for: http://stackoverflow.com/questions/15357830/python-spliting-a-list-based-on-a-delimiter-word. Except that you need to apply the endswith check. — alecxe, Mar 31 '16 at 02:55

score 2 · Accepted Answer · answered Mar 31 '16 at 03:04

You could get the indices of the first in each series and then split the list as follows:

firsts = [i for i, v in enumerate(log_files) if '_0001' in v]
list_of_lists = [log_files[i:j] for i, j in zip(firsts, firsts[1:] + [None])]
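
To see how the zip pairing works, here is a small self-contained sketch using the last six filenames from the question (the intermediate variable `pairs` is added only for illustration). Each start index is paired with the next start index, and the last one is paired with `None`, so the final slice runs to the end of the list.

```python
# The last six filenames from the question's example list.
log_files = [
    'Log_03-23-2016_09-08-57_GMT_0001.log',
    'Log_03-23-2016_09-13-24_GMT_0002.log',
    'Log_03-23-2016_09-14-26_GMT_0003.log',
    'Log_03-23-2016_09-15-27_GMT_0004.log',
    'Log_03-23-2016_11-17-57_GMT_0001.log',
    'Log_03-23-2016_11-19-21_GMT_0002.log',
]

# Positions where a new series starts.
firsts = [i for i, v in enumerate(log_files) if '_0001' in v]

# Pair each start with the following start; None means "slice to the end".
pairs = list(zip(firsts, firsts[1:] + [None]))

list_of_lists = [log_files[i:j] for i, j in pairs]

print(firsts)  # [0, 4]
print(pairs)   # [(0, 4), (4, None)]
```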

sp.

This is good, but technically if the list starts with something other than '0001' some values may be discarded. – hilberts_drinking_problem Mar 31 '16 at 03:15
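
One possible way to address this, assuming that any files before the first '_0001' should be kept as their own leading group rather than dropped: add index 0 as a start whenever the list does not begin with a '_0001' file. The function name `split_on_first`, its `marker` parameter, and the sample filenames are hypothetical, introduced only for this sketch.

```python
def split_on_first(names, marker='_0001'):
    """Split names into sublists, starting a new one at each name containing marker.

    Assumption: items before the first marker are kept as a leading group
    instead of being discarded.
    """
    starts = [i for i, name in enumerate(names) if marker in name]
    # If the list is non-empty and does not begin with a marker file,
    # start a group at index 0 so nothing is dropped.
    if names and (not starts or starts[0] != 0):
        starts.insert(0, 0)
    return [names[i:j] for i, j in zip(starts, starts[1:] + [None])]


# Made-up sample that begins partway through a series.
sample = ['x_0003.log', 'x_0004.log', 'y_0001.log', 'y_0002.log']
print(split_on_first(sample))
# [['x_0003.log', 'x_0004.log'], ['y_0001.log', 'y_0002.log']]
```

With an input that already starts with a '_0001' file, this should give the same result as the two-line version above.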

score 1 · Answer 2 · answered Mar 31 '16 at 04:40

If the elements always follow that pattern, I would do something like:

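# Pair each filename with its sequence suffix, e.g. '0001'.
prepared_data = ((name, name.split('.')[0].split('_')[-1]) for name in log_files)
final_logs = []
for name, suffix in prepared_data:
    if suffix == '0001' or not final_logs:
        # Start a new sublist at each '0001' (or at the very first file,
        # which avoids an IndexError if the list does not begin with '0001').
        final_logs.append([name])
    else:
        final_logs[-1].append(name)
print(final_logs)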

score 0 · Answer 3 · answered Mar 31 '16 at 03:13

I think @sp. has an elegant solution. Here is the blue collar method:

lst = ['Log_03-22-2016_12-06-18_GMT_0001.log',
'Log_03-22-2016_12-10-41_GMT_0002.log',
'Log_03-22-2016_12-11-56_GMT_0003.log',
'Log_03-22-2016_12-13-12_GMT_0004.log',
'Log_03-22-2016_12-14-27_GMT_0005.log',
'Log_03-22-2016_12-15-43_GMT_0006.log',
'Log_03-22-2016_12-16-58_GMT_0007.log',
'Log_03-23-2016_09-08-57_GMT_0001.log',
'Log_03-23-2016_09-13-24_GMT_0002.log',
'Log_03-23-2016_09-14-26_GMT_0003.log',
'Log_03-23-2016_09-15-27_GMT_0004.log',
'Log_03-23-2016_11-17-57_GMT_0001.log',
'Log_03-23-2016_11-19-21_GMT_0002.log']

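lsts = []
buf = [lst[0]]  # current sublist; assumes lst is non-empty
for name in lst[1:]:
    if name[-8:-4] == '0001':
        # A new series starts: store the finished sublist and begin another.
        lsts.append(buf)
        buf = [name]
    else:
        buf.append(name)
lsts.append(buf)  # store the final sublist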
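
Since the question mentions itertools, here is a possible sketch that combines `itertools.accumulate` and `itertools.groupby`, both from the standard library. It is not taken from any of the answers above; the function name `group_by_series`, the `marker` parameter, and the sample filenames are hypothetical. The idea is to keep a running count of '_0001' files seen so far and use that count as the grouping key, so every file in the same series shares a key.

```python
from itertools import accumulate, groupby


def group_by_series(names, marker='_0001'):
    """Group consecutive names into sublists, one per series."""
    names = list(names)  # materialize so the input can be iterated twice
    # Running count of marker files seen so far: 1, 1, 1, 2, 2, 3, ...
    keys = accumulate(int(marker in name) for name in names)
    return [
        [name for _, name in group]
        for _, group in groupby(zip(keys, names), key=lambda pair: pair[0])
    ]


# Made-up sample in a shortened form of the question's naming pattern.
sample = ['a_0001.log', 'a_0002.log', 'b_0001.log']
print(group_by_series(sample))
# [['a_0001.log', 'a_0002.log'], ['b_0001.log']]
```

Any files that appear before the first '_0001' get a key of 0 and end up in their own leading group, so nothing is discarded.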
