Load a list of lists from a file based on a grouping variable?

Question

If I have the file:

A pgm1
A pgm2
A pgm3
Z pgm4
Z pgm5
C pgm6
C pgm7
C pgm8
C pgm9

How can I create the list:

[['pgm1','pgm2','pgm3'],['pgm4','pgm5'],['pgm6','pgm7','pgm8','pgm9']]

I need to retain the original order from the load file. So [pgm4, pgm5] must be the 2nd sublist.

My preference is that the new sub-list is triggered when the grouping variable changes from the previous one, thus "A, Z, C". But I can accept if the grouping variable must be sequential, i.e. "1, 2, 3".

(This is to support running the programs in each sub-list concurrently, but waiting for all upstream programs to finish before proceeding to the next list.)

I'm on RHEL 2.6.32 using Python 2.6.6

I conducted web searches and searched SO for "python file list of lists" for over an hour before posting. What stumped me was how to detect when the group changed. Having said that, in future I will do my best to provide example code I've tried as part of all SO posts. — Scott, Nov 17 '15 at 00:16

Remi Guan · Answer 1 · 2015-11-16T08:24:51.587

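Simply use collections.defaultdict(). Note that a plain dict does not preserve insertion order on older Python versions, so the sub-lists can come out in a different order than in the file, as the demo shows.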

Code:

import collections
d = collections.defaultdict(list)

infile = 'filename'
with open(infile) as f:
    a = [i.strip() for i in f]

a = [i.split() for i in a]

for key, value in a:
    d[key].append(value)

l = list(d.values())

Demo:

>>> import collections
>>> d = collections.defaultdict(list)

>>> infile = 'filename'
>>> with open(infile) as f:
...     a = [i.strip() for i in f]

>>> a = [i.split() for i in a]
>>> a
[['A', 'pgm1'], ['A', 'pgm2'], ['A', 'pgm3'], ['Z', 'pgm4'], ['Z', 'pgm5'], ['C', 'pgm6'], ['C', 'pgm7'], ['C', 'pgm8'], ['C', 'pgm9']]

>>> for key, value in a:
...     d[key].append(value)

>>> d
defaultdict(<class 'list'>, {'A': ['pgm1', 'pgm2', 'pgm3'], 'C': ['pgm6', 'pgm7', 'pgm8', 'pgm9'], 'Z': ['pgm4', 'pgm5']})

>>> d.values()
dict_values([['pgm1', 'pgm2', 'pgm3'], ['pgm6', 'pgm7', 'pgm8', 'pgm9'], ['pgm4', 'pgm5']])

>>> list(d.values())
[['pgm1', 'pgm2', 'pgm3'], ['pgm6', 'pgm7', 'pgm8', 'pgm9'], ['pgm4', 'pgm5']]
>>>

The code below does the same thing as the code above, but keeps the original order:

infile = 'filename'
with open(infile) as f:
    a = [i.strip() for i in f]

a = [i.split() for i in a]

def orderset(seq):
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if not (x in seen or seen_add(x))]

l = []
for i in orderset([i[0] for i in a]):
    l.append([j[1] for j in a if j[0] == i])

I need to retain the original order from the load file. So pgm4, pgm5 need to be the 2nd sublist. I'm on RHEL 2.6.32 with Python 2.6.6 so I don't have OrderedDict. — Scott, Nov 16 '15 at 07:33
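Given the Python 2.6 constraint in the comment above, one possible alternative is a plain loop that starts a new sub-list whenever the key differs from the key on the previous line. This is a minimal sketch, not part of the original answer: the function name group_on_change and the inline sample data are hypothetical, and it assumes each line is a key, then whitespace, then a program name.

```python
def group_on_change(lines):
    """Start a new sub-list each time the first field differs from the previous line's."""
    groups = []
    previous_key = object()  # sentinel that never equals a real key
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if len(parts) < 2:
            continue  # assumption: blank or key-only lines are skipped
        key, value = parts[0], parts[1].strip()
        if key != previous_key:
            groups.append([])  # key changed, so open a new group
            previous_key = key
        groups[-1].append(value)
    return groups


# Hypothetical sample mirroring the file in the question
sample = [
    "A pgm1\n", "A pgm2\n", "A pgm3\n",
    "Z pgm4\n", "Z pgm5\n",
    "C pgm6\n", "C pgm7\n", "C pgm8\n", "C pgm9\n",
]
result = group_on_change(sample)
print(result)
```

Because the loop only compares each key with the one before it, a key that reappears later in the file starts a separate sub-list instead of being merged into the earlier one. That matches the "triggered when the grouping variable changes" preference in the question.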

score 0 · Accepted Answer · edited May 23 '17 at 12:31

After my OP, additional web searches found this: How do I use Python's itertools.groupby()?

Here is my current approach. Please advise if I can make it more Pythonic.

loadfile1.txt (no grouping variable - same output as loadfile4.txt):

pgm1
pgm2
pgm3

pgm4
pgm5

pgm6
pgm7
pgm8
/a/path/with spaces/pgm9

loadfile2.txt (random grouping variable):

10, pgm1
10, pgm2
10, pgm3

ZZ, pgm4
ZZ, pgm5

-5, pgm6
-5, pgm7
-5, pgm8
-5, /a/path/with spaces/pgm9

loadfile3.txt (same grouping variable - no dependencies - multi-threaded):

,pgm1
,pgm2
,pgm3

,pgm4
,pgm5

,pgm6
,pgm7
,pgm8
,/a/path/with spaces/pgm9

loadfile4.txt (different grouping variable - dependencies - single threaded):

1, pgm1
2, pgm2
3, pgm3

4, pgm4
5, pgm5

6, pgm6
7, pgm7
8, pgm8
9, /a/path/with spaces/pgm9

My Python script:

#!/usr/bin/python

# See https://stackoverflow.com/questions/4842057/python-easiest-way-to-ignore-blank-lines-when-reading-a-file

from itertools import groupby

# convert file to list of lines, ignoring any blank lines
filename = 'loadfile2.txt'

with open(filename) as f_in:
    # list() keeps this working where filter() returns a lazy iterator (Python 3)
    lines = list(filter(None, (line.rstrip() for line in f_in)))

print(lines)

# convert list to a list of lists split on comma
lines = [i.split(',') for i in lines]
print(lines)

# create list of lists based on the key value (first item in sub-lists)
listofpgms = []
for key, group in groupby(lines, lambda x: x[0]):
    pgms = []
    for pgm in group:
        try:
            pgms.append(pgm[1].strip())
        except IndexError:
            pgms.append(pgm[0].strip())

    listofpgms.append(pgms)

print(listofpgms)

Output when using loadfile2.txt:

['10, pgm1', '10, pgm2', '10, pgm3', 'ZZ, pgm4', 'ZZ, pgm5', '-5, pgm6', '-5, pgm7', '-5, pgm8', '-5, /a/path/with spaces/pgm9']
[['10', ' pgm1'], ['10', ' pgm2'], ['10', ' pgm3'], ['ZZ', ' pgm4'], ['ZZ', ' pgm5'], ['-5', ' pgm6'], ['-5', ' pgm7'], ['-5', ' pgm8'], ['-5', ' /a/path/with spaces/pgm9']]
[['pgm1', 'pgm2', 'pgm3'], ['pgm4', 'pgm5'], ['pgm6', 'pgm7', 'pgm8', '/a/path/with spaces/pgm9']]
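On the "more Pythonic" question, one possible tightening is to parse each line into a (key, program) pair first and then let groupby do the grouping, which removes the try/except. This is only a sketch under stated assumptions: the helper names parse_line and load_groups are hypothetical, the sample lines are inlined rather than read from a file, and unlike the script above the key is stripped of surrounding whitespace before comparison.

```python
from itertools import groupby


def parse_line(line):
    """Return (key, program); a line without a comma is used as its own key."""
    key, sep, pgm = line.partition(',')  # split on the first comma only
    if not sep:
        pgm = key  # no grouping variable: the whole line is the program
    return key.strip(), pgm.strip()


def load_groups(lines):
    """Group programs by runs of consecutive equal keys, skipping blank lines."""
    pairs = [parse_line(line) for line in lines if line.strip()]
    return [[pgm for _key, pgm in group]
            for _run_key, group in groupby(pairs, lambda pair: pair[0])]


# Hypothetical sample with the same contents as loadfile2.txt
loadfile2 = [
    "10, pgm1", "10, pgm2", "10, pgm3", "",
    "ZZ, pgm4", "ZZ, pgm5", "",
    "-5, pgm6", "-5, pgm7", "-5, pgm8", "-5, /a/path/with spaces/pgm9",
]
listofpgms = load_groups(loadfile2)
print(listofpgms)
```

With a real file, passing the open file object to load_groups should work the same way, since it only iterates over lines. Using partition instead of split(',') also keeps a program path intact if it happens to contain a comma.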
