I obtained a list of all files in a folder using glob:

lista = glob.glob("*.h5")

The list basically contains files with names like:

   abc_000000000_000.h5
   abc_000000000_001.h5
   abc_000000000_002.h5
   ......
   abc_000000000_011.h5
   ......
   abc_000000001_000.h5
   abc_000000001_001.h5
   abc_000000001_002.h5
   ....
   abc_000000026_000.h5
   abc_000000026_001.h5
   ....
   abc_000000027_000.h5
   ....
   abc_000000027_011.h5

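The names follow the pattern abc_0*_0*.h5. How do I reshape this flat list into a list of lists? Each inner list would hold the files that share one value of the first wildcard (for example, everything matching 'abc_000000027_0*.h5'), and the outer list would run over the successive values of that first wildcard ('abc_000000*').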

One way to create an input would be:

 lista=[]
 for i in range(115):
     for j in range(14):
         # zero-padded to 9 and 3 digits, with the .h5 extension
         item="abc_%09d_%03d.h5"%(i,j)
         lista.append(item)

My attempt, which works but is ugly:

     listb = glob.glob("*_011.h5")
     # then, for each item in listb, split the name and glob again, for example:
     listc = glob.glob("abc_000000027*.h5")
wander95
  • The break in each sub list is based on what? Better example == better answer... – dawg Jul 13 '21 at 14:56
  • The break is based on the two wildcards. I'll add it to the question – wander95 Jul 13 '21 at 15:00
  • Do you mean that your files are of the format `abc_xxxxxxxx_yyy.h5` and you want a dict with `abc_xxxxxxxx` as the key with the value being a list containing the full file names that match the pattern `abc_xxxxxxxx_*.h5`? Please [edit] your question to provide sample input (file names) and output – Pranav Hosangadi Jul 13 '21 at 15:10

1 Answer

Given:

ls -1
abc_00000001_1.h5
abc_00000001_2.h5
abc_00000001_3.h5
abc_00000002_1.h5
abc_00000002_2.h5
abc_00000002_3.h5
abc_00000003_1.h5
abc_00000003_2.h5
abc_00000003_3.h5

You can use pathlib, itertools.groupby and a numeric sort key to achieve this (groupby only merges adjacent items, so the files are sorted first):

from pathlib import Path 
from itertools import groupby
import re 

p=Path('/tmp/t2')

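def _k(s):
    # Sort key: the two numeric fields of the name as a tuple of ints, e.g. (1, 2)
    s=str(s)
    try:
        return tuple(map(int, re.search(r'_(\d+)_(\d*)', s).groups()))
    except (AttributeError, ValueError):
        # AttributeError: no match (re.search returned None); ValueError: empty second field
        return (0,0)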
    
def k1(s):
    return _k(s)

def k2(s):
    return _k(s)[0]

result=[]   
files=sorted(p.glob('abc_000000*.h5'), key=k1)
for k,g in groupby(files, key=k2):
    result.append(list(map(str, g)))

Which could be simplified to:

def _k(p):
    try:
        return tuple(map(int, p.stem.split('_')[-2:]))
    except ValueError:
        return (0,0)
    
files=sorted(p.glob('abc_000000*_*.h5'), key=_k)
result=[list(map(str, g)) for k,g in groupby(files, key=lambda e: _k(e)[0])]

Result (in either case):

>>> result
[['/tmp/t2/abc_00000001_1.h5', '/tmp/t2/abc_00000001_2.h5', '/tmp/t2/abc_00000001_3.h5'], ['/tmp/t2/abc_00000002_1.h5', '/tmp/t2/abc_00000002_2.h5', '/tmp/t2/abc_00000002_3.h5'], ['/tmp/t2/abc_00000003_1.h5', '/tmp/t2/abc_00000003_2.h5', '/tmp/t2/abc_00000003_3.h5']]

This could just as easily be a dict:

>>> {k:list(map(str, g)) for k,g in groupby(files, key=k2)}
{1: ['/tmp/t2/abc_00000001_1.h5', '/tmp/t2/abc_00000001_2.h5', '/tmp/t2/abc_00000001_3.h5'], 
 2: ['/tmp/t2/abc_00000002_1.h5', '/tmp/t2/abc_00000002_2.h5', '/tmp/t2/abc_00000002_3.h5'], 
 3: ['/tmp/t2/abc_00000003_1.h5', '/tmp/t2/abc_00000003_2.h5', '/tmp/t2/abc_00000003_3.h5']}
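
If sorting the whole listing before grouping is not wanted, one possible alternative (a sketch, not part of the approach above) is to collect the names into a collections.defaultdict keyed by the first number and sort only when building the nested list. The sample names and the helper name group_names are made up for illustration; they follow the 9-digit / 3-digit format from the question and are plain strings, so nothing here touches the filesystem.

```python
import re
from collections import defaultdict

# Hypothetical sample names in the question's format: abc_<9 digits>_<3 digits>.h5
# They are deliberately out of order, with one unrelated file mixed in.
names = [
    "abc_000000001_001.h5",
    "abc_000000000_001.h5",
    "abc_000000001_000.h5",
    "abc_000000000_000.h5",
    "notes.txt",
]

# Captures the two numeric fields just before the .h5 extension.
PATTERN = re.compile(r"_(\d+)_(\d+)\.h5$")

def group_names(names):
    """Return a list of lists, one inner list per value of the first number."""
    groups = defaultdict(list)
    for name in names:
        m = PATTERN.search(name)
        if m is None:
            continue  # skip names that do not follow the pattern
        # Store (second number, name) so each group can be ordered numerically.
        groups[int(m.group(1))].append((int(m.group(2)), name))
    # Outer list ordered by the first number, inner lists by the second.
    return [[n for _, n in sorted(groups[k])] for k in sorted(groups)]

nested = group_names(names)
```

Unlike groupby, this does not depend on the input already being ordered, because every name is placed in its bucket by key rather than by position. With glob.glob("*.h5") as the input, nested would hold bare file names instead of the full paths shown above.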
dawg