I am iterating through a folder of XML files to extract some text from each one, and I want to keep track of which file each text match came from.

I am looking to put the filenames into the filename_master list. I think I may be over-complicating this by using a regex (each filename is 14 digits followed by .xml), but a simpler approach isn't coming to me.

    import glob
    import os
    import re
    import xml.etree.ElementTree as ET

    path = '/Users/Downloads/PDF/XML/'
    read_files = glob.glob(os.path.join(path, '*.xml'))

    filename_master = []
    text_master = []

    # Compile once, outside the loop: capture the 14 digits before ".xml"
    my_exp = re.compile(r'.*(\d{14})\.xml')

    for file in read_files:
        parse = ET.parse(file)
        root = parse.getroot()
        all_nodes = list(root.iter())
        ls = [ele.text for ele in all_nodes if ele.findall('[@mark="1"]')]

        name = my_exp.match(file).group(1)

        filename_master.append(name)
        text_master.append(ls)

Prolle

1 Answer

If you are sure that every filename is exactly 14 digits followed by .xml, you can slice the digits straight out of the path:

    name = file[-18:-4]
    filename_master.append(name)

or, if you are in a Linux environment (where "/" is the path separator):

    name = file.split('/')[-1][:-4]
    filename_master.append(name)

or, better, let os.path pick out the last path component so the separator does not matter:

    name = os.path.basename(file)[:-4]
    filename_master.append(name)

That said, using a regex is fine IMHO.
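
If you want the tolerance of the os.path version and the validation of the regex together, something like the following might work. This is only a sketch: `extract_id` and `FOURTEEN_DIGITS` are hypothetical names made up for illustration, and the sample paths are invented. It assumes a name that is not exactly 14 digits plus .xml should be skipped rather than raise an error.

```python
import os
import re

# Hypothetical helper names, not from the original post.
FOURTEEN_DIGITS = re.compile(r"\d{14}")


def extract_id(path):
    """Return the 14-digit stem of an .xml path, or None if the name does not fit."""
    # basename drops the directories; splitext separates "name" from ".xml"
    stem, ext = os.path.splitext(os.path.basename(path))
    # fullmatch requires the whole stem to be exactly 14 digits
    if ext.lower() == ".xml" and FOURTEEN_DIGITS.fullmatch(stem):
        return stem
    return None


if __name__ == "__main__":
    # Made-up example paths
    print(extract_id("/Users/Downloads/PDF/XML/20190101123456.xml"))
    print(extract_id("/Users/Downloads/PDF/XML/notes.xml"))
```

Inside the loop you would then call `name = extract_id(file)` and append only when `name` is not None, which avoids the AttributeError that `my_exp.match(file).group(1)` raises when a file does not match.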

armamut