I am iterating through a folder of XML files to extract some text from each one, and I want to keep track of which file each text match came from.
I want to put the filenames into the filename_master list. I think I may be over-complicating this by using a regex (each filename is 14 digits followed by .xml), but a simpler approach isn't coming to me.
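import glob
import os
import re
import xml.etree.ElementTree as ET

path = '/Users/Downloads/PDF/XML/'
read_files = glob.glob(os.path.join(path, '*.xml'))
# Compile once, outside the loop: captures the 14-digit part of the filename
my_exp = re.compile(r'.*(\d{14})\.xml')
filename_master = []
text_master = []
for file in read_files:
    parse = ET.parse(file)
    root = parse.getroot()
    all_nodes = list(root.iter())
    # Keep the text of every element whose mark attribute is "1"
    ls = [ele.text for ele in all_nodes if ele.get('mark') == '1']
    name = my_exp.match(file).group(1)
    # Same index in both lists, so filename_master[i] goes with text_master[i]
    filename_master.append(name)
    text_master.append(ls)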
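
A regex-free way to get the name might look like the sketch below. It assumes every filename really is just the 14 digits plus .xml, so stripping the directory and the extension leaves the part I want. The helper name file_id and the example filename are made up for illustration.

```python
import os


def file_id(path):
    """Return the filename without its directory or extension.

    Hypothetical helper: assumes the name is only the 14 digits plus '.xml'.
    """
    base = os.path.basename(path)      # e.g. '20190102030405.xml'
    stem, _ext = os.path.splitext(base)  # split off the '.xml' extension
    return stem


# Made-up example path in the same folder layout as above
example = '/Users/Downloads/PDF/XML/20190102030405.xml'
print(file_id(example))
```

Inside the loop this would replace the regex lines with filename_master.append(file_id(file)). Unlike the regex, it does not check that the name is actually 14 digits, so any stray .xml file in the folder would be recorded under whatever its name is.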