0

File Structure
I have a folder, called test_folder, which has several subfolders (named different fruit names, as you'll see in my code below) within it. In each subfolder, there is always a metadump.xml file where I am extracting information from.

Current Stance
I have been able to achieve this on an individual basis, where I specify the subfolder path.

import re

in_file = open("C:/.../Downloads/test_folder/apple/metadump.xml")
contents = in_file.read()
in_file.close()

title = re.search('<dc:title rsfieldtitle="Title" 
rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
contents).group(1)
print(title)

Next Steps
I would like to perform the following function on a larger scale by simply referencing the parent folder C:/.../Downloads/test_folder and making my program find the xml file for each subfolder to extract the desired information, rather than individually specifying every fruit subfolder.

Clarification
Rather than simply obtaining a list of subfolders or a list of xml files within these subfolders, I would like physically access these subfolders to perform this text extraction function from each xml file within each subfolder.

Thanks in advance for your help.

Hanman004
  • 117
  • 1
  • 9
  • 1
    Possible duplicate of [Getting a list of all subdirectories in the current directory](https://stackoverflow.com/questions/973473/getting-a-list-of-all-subdirectories-in-the-current-directory) Everything you need to solve your problem is explained in the linked question – tnknepp Jun 06 '18 at 13:32
  • @tnknepp Please correct me if I'm wrong, but I believe this refers to simply obtaining a list of subdirectories. I believe this differs as I am finding these subfolders, but also opening them to access files within them. – Hanman004 Jun 06 '18 at 13:41
  • 1
    @Hanmann004: I see your clarification above, and this helps. It seems the key in solving your problem is identifying subfolders (the xml always has the same name), which is the same problem Brad Zeis was facing. Everything after that is straight forward. NOTE: you do not need to "open" a subdirectory, only the xml file within the directory (or am I missing something?). – tnknepp Jun 06 '18 at 13:54
  • @tnknepp Yes, regarding your note, you are right. As I've learned, it seems I do not need to actually "open" a subdirectory to accomplish this, thank you. – Hanman004 Jun 06 '18 at 15:23

4 Answers4

2

you can use os.listdir as the following:

import os
parent_folder = 'C:/.../Downloads/test_folder'
subfolders = os.listdir(parent_folder)
for subfolder in subfolders:
    in_file = open(parent_folder+'/'+ subfolder+'/metadump.xml')
    contents = in_file.read()
    in_file.close()
    title = re.search('<dc:title rsfieldtitle="Title" 
    rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
    contents).group(1)
    print(title)
Walid Da.
  • 948
  • 1
  • 7
  • 15
2

You can do this by using glob module if you are not sure number of subfolders in your folder. recursive=True will make it to check for all subfolders in your folder C:/../Downloads/test_folder/ and gives you list of all the metadump.xml files

import re
import glob
for file in glob.glob("C:/**/Downloads/test_folder/**/metadump.xml", recursive=True):
    with open(file) as in_file:
        contents= in_file.read()
    title = re.search('<dc:title rsfieldtitle="Title" 
rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', 
contents).group(1)
    print(title)
Murali
  • 364
  • 2
  • 11
  • 1
    This is interesting behavior of glob. I had no idea you could completely "wildcard" directories. Thanks for sharing. – tnknepp Jun 06 '18 at 13:39
  • @Murali Thank you for the solution. As a small edit, I believe your suggestion in line 3 should read *glob.glob* – Hanman004 Jun 06 '18 at 14:08
1

This might help you:

import os
for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".xml"):
            print(os.path.join(root, file))
Petronella
  • 2,327
  • 1
  • 15
  • 24
  • The output for this seems to be a list of my xml files in these subfolders. My ultimate goal is to *open* these subfolders to extract information from these xml files in these subfolders. – Hanman004 Jun 06 '18 at 13:44
1

You can use Python's os.walk() to traverse all of the subfolders. If the file is metadump.xml, it will open it and extract your title. The filename and the title is displayed:

import os

for root, dirs, files in os.walk(r"C:\...\Downloads\test_folder"):
    for file in files:
        if file == 'metadump.xml':
            filename = os.path.join(root, file) 

            with open(filename) as f_xml:
                contents = f_xml.read()
                title = re.search('<dc:title rsfieldtitle="Title" rsembeddedequiv="Name" rsfieldref="8" rsfieldtype="0">(.+?)</dc:title>', contents).group(1)
                print('{} : {}'.format(filename, title))
Martin Evans
  • 45,791
  • 17
  • 81
  • 97
  • @Martin_Evans Thank you for the solution, this works as well. For my purposes, I will probably modify to simply print the *title*. Upvoted. – Hanman004 Jun 06 '18 at 15:26