Extracting data from an XML

Question

I've looked at several examples and haven't been able to adapt one to fit my needs. I'm trying to extract the maker and model tags from a file, but no matter which previously answered question I find, I can't get it to work for me.

Edit: It's probably not different. What's different is my level of understanding of Python. Trying to edit the scripts provided in the different answers already on Stack, I've been unable to get any of them to work.

<camera>
    <maker>Fujifilm</maker>
    <model>GFX 50S</model>
    <mount>Fujifilm G</mount>
    <cropfactor>0.79</cropfactor>
</camera>

Look for the `BeautifulSoup` library. There is plenty of documentation on the internet. – Andrej Kesely, Jul 25 '18 at 18:34
It very well may be, but I couldn't figure out how to properly edit it to get the results I needed. I was hoping someone could help provide the specifics. My knowledge of Python is at a beginner level, but I need to do this for a proposal and don't want to just copy/paste them all. – RobertL78, Jul 25 '18 at 18:35
I hate to lose rep over this, but it's just something I don't understand, and trying to edit the different solutions posted has yielded no results. – RobertL78, Jul 25 '18 at 18:36
Try xmltodict:
import xmltodict
with open('c:\\temp\\data.xml') as fd:
    doc = xmltodict.parse(fd.read())
print(doc['camera']['maker'])
print(doc['camera']['model'])
https://docs.python-guide.org/scenarios/xml/ – Any Moose, Jul 25 '18 at 18:48

score 0 · Answer 1 · answered Jul 25 '18 at 18:36


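Take a look at the Python docs.

import xml.etree.ElementTree as ET

# xml_string holds the XML document as text (shortened sample shown here)
xml_string = '<camera><maker>Fujifilm</maker><model>GFX 50S</model></camera>'

root = ET.fromstring(xml_string)  # root is the <camera> element
maker = root.findtext('maker')    # 'Fujifilm'
model = root.findtext('model')    # 'GFX 50S'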
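
For a file holding many entries, the same approach might look something like the sketch below. It assumes the `<camera>` elements sit under a single root element (called `cameras` here purely for illustration), since well-formed XML needs exactly one root. `xml_data` is made-up sample data standing in for the file's contents.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the <cameras> wrapper element is an assumption.
xml_data = """
<cameras>
    <camera>
        <maker>Fujifilm</maker>
        <model>GFX 50S</model>
    </camera>
    <camera>
        <maker>thing1</maker>
        <model>thing2</model>
    </camera>
</cameras>
"""

root = ET.fromstring(xml_data)

# iter() walks every <camera> element below the root, at any depth.
results = [
    (camera.findtext('maker'), camera.findtext('model'))
    for camera in root.iter('camera')
]

for maker, model in results:
    print(f'Make: {maker}\nModel: {model}')
```

To read from a file instead of a string, `ET.parse('path/to/your/file').getroot()` can take the place of `ET.fromstring(xml_data)`.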


Jesse Bakker


Steven M · Accepted Answer · 2018-07-27T17:45:55.827

0

Try bs4 (BeautifulSoup):

from bs4 import BeautifulSoup

page = '''
        <camera>
            <maker>Fujifilm</maker>
            <model>GFX 50S</model>
            <mount>Fujifilm G</mount>
            <cropfactor>0.79</cropfactor>
        </camera>
        '''

soup = BeautifulSoup(page, 'lxml')
make = soup.find('maker')
model = soup.find('model')
print(f'Make: {make.text}\nModel: {model.text}')

For multiple entries, just loop through them with find_all():

from bs4 import BeautifulSoup

page = '''
        <camera>
            <maker>Fujifilm</maker>
            <model>GFX 50S</model>
            <mount>Fujifilm G</mount>
            <cropfactor>0.79</cropfactor>
        </camera>
        <camera>
            <maker>thing1</maker>
            <model>thing2</model>
            <mount>Fujifilm G</mount>
            <cropfactor>0.79</cropfactor>
        </camera>
        <camera>
            <maker>thing3</maker>
            <model>thing4</model>
            <mount>Fujifilm G</mount>
            <cropfactor>0.79</cropfactor>
        </camera>
        <camera>
            <maker>thing5</maker>
            <model>thing6</model>
            <mount>Fujifilm G</mount>
            <cropfactor>0.79</cropfactor>
        </camera>
        '''

soup = BeautifulSoup(page, 'lxml')
make = soup.find_all('maker')
model = soup.find_all('model')
for x, y in zip(make, model):
    print(f'Make: {x.text}\nModel: {y.text}')
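
One caveat with zipping two separate find_all() results: if an entry is missing its maker or model tag, the pairs drift out of step. A possible variation, sketched here on the assumption that every entry is wrapped in its own `<camera>` tag, looks up both tags inside each camera instead. It uses the built-in 'html.parser' builder so lxml isn't needed, and the sample data is made up.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second camera deliberately has no <model> tag.
page = '''
<camera>
    <maker>Fujifilm</maker>
    <model>GFX 50S</model>
</camera>
<camera>
    <maker>thing1</maker>
</camera>
'''

soup = BeautifulSoup(page, 'html.parser')

pairs = []
for camera in soup.find_all('camera'):
    make = camera.find('maker')
    model = camera.find('model')
    # find() returns None when the tag is absent, so guard before using .text
    pairs.append((
        make.text if make is not None else None,
        model.text if model is not None else None,
    ))

for make, model in pairs:
    print(f'Make: {make}\nModel: {model}')
```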

Getting data from a file:

from bs4 import BeautifulSoup

with open('path/to/your/file') as file:
    page = file.read()
    soup = BeautifulSoup(page, 'lxml')
    make = soup.find_all('maker')
    model = soup.find_all('model')
    for x, y in zip(make, model):
        print(f'Make: {x.text}\nModel: {y.text}')

Without importing any modules:

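with open('/PATH/TO/YOUR/FILE') as file:
    for line in file:
        line = line.strip()  # drop indentation and the trailing newline
        # Work on the whole line so values containing spaces stay intact
        if line.startswith("<maker>") and line.endswith("</maker>"):
            print(line.replace("<maker>", "").replace("</maker>", ""))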

This handles the 'maker' tag only. It might be beneficial to split this up into separate function definitions, one per tag, and iterate through them.
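
As a rough sketch of that suggestion, the no-import approach could be folded into one small generator function that takes the tag name as a parameter. `extract_tag` is a hypothetical name, and the code assumes each tag opens and closes on the same line with no attributes, as in the sample. It is not a real XML parser.

```python
def extract_tag(lines, tag):
    """Yield the text between <tag> and </tag> for each matching line."""
    open_tag = f"<{tag}>"
    close_tag = f"</{tag}>"
    for line in lines:
        line = line.strip()
        if line.startswith(open_tag) and line.endswith(close_tag):
            # Slice off the opening and closing tags, keeping the text between
            yield line[len(open_tag):-len(close_tag)]


# Sample lines standing in for the contents of a file.
sample = [
    "<camera>",
    "    <maker>Fujifilm</maker>",
    "    <model>GFX 50S</model>",
    "</camera>",
]

makers = list(extract_tag(sample, "maker"))
models = list(extract_tag(sample, "model"))

for make, model in zip(makers, models):
    print(f"Make: {make}\nModel: {model}")
```

With a real file, reading it into a list first (for example with `file.readlines()` inside a `with open(...)` block) lets both calls scan the same lines.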

edited Jul 27 '18 at 17:45

answered Jul 25 '18 at 18:38

Steven M


Would this work for a file filled with those types of entries? Or would I just have to paste all the entries inside those quote marks? – RobertL78 Jul 25 '18 at 18:49
you would need the find_all() function and then you would iterate over all of them – Steven M Jul 25 '18 at 18:50
yeah, use the "with open()" function... try the last bit of code i posted, just replace the 'path/to/your/file' but keep the quotes – Steven M Jul 25 '18 at 18:59
were you able to get this to work? @RobertL78 – Steven M Jul 25 '18 at 20:18
I have not had 2 minutes to try yet. I'll be trying over the weekend.. Thank you. – RobertL78 Jul 26 '18 at 19:12
So I found some time to try this and keep getting the following error:
Traceback (most recent call last):
  File "/Users/RobertL/Documents/extract-xml.py", line 5, in <module>
    soup = BeautifulSoup(page, 'xml')
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 165, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?
I've installed beautifulsoup4 and lxml along with msgpack, and six==1.10.0 per other threads. – RobertL78 Jul 26 '18 at 19:45
using the last method to extract from a file. – RobertL78 Jul 26 '18 at 19:48
Did you do `pip install bs4`? – Steven M Jul 26 '18 at 19:49
I did `pip3 install beautifulsoup4` and just now tried `pip install bs4`, but I still get the same error. Both seem to install something and don't say it was already installed. – RobertL78 Jul 26 '18 at 19:52
Hmm... strange, bs4 is the package, try: `pip3 install bs4 -U` – Steven M Jul 26 '18 at 19:57
Installing collected packages: bs4 Running setup.py install for bs4 ... done Successfully installed bs4-0.0.1 Same error when running the script though. It should just work without having to reboot or anything I assume. – RobertL78 Jul 26 '18 at 19:58
No reboot, how are you executing the script, through an IDE, IDLE, etc? – Steven M Jul 26 '18 at 20:00
Yes I'm running it through Idle but in a new window saved with a file name then F5 to run – RobertL78 Jul 26 '18 at 20:00
That seems to be the problem. I needed to run it from terminal... running from IDLE just kept failing. – RobertL78 Jul 26 '18 at 20:01
Oh I see, line 5 needs to be... soup = BeautifulSoup(page, "html.parser") or soup = BeautifulSoup(page, "lxml") – Steven M Jul 26 '18 at 20:04
it actually worked without that edit when I ran through terminal. I installed the lxml package. just had to add an extra print() statement to separate the output a bit. – RobertL78 Jul 26 '18 at 20:04
Nice! Glad everything worked out – Steven M Jul 26 '18 at 20:07
Would it be possible, if you don't mind, to show how to do this without having to import any modules? I'm really wondering if Python can do this with only what's included, even if the code is longer and less elegant. I'm in an intro to Python class and this isn't homework. I've just noticed that seasoned people always import modules to make things simpler, but I want to know how to do it without modules if at all possible. If you don't feel like it that's okay, but I thought I'd ask. – RobertL78 Jul 27 '18 at 15:03
I gotcha. I left the 'model' tag untouched, but the 'maker' tag works and filters fine. It might be beneficial for you to look into 'python generators' and create 2 separate functions with this code. I think that will give you the desired outcome. – Steven M Jul 27 '18 at 17:48
