3

I'm using BeautifulSoup4 (and lxml) to parse an XML file. For some reason, when I print soup.prettify() it only prints the first line:

from bs4 import BeautifulSoup

# Open the file and parse it with the lxml-backed XML parser
with open('xmlDoc.xml', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

print(soup.prettify())

#>>> <?xml version="1.0" encoding="utf-8"?>

Any idea why it's not grabbing everything?

UPDATE:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!-- Data Junction generated file.
Macro type "1000" is reserved. -->
<djmacros>
  <macro name="Test" type="5000" value="TestValue">
    <description>test</description>
  </macro>
  <macro name="AnotherTest" type="0" value="TestValue2"/>
  <macro name="TestLocation" type="1000" value="C:\RandomLocation">
    <description> </description>
  </macro>
<djmacros>
moreisee
  • I'm having similar troubles. I suspect it's actually not capturing anything. (If you try your code on malformed XML I expect it will still return just the xml header). – chobok Mar 23 '12 at 14:10
  • Hmm, I just tried cutting and pasting your xml. It seems to be working ok for me. What versions are you using? – chobok Mar 23 '12 at 14:11

4 Answers

4

The file position is at EOF:

>>> soup = BeautifulSoup("", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'

Or the content is not valid xml:

>>> soup = BeautifulSoup("no <root/> element", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
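The first case can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: `io.StringIO` stands in for the real `open('xmlDoc.xml')` handle, and `read_from_start` is a hypothetical helper name, not part of bs4.

```python
import io


def read_from_start(f):
    """Rewind a seekable file object, then return its whole content."""
    f.seek(0)
    return f.read()


# Stand-in for open('xmlDoc.xml'); an assumption made for the demo.
f = io.StringIO("<root><child/></root>")

first = f.read()            # consumes the stream
second = f.read()           # position is now at EOF, so this is empty
again = read_from_start(f)  # rewinding makes the content readable again

print(repr(first))
print(repr(second))
print(repr(again))
```

If the second read is what ends up being passed to the parser, it sees an empty document, which matches the header-only output shown above.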
jfs
  • @moreisee: do `f.seek(0)` to rewind the file to the beginning. You might have already consumed it (with the code that you haven't shown). – jfs Mar 08 '12 at 18:45
  • That's all the python code that exists. Just getting my feet wet with BeautifulSoup. Edit: Tried it anyways, with no luck. – moreisee Mar 08 '12 at 18:46
  • @moreisee: read the file into a string `s = f.read()`. Inspect `repr(s)` to see if there is anything unusual ('\0' bytes; BOM mark despite utf-8 declaration; it should work with any line separator, but check what it is ('\r', '\n') anyway). – jfs Mar 08 '12 at 18:58
  • @J.F. Sebastian Nope, the file is using \n only. I also copied and pasted the XML into a string instead of reading from the file; same issue. – moreisee Mar 08 '12 at 19:03
  • @moreisee: the content is not empty on `bs4.__version__ == '4.0.0b8'` – jfs Mar 08 '12 at 19:08
  • @J.F. Sebastian I'm on '4.0.0b10' – moreisee Mar 08 '12 at 19:11
  • I wonder if this is an lxml issue, I've installed 2.2.8 for py2.7 via .exe (Windows :( ) But it was after installing bs4. – moreisee Mar 08 '12 at 19:16
  • @moreisee: It might be if `bs4` uses `lxml` tree builder. You could try it directly: `import lxml.etree as E; print(repr(E.tostring(E.fromstring(s))))` – jfs Mar 08 '12 at 19:49
  • I get the whole XML doc in a single string, does this rule out an issue with lxml? – moreisee Mar 08 '12 at 19:54
  • @moreisee: it means that `lxml` by itself works. Run [`test_bs4.py`](https://gist.github.com/6deb8175ed03647981c3) – jfs Mar 08 '12 at 20:16
  • @moreisee: have you tried to [click on the link](https://gist.github.com/6deb8175ed03647981c3)? – jfs Mar 08 '12 at 22:28
  • I just tried BeautifulSoup 3 and it worked. I also tried your link and it worked. Now to... to use 3 or 4. – moreisee Mar 08 '12 at 22:36
  • Awesome. Got it working, I think I'm going to go ahead with bs4. Thanks for all your help. It's odd that using repr is not in the documentation though. – moreisee Mar 08 '12 at 22:39
2

I had the same problem with a valid XML file. The problem was that the XML file was encoded in UTF-8 with a BOM.

I discovered that by printing the raw content:

content = open(path, "r").read()
print(content)

And I got the following (see this thread: What's "ï»¿" sign at the beginning of my source file?):

ï»¿<?xml version="1.0" encoding="utf-8"?>
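One way to check for and remove a UTF-8 BOM uses only the standard library. This is a minimal sketch, not something from the answer itself: `strip_utf8_bom` is a hypothetical helper name, and the `raw` bytes are made-up sample data standing in for the file content read in binary mode.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Sample bytes standing in for open(path, "rb").read(); an assumption.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?><root/>'

has_bom = raw.startswith(codecs.BOM_UTF8)  # detect the BOM explicitly
clean = strip_utf8_bom(raw)                # bytes without the BOM
text = raw.decode("utf-8-sig")             # this codec drops the BOM while decoding

print(has_bom)
print(repr(clean[:5]))
print(repr(text[:5]))
```

Either `clean` or `text` can then be handed to the parser instead of the raw file content.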

matteogll
1

As per J.F. Sebastian's answer, the XML is invalid.

Your final tag is incorrect:

<djmacros>

The correct tag is:

</djmacros>

You can confirm this with an XML validator, e.g. http://www.w3schools.com/xml/xml_validator.asp
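The same check can be done locally. The sketch below swaps in the standard library's `xml.etree.ElementTree` rather than bs4 or lxml, because it raises an error on malformed input instead of silently returning an empty document. `well_formed_error` is a hypothetical helper name, and the two strings are shortened versions of the XML from the question.

```python
import xml.etree.ElementTree as ET


def well_formed_error(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# Shortened from the question: the last tag opens a new element instead of closing one.
broken = '<djmacros>\n  <macro name="Test"/>\n<djmacros>'
fixed = '<djmacros>\n  <macro name="Test"/>\n</djmacros>'

print(well_formed_error(broken))  # an error message from the parser
print(well_formed_error(fixed))   # None
```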

chobok
0

If the file is encoded as UTF-8 with a BOM rather than plain UTF-8, parsing may fail even if the XML is otherwise valid.

Smith