BeautifulSoup XML Only printing first line

Question

I'm using BeautifulSoup4 (and lxml) to parse an XML file. For some reason, when I print soup.prettify() it only prints the first line:

from bs4 import BeautifulSoup

# Open the file in a context manager so it is closed after parsing
with open('xmlDoc.xml', 'r') as f:
    soup = BeautifulSoup(f, 'xml')

print(soup.prettify())

#>>> <?xml version="1.0" encoding="utf-8"?>

Any idea why it's not grabbing everything?

UPDATE:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

<!-- Data Junction generated file.
Macro type "1000" is reserved. -->
<djmacros>
  <macro name="Test" type="5000" value="TestValue">
    <description>test</description>
  </macro>
  <macro name="AnotherTest" type="0" value="TestValue2"/>
  <macro name="TestLocation" type="1000" value="C:\RandomLocation">
    <description> </description>
  </macro>
<djmacros>

I'm having similar troubles. I suspect it's actually not capturing anything. (If you try your code on malformed XML I expect it will still return just the xml header). — chobok, Mar 23 '12 at 14:10
Hmm, I just tried cutting and pasting your xml. It seems to be working ok for me. What versions are you using? — chobok, Mar 23 '12 at 14:11

jfs · Accepted Answer · 2012-03-08T18:33:36.563

4

The file position is at EOF:

>>> soup = BeautifulSoup("", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'

Or the content is not valid xml:

>>> soup = BeautifulSoup("no <root/> element", 'xml')
>>> soup.prettify()
'<?xml version="1.0" encoding="utf-8">\n'
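
The first case (file position at EOF) can be reproduced without BeautifulSoup. The following is a minimal sketch using only the standard library; it assumes the file object was already read once before being handed to the parser. The `io.StringIO` object stands in for the real file, the sample content is made up, and the helper name `read_all` is hypothetical.

```python
import io

def read_all(f):
    """Rewind a file-like object and return its whole content."""
    f.seek(0)          # move the file position back to the beginning
    return f.read()

# io.StringIO stands in for open('xmlDoc.xml'); the content is made up.
f = io.StringIO('<?xml version="1.0" encoding="utf-8"?>\n<root/>\n')

first = f.read()       # consumes the stream, position is now at EOF
second = f.read()      # nothing left to read, so this is an empty string
rewound = read_all(f)  # rewinding makes the full content available again

print(repr(second))
print(rewound == first)
```

If the parser is given a file object in the `second` state, it effectively receives an empty document, which matches the first example above.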

edited Mar 08 '12 at 18:33

answered Mar 08 '12 at 18:28

jfs


@moreisee: do `f.seek(0)` to rewind the file to the beginning. You might have already consumed it (with the code that you haven't shown). – jfs Mar 08 '12 at 18:45
That's all the python code that exists. Just getting my feet wet with BeautifulSoup. Edit: Tried it anyways, with no luck. – moreisee Mar 08 '12 at 18:46
@moreisee: read the file into a string `s = f.read()`. Inspect `repr(s)` to see if there is anything unusual ('\0' bytes; BOM mark despite utf-8 declaration; it should work with any line separator, but check what it is ('\r', '\n') anyway). – jfs Mar 08 '12 at 18:58
@J.F.Sebastian Nope, the file is using \n only. I also copied and pasted the XML into a string instead of reading from the file, same issue. – moreisee Mar 08 '12 at 19:03
@moreisee: the content is not empty on `bs4.__version__ == '4.0.0b8'` – jfs Mar 08 '12 at 19:08
@J.F.Sebastian I'm on '4.0.0b10' – moreisee Mar 08 '12 at 19:11
I wonder if this is an lxml issue. I installed lxml 2.2.8 for py2.7 via the .exe installer (Windows :( ), but only after installing bs4. – moreisee Mar 08 '12 at 19:16
@moreisee: It might be if `bs4` uses `lxml` tree builder. You could try it directly: `import lxml.etree as E; print(repr(E.tostring(E.fromstring(s))))` – jfs Mar 08 '12 at 19:49
I get the whole XML doc in a single string. Does this rule out an issue with lxml? – moreisee Mar 08 '12 at 19:54
@moreisee: it means that `lxml` by itself works. Run [`test_bs4.py`](https://gist.github.com/6deb8175ed03647981c3) – jfs Mar 08 '12 at 20:16
@moreisee: have you tried to [click on the link](https://gist.github.com/6deb8175ed03647981c3)? – jfs Mar 08 '12 at 22:28
I just tried BeautifulSoup 3 and it worked. I also tried your link and it worked. Now to decide whether to use 3 or 4. – moreisee Mar 08 '12 at 22:36
Awesome. Got it working, I think I'm going to go ahead with bs4. Thanks for all your help. It's odd that using repr is not in the documentation though. – moreisee Mar 08 '12 at 22:39

score 2 · Answer 2 · answered Mar 25 '22 at 08:48

I had the same problem with a valid XML file. The problem was that the XML file was encoded in UTF-8 with a BOM.

I discovered that by printing the raw content:

content = open(path, "r").read()
print(content)

And I got (see this thread: What's ï»¿ sign at the beginning of my source file?):

ï»¿<?xml version="1.0" encoding="utf-8"?>
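
One way to check for the BOM and drop it before parsing is sketched below, using only the standard library. The helper name `strip_utf8_bom` is hypothetical and the sample bytes are made up; the sketch assumes the file really is UTF-8.

```python
import codecs

def strip_utf8_bom(raw):
    """Return (text, had_bom) for UTF-8 bytes that may start with a BOM."""
    had_bom = raw.startswith(codecs.BOM_UTF8)
    # 'utf-8-sig' drops a leading BOM if present and otherwise acts like 'utf-8'
    return raw.decode('utf-8-sig'), had_bom

# Made-up sample: a BOM followed by a small XML document.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="utf-8"?>\n<root/>\n'
text, had_bom = strip_utf8_bom(raw)

print(had_bom)
print(repr(text[:5]))
```

With a real file, `raw` would come from `open(path, "rb").read()`, and the decoded `text` could then be passed to the parser in place of the file object.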

matteogll


I had the same issue; changing the encoding to UTF-8 without BOM solved it. – Tariq M Nasim Jan 25 '23 at 23:52

score 1 · Answer 3 · answered Mar 23 '12 at 16:24

As per J.F. Sebastian's answer, the XML is invalid.

Your final tag is incorrect:

<djmacros>

The correct tag is:

</djmacros>

You can confirm this with an XML validator, e.g. http://www.w3schools.com/xml/xml_validator.asp
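
The same check can also be run locally. Below is a minimal sketch with the standard library's `xml.etree.ElementTree`; the helper name `well_formedness_error` is hypothetical, and the two sample strings are shortened stand-ins for the document in the question. Note that this only tests well-formedness, not validity against a schema.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(xml_text):
    """Return None if xml_text parses, else the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return str(exc)
    return None

broken = '<djmacros><macro name="Test"/><djmacros>'    # opening tag repeated
fixed = '<djmacros><macro name="Test"/></djmacros>'    # proper closing tag

print(well_formedness_error(broken))
print(well_formedness_error(fixed))
```

The first call should report a parse error because the root element is never closed, while the second should print `None`.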

chobok


score 0 · Answer 4 · answered May 24 '18 at 20:16

If the file is encoded as UTF-8 with a BOM instead of plain UTF-8, BeautifulSoup may fail to parse it even if the XML is otherwise valid.

Smith
