I have an XML file with content like this:

content = """<?xml version="1.0" ?>
<passage>
  <title>Aggrecan Turnover</title>
  <author>Winsz-Szczotka K,Kuźnik-Trocha K,Komosińska-Vassev K,Jura-Półtorak A,Olczyk K</author>
  <source>Disease markers</source>
  <description>
   xxxxxxx
  </description>
  <filename>26924871.xml</filename>
  <passage_url>http://www.ncbi.nlm.nih.gov/pubmed/26924871</passage_url>
  <received_date>2016-03-02</received_date>
  <parameter_date>2016-02-29</parameter_date>
</passage>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "xml")
soup.find("author")

Result:

On Windows:

<author>Winsz-Szczotka K,Kuźnik-Trocha K,Komosińska-Vassev K,Jura-Półtorak A,Olczyk K</author>

On Linux:

Nothing is found. When I change the <author> node to <author>Winsz-Szczotka</author>, the node is found on both Windows and Linux. What causes this?

Besides, when I change the parser to html.parser on Linux, it works. I am confused: the content is XML, so why does html.parser work? Any pointers would be appreciated, thanks.

Spacedman
aodavid
  • Are you sure that your XML file is properly UTF-8 encoded? Double-check. If you created it on Windows, chances are that it is actually not UTF-8, but some ANSI encoding. Use a hex editor to find out the byte encoding of the character `ź`, for example. It should be `0xC5 0xBA`. Is it? – Tomalak Sep 03 '16 at 07:28
  • The code works fine for me on Linux and returns the node as with your Windows example when I cut and paste (note I've done some tiny edits to make it cut and pasteable). – Spacedman Sep 03 '16 at 07:29
  • @Spacedman That's why I suspect a byte encoding error. If you copy the code off the web site, it will probably work fine. – Tomalak Sep 03 '16 at 07:34
  • @Tomalak It is `Winsz-Szczotka K,Ku\xc5\xbanik-Trocha K,Komosi\xc5\x84ska-Vassev K,Jura-P\xc3\xb3\xc5\x82torak A,Olczyk K` on Linux. It looks like an encoding error, but I don't know how to fix it. Can you give me some advice? Thanks. – aodavid Sep 03 '16 at 07:53
  • Ah, I tried `soup = BeautifulSoup(content.decode('utf-8'))` and it works fine. Thanks, @Tomalak @Spacedman – aodavid Sep 03 '16 at 08:02
  • Compare this question/answer and the comments below the answer, too. http://stackoverflow.com/questions/36144192/how-to-get-python-bs4-to-work-properly-on-xml – Tomalak Sep 03 '16 at 08:27
  • Generally, don't store any XML in your source code. Use separate files and file handles (`soup = BeautifulSoup(open("your.xml"), "xml")`) to parse them. Explicitly declare the XML byte encoding you are using. UTF-8 is the default value, but explicit is better than implicit. – Tomalak Sep 03 '16 at 08:37
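
The checks discussed in the comments can be sketched roughly as follows: look at the UTF-8 bytes of a suspicious character, decode the raw bytes to text explicitly, and only then hand the text to the parser. This is a minimal sketch, not a definitive fix. It assumes Python 3 (the `content.decode('utf-8')` call above suggests the original ran on Python 2), and the names `RAW`, `utf8_hex`, `to_text` and `find_author` are made up for illustration. The `"xml"` parser in `find_author` assumes that `bs4` and `lxml` are installed.

```python
# Shortened stand-in for the XML from the question, as raw UTF-8 bytes.
RAW = (
    b'<?xml version="1.0" ?>\n'
    b"<passage>\n"
    b"  <author>Winsz-Szczotka K,Ku\xc5\xbanik-Trocha K</author>\n"
    b"</passage>"
)


def utf8_hex(char):
    """Return the UTF-8 bytes of one character as a hex string."""
    return " ".join("{:02x}".format(b) for b in bytearray(char.encode("utf-8")))


def to_text(raw, encoding="utf-8"):
    """Decode raw bytes to text.

    Raises UnicodeDecodeError if the bytes are not valid in the given
    encoding, which makes a mis-encoded file fail loudly.
    """
    return raw.decode(encoding)


def find_author(raw):
    """Decode first, then parse, mirroring the fix reported in the comments."""
    # Imported lazily: bs4 and lxml are third-party and assumed to be installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(to_text(raw), "xml")
    return soup.find("author")
```

Calling `utf8_hex("ź")` shows which bytes a correctly encoded file should contain for that character, and `find_author(RAW)` is the decode-then-parse step. If `to_text` raises on the real file, the file is probably not UTF-8 and needs to be re-saved or decoded with its actual encoding.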
