Python Version: 3.7.2

Here is an XML file:

<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:ls="https://www.littlstar.com">
<channel>
    <title><![CDATA[Name (46 videos)]]></title>
    <description><![CDATA[Name]]></description>
    <link>http://github.com/dylang/node-rss</link>
    <image>
        <url>http://1.1.1.1:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
        <title>bla bla</title>
        <link></link>
    </image>
    <generator>RSS for Node</generator>
    <lastBuildDate>Sat, 23 Feb 2019 11:32:08 +0000</lastBuildDate>
    <category><![CDATA[Local]]></category>
....

And here is the source code. It's very simple:

from bs4 import BeautifulSoup

f = open(xmlpath, 'r')
data = f.read()
f.close()

soup = BeautifulSoup(data, 'html.parser')
print(soup.prettify())

And the result is:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ls="https://www.littlstar.com">
 <channel>
  <title><![CDATA[Name (46 videos)]]></title>
  <description><![CDATA[Name]]></description>
  <link/>
  http://github.com/dylang/node-rss
  <image/>
<url>http://192.168.1.142:3001/thumb\324bfb0834915ccc0edb73b5bf0b82c2.jpeg</url>
...

The contents of the "link" and "image" tags are lost: both come out as empty tags (<link/> and <image/>). How can I solve this problem?

I tried upgrading bs4 and using the lxml parser module...

1 Answer

The explicit .read() and .close() calls are not required here: BeautifulSoup accepts an open file object, and the with block closes the file for you.

Simply:

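from bs4 import BeautifulSoup
with open(xmlpath) as fp:
    # 'html.parser' applies HTML rules, under which <link> is a void tag with no content;
    # the 'xml' parser (requires lxml to be installed) applies XML rules instead.
    soup = BeautifulSoup(fp, 'xml')
    print(soup.prettify())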
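
The emptied tags in the question are consistent with the feed being parsed under HTML rules, where <link> is a void element that cannot hold content. Within BeautifulSoup, the usual way to get XML rules is the 'xml' parser, which needs lxml installed. As a check that does not depend on bs4 at all, here is a minimal sketch using the standard library's xml.etree.ElementTree. The SAMPLE string and the channel_summary helper are hypothetical stand-ins for illustration (a shortened version of the feed above, with a simplified thumbnail URL), not the asker's actual file.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative stand-in for the RSS feed in the question.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
    <title><![CDATA[Name (46 videos)]]></title>
    <link>http://github.com/dylang/node-rss</link>
    <image>
        <url>http://1.1.1.1:3001/thumb.jpeg</url>
        <title>bla bla</title>
        <link></link>
    </image>
</channel>
</rss>"""


def channel_summary(xml_text):
    """Parse RSS text with XML rules and return a few channel fields."""
    # Pass UTF-8 bytes, matching the encoding named in the XML declaration.
    root = ET.fromstring(xml_text.encode("utf-8"))
    channel = root.find("channel")
    return {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
        "image_url": channel.findtext("image/url"),
    }


summary = channel_summary(SAMPLE)
print(summary)
```

If <link> keeps its URL and <image> keeps its <url> child here, the file itself is fine and the loss comes from the choice of parser.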
mhjr