I couldn't find an answer anywhere. I have an XML document:

<channel>
    <title>xxx</title>
    <description>aaa</description>
    <item>
        <title>theTitle</title>
        <link/>link
    </item>
    <item>
        <title>theTitle2</title>
        <link/>link
    </item>
</channel>

And I need to extract all the links from that file.

I iterate:

linkdict = []  # a plain list, despite the name
for link in soup.channel.findAll('item'):
    links = link.link
    linkdict.append(links)

But the output is:

[<link/>, <link/>, <link/>]

How can I parse this XML, with or without regex? I want the code to be as simple as possible.


UPDATE

I've found a way to fix it by changing one line of code, building the soup with the XML parser instead of the HTML one:

soup = bs4.BeautifulSoup(output, features='xml')
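
A minimal sketch of how that fix plays out end to end. It assumes the original feed wrapped each URL as `<link>...</link>` (the `<link/>link` form in the question being what the HTML parser produced), and the sample URLs are made up for illustration. The `xml` feature needs lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical feed for illustration; the URLs are invented.
raw = """<channel>
  <title>xxx</title>
  <description>aaa</description>
  <item><title>theTitle</title><link>http://example.com/1</link></item>
  <item><title>theTitle2</title><link>http://example.com/2</link></item>
</channel>"""

# features='xml' selects lxml's XML parser, so <link> keeps its text content
# instead of being treated as an empty HTML element.
soup = BeautifulSoup(raw, features="xml")

links = [item.link.get_text(strip=True) for item in soup.find_all("item")]
print(links)
```

With the HTML parser (`'lxml'`), the same loop would return empty `<link/>` tags, as shown in the question.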
Igor Hwang

2 Answers

Install lxml with `pip install lxml`, and then you can easily parse the string using:

soup = BeautifulSoup(xmlString, "lxml")
keshaw
  • Thank you, I did that. XML in the question was received by `soup = bs4.BeautifulSoup(output, 'lxml')`. @user2420450 – Igor Hwang Apr 01 '16 at 09:27

Given that you have lxml installed, you can use it directly instead of going through BeautifulSoup. In lxml's tree model, the text that follows each <link/> element is available as that element's tail:

from lxml import etree

raw = '''<channel> 
  <title>xxx</title>  
  <description>aaa</description>  
  <item> 
    <title>theTitle</title>  
    <link/>link
  </item>  
  <item> 
    <title>theTitle2</title>  
    <link/>link
  </item> 
</channel>'''

root = etree.fromstring(raw)
for link in root.xpath('//item/link'):
    print(link.tail.strip())

Output:

link
link

The XPath expression //item/link means: find item elements anywhere in the current document and return their link child elements. It is also worth mentioning that lxml is known to be faster than BS4 in most cases.
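
The same tail-based approach can also be sketched with the standard library's `xml.etree.ElementTree`, which uses the same text/tail model as lxml. This is an illustrative variant rather than part of the answer above: the second link text is changed to `link2` so the two results are distinguishable, and the `or ""` guard is an added precaution for elements that have no tail.

```python
import xml.etree.ElementTree as ET

raw = """<channel>
  <title>xxx</title>
  <description>aaa</description>
  <item>
    <title>theTitle</title>
    <link/>link
  </item>
  <item>
    <title>theTitle2</title>
    <link/>link2
  </item>
</channel>"""

root = ET.fromstring(raw)

# Text that follows an element's closing tag is stored in its .tail;
# tail is None when nothing follows, hence the fallback to "".
links = [(link.tail or "").strip() for link in root.findall(".//item/link")]
print(links)
```

ElementTree supports only a subset of XPath, so the path is written as `.//item/link` relative to the root rather than `//item/link`.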

References:
1) BeautifulSoup 4 Benchmark
2) Parsing HTML in python - lxml or BeautifulSoup? Which of these is better for what kinds of purposes?

har07
  • Thank you, it worked with a bit of editing. But when I did it, I've found the way to do it with bs4 only and in one line. Updated my own question. – Igor Hwang Apr 02 '16 at 05:18