
I'd like to take a web-hosted XML podcast feed and loop through it, writing each title into a txt file named after the matching guid, i.e. abcd.mp3.txt (or abcd.txt) will contain "This is the title".

<rss xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" version="2.0">
<channel>
    <item>
    <title>This is the title</title>
    <enclosure url="http://www.example.com/abcd.mp3" length="402024" type="audio/mpeg"/>
    <guid>http://www.example.com/abcd.mp3</guid> 

I've searched SO for the question and looked at xmlstarlet, xmlgrep, and xmlsh. Then there are things like Osmosis, which looks powerful but requires Node and lacks practical documentation. Ideally I'd use as few external dependencies as possible (although Python 3.6 is installed).

After a morning at this I'm starting to wonder if I'm over-thinking/complicating things. Any pointers appreciated.

lardconcepts
  • Is a Perl solution acceptable? We really need a proper sample of the XML. – Borodin Mar 30 '17 at 12:20
  • You can use XSLT/Xpath to produce a text file as the output (http://stackoverflow.com/questions/5908668/use-xsl-to-output-plain-text#5910638) – Hobbes Mar 30 '17 at 12:35
  • Thanks @Borodin. Perl would be fine - although regarding a "proper sample", I thought xml of a specific type (eg podcast-1.0.dtd) was fairly standard? – lardconcepts Mar 30 '17 at 12:47
  • Thanks @Hobbes - I'm not entirely sure how I'd apply this to a command line scenario? I thought XSLT was a stylesheet that you applied to a web page to render an xml file in-browser? – lardconcepts Mar 30 '17 at 12:49
  • XSLT is a language for transforming XML documents. The stylesheet you mention is just one of its applications. There are command-line XSLT processors, e.g. http://www.saxonica.com and MSXSL. – Hobbes Mar 30 '17 at 12:52

1 Answer


OK, after much messing with stylesheets, I stumbled upon BeautifulSoup.

And the answer is as simple as this (hat tip to this guide):

pip install bs4
pip install lxml

and then

#! /usr/bin/env python3
from bs4 import BeautifulSoup
import requests
url = 'http://www.example.com/somepodcast.xml'
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')  # choose the lxml parser
titles = soup.find_all('title')
for title in titles:
    print(title.get_text())  # the tag's text; print(title) would include the <title> markup

Thanks for the other suggestions, but this works for me without messing with XPaths, regexes, etc.
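For the guid-to-filename part of the original question, here is a minimal sketch using only the standard library (no BeautifulSoup or requests needed). The inline feed_xml string stands in for the downloaded feed; swap in the real feed content. It takes the basename of each guid URL and writes the title to that name plus .txt:

```python
#!/usr/bin/env python3
# Sketch: write each item's title into a file named after its guid's
# basename, e.g. http://www.example.com/abcd.mp3 -> abcd.mp3.txt
import os
import urllib.parse
import xml.etree.ElementTree as ET

# Stand-in for the downloaded feed; replace with the real XML content.
feed_xml = """<rss version="2.0"><channel>
<item>
  <title>This is the title</title>
  <guid>http://www.example.com/abcd.mp3</guid>
</item>
</channel></rss>"""

root = ET.fromstring(feed_xml)
for item in root.iter('item'):
    title = item.findtext('title')
    guid = item.findtext('guid')
    # basename of the guid URL's path: 'abcd.mp3'
    name = os.path.basename(urllib.parse.urlparse(guid).path)
    with open(name + '.txt', 'w') as f:
        f.write(title + '\n')
```

After running this against the sample feed, abcd.mp3.txt contains "This is the title".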

lardconcepts