
I have the following web service: 'https://news.google.com/news/rss/?ned=us&hl=en'

I need to parse it and get the title and date values of each item in the XML file.

I tried saving the data to an XML file and parsing it, but I see only blank values:

import requests
import xml.etree.ElementTree as ET

response = requests.get('https://news.google.com/news/rss/?ned=us&hl=en')
with open('text.xml','w') as xmlfile:
    xmlfile.write(response.text)

with open('text.xml','rt') as f:
    tree = ET.parse(f)

for node in tree.iter():
    print (node.tag, node.attrib)

I am not sure where I am going wrong. I need to extract the title and published date of every item in the XML.

Thanks for any answers in advance.

Subhayan Bhattacharya

1 Answer


@Ilja Everilä is right, you should use feedparser. There is no need to write an XML file unless you want to archive it.

I didn't really get what output you expected, but something like this works (Python 3):

import feedparser

url = 'https://news.google.com/news/rss/?ned=us&hl=en'
d = feedparser.parse(url)
# print the feed title
print(d['feed']['title'])
# print a list of (title, first tag term) tuples, one per entry
print([(entry['title'], entry['tags'][0]['term']) for entry in d['entries']])

To print the values explicitly encoded as UTF-8, use:

print([(entry['title'].encode('utf8'), entry['tags'][0]['term'].encode('utf8')) for entry in d['entries']])
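Note that encoding to UTF-8 prints bytes objects rather than readable text. As an alternative, here is a minimal sketch that replaces only the characters the console cannot display. The helper name `console_safe` is hypothetical, and `cp437` is passed explicitly only to imitate a Windows console; normally you would let it default to `sys.stdout.encoding`.

```python
import sys

def console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent replaced by '?'."""
    # Hypothetical helper: default to the console encoding, then fall back to ASCII.
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

# U+2014 has no mapping in cp437, so it is replaced instead of raising an error.
print(console_safe("Breaking news \u2014 headline", encoding="cp437"))
```

With this approach the titles stay ordinary strings, at the cost of losing the characters that the console code page cannot show.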

Maybe if you show your expected output, we could help you to get the right content from the parser.
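If you would rather stay with ElementTree as in the question, here is a minimal sketch. The values looked blank because the loop prints `node.attrib`, while RSS keeps the title and date as element text. The sample XML is made up for illustration, and I am assuming the feed uses the standard RSS 2.0 `item`, `title` and `pubDate` elements; with the real feed you could pass `response.content` to the (hypothetical) `titles_and_dates` function.

```python
import xml.etree.ElementTree as ET

# Made-up sample standing in for the downloaded feed body.
SAMPLE_RSS = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>First headline</title>
      <pubDate>Mon, 03 Jul 2017 14:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <pubDate>Mon, 03 Jul 2017 15:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""

def titles_and_dates(xml_bytes):
    """Return a list of (title, pubDate) text pairs, one per <item>."""
    root = ET.fromstring(xml_bytes)
    # findtext reads the text of the child element, not its attributes.
    return [(item.findtext("title"), item.findtext("pubDate"))
            for item in root.iter("item")]

for title, published in titles_and_dates(SAMPLE_RSS):
    print(title, published)
```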

mquantin
  • I am getting a strange issue here: – Subhayan Bhattacharya Jul 03 '17 at 14:32
  • Traceback (most recent call last): File "Hacker.py", line 8, in <module> print((d['entries'][i]['title'], d['entries'][i]['published'])) File "C:\Users\bhatsubh\AppData\Local\Programs\Python\Python35\lib\encodings\cp437.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_map)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 50: character maps to <undefined> – Subhayan Bhattacharya Jul 03 '17 at 14:32
  • My main purpose is to take a datetime value from the user and use it to retrieve posts published after that date – Subhayan Bhattacharya Jul 03 '17 at 14:33
  • This error is a character encoding issue: Python is trying to encode the text for your console's code page (cp437 according to the traceback), which has no mapping for the character '\u2014'. Try: `print([(d['entries'][i]['title'].encode('utf8'), d['entries'][i]['tags'][0]['term'].encode('utf8')) for i in range(len(d['entries']))])` – mquantin Jul 03 '17 at 14:58
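
For the goal mentioned in the comments, retrieving posts published after a user-supplied date, a rough sketch could look like the following. It assumes each entry's `published` value is an RFC 822 style date string, as is usual in RSS, and it uses hard-coded sample entries and a fixed cutoff in place of the real feed and user input. The function name `published_after` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def published_after(entries, cutoff):
    """Keep the (title, published) pairs whose date is later than cutoff."""
    return [(title, published) for title, published in entries
            if parsedate_to_datetime(published) > cutoff]

# Sample data standing in for (entry['title'], entry['published']) pairs.
entries = [
    ("First headline", "Mon, 03 Jul 2017 14:00:00 GMT"),
    ("Second headline", "Mon, 03 Jul 2017 15:30:00 GMT"),
]

# Stand-in for the user's input; it is treated as UTC so that it can be
# compared with the timezone-aware dates parsed from the feed.
cutoff = datetime.strptime("2017-07-03 15:00", "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)

print(published_after(entries, cutoff))
```

Both sides of the comparison must be timezone-aware (or both naive), otherwise Python raises a TypeError, which is why the cutoff is given an explicit timezone here.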