
I would like to ask for help with an RSS program. I am collecting sites that contain relevant information for my project and then checking whether they have RSS feeds. The links are stored in a txt file (one link per line), so I have a txt file full of base URLs that need to be checked for RSS.

I have found this code which would make my job much easier.

import requests
from bs4 import BeautifulSoup

def get_rss_feed(website_url):
    if website_url is None:
        print("URL should not be null")
    else:
        # Download the page and parse its HTML
        source_code = requests.get(website_url)
        plain_text = source_code.text
        # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
        soup = BeautifulSoup(plain_text, "html.parser")
        # RSS feeds are advertised with <link type="application/rss+xml" href="...">
        for link in soup.find_all("link", {"type" : "application/rss+xml"}):
            href = link.get('href')
            print("RSS feed for " + website_url + "is -->" + str(href))

get_rss_feed("http://www.extremetech.com/")

But I would like to read my collected URLs from the txt file rather than typing them in one by one.

So I tried to extend the program with this:

from bs4 import BeautifulSoup, SoupStrainer

with open('test.txt','r') as f:
    for link in BeautifulSoup(f.read(), parse_only=SoupStrainer('a')): 
        if link.has_attr('http'): 
            print(link['http'])

But this returns an error saying that BeautifulSoup is not an HTTP client.

I also tried extending it with this:

def open():
    f = open("file.txt")
    lines = f.readlines()
    return lines

But this gave me a list with the items separated by ",".

I would be really thankful if someone could help me.

Platy

2 Answers


Typically you'd do something like this:

with open('links.txt', 'r') as f:
    for line in f:
        get_rss_feed(line)

Also, it's a bad idea to define a function with the name open unless you intend to replace the builtin function open.
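
To illustrate the point about the name, here is a minimal sketch. It assumes a hypothetical helper called read_lines (the name is not from the question), which leaves the builtin open untouched so the body can still call it. It also hints at why the earlier attempt looked like something "separated with commas": readlines() returns a Python list, and printing a list shows its items in brackets with commas between them.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file as a list of strings.

    read_lines is a hypothetical name; anything other than `open` works,
    because the body still needs the builtin open().
    """
    with open(path, "r") as f:
        return f.readlines()  # each item keeps its trailing newline


if __name__ == "__main__":
    # Write a small sample file so the sketch can run on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("http://www.extremetech.com/\nhttp://www.theguardian.com/\n")

    lines = read_lines(tmp.name)
    print(lines)             # a list, shown with brackets and commas
    for line in lines:       # loop over the list to handle one URL at a time
        print(line.rstrip())
    os.remove(tmp.name)
```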

nemetroid
  • Thank you, I will give it a try. Thanks for the advice about open, I had missed it. – Platy Jun 24 '16 at 21:06
  • I have inserted your suggested code into the program. Now it returns without any error message, but also without any results: `root@loko:~# sudo python /root/Desktop/rsskeres.py` `root@loko:~# sudo python /root/Desktop/rsskeres.py` If I print the lines from your code, I get the URL: `root@loko:~# sudo python /root/Desktop/nyit3.py` `http://www.theguardian.com/` And this is what the original program returns: `root@loko:~# sudo python /root/Desktop/rsskeres.py` `RSS feed for http://www.theguardian.com/is --> http://www.theguardian.com/international/rss` What could be the problem? – Platy Jun 24 '16 at 21:51
  • I imagine you would want `line.rstrip()` – Padraic Cunningham Jun 24 '16 at 23:20
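
Following up on the `line.rstrip()` suggestion: each line read from a file still ends with its newline character, so the string handed to get_rss_feed is not quite the URL that was typed into the file. A minimal sketch of one way to clean the lines first, assuming a hypothetical helper named iter_urls (not part of the question or the answer):

```python
def iter_urls(path):
    """Yield one cleaned-up URL per non-empty line of the file."""
    with open(path, "r") as f:
        for line in f:
            url = line.strip()  # drop the trailing newline and stray spaces
            if url:             # skip blank lines
                yield url
```

With this in place, the loop from the answer would become `for url in iter_urls('links.txt'): get_rss_feed(url)`.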

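I guess you can do it using urllib:

    # Python 3 spelling; in Python 2 the same call was urllib.urlopen
    from urllib.request import urlopen

    # assuming each URL is on its own line
    with open('test.txt', 'r') as f:
        for line in f:
            url = line.strip()  # remove the trailing newline
            if not url:
                continue  # skip blank lines
            mycontent = urlopen(url).read()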
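
The snippet above only downloads each page; the feed link still has to be pulled out of the HTML. A rough sketch of that step using only the standard library's html.parser, as an alternative to the BeautifulSoup lookup in the question. The names FeedLinkFinder and find_feed_links and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link type="application/rss+xml"> tags."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "link":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "application/rss+xml" and attributes.get("href"):
            self.feeds.append(attributes["href"])


def find_feed_links(html_text):
    """Return the list of RSS feed hrefs advertised in an HTML string."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feeds


if __name__ == "__main__":
    # Invented sample page, used only to exercise the parser.
    sample = (
        '<html><head>'
        '<link rel="alternate" type="application/rss+xml" href="http://example.com/rss">'
        '</head></html>'
    )
    print(find_feed_links(sample))
```

If the downloaded content is bytes (as it is with Python 3's urlopen), decode it first, for example with `mycontent.decode('utf-8', errors='replace')`, before passing it to find_feed_links.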
danielarend