
I'm learning Python. I've set myself a wee goal of building an RSS scraper. I'm trying to gather the author, link and title of each post, and from there I want to write them to a CSV.

I'm encountering some problems. I've searched for the answer since last night but can't seem to find a solution. I have a feeling there's a bit of knowledge I'm missing between what feedparser is parsing and moving it to a CSV, but I don't have the vocabulary yet to know what to Google.

  1. How do I remove special characters such as `[` and `'`?
  2. How do I write author, link and title to a new row when I'm creating the new file?

1) Special Characters

import feedparser

rssurls = 'http://feeds.feedburner.com/TechCrunch/'

techart = feedparser.parse(rssurls)
# feeds = []

# for url in rssurls:
#     feedparser.parse(url)
# for feed in feeds:
#     for post in feed.entries:
#         print(post.title)

# print(feed.entires)

techdeets = [post.author + " , " + post.title + " , " + post.link  for post in techart.entries]
techdeets = [y.strip() for y in techdeets]
techdeets

Output: I get the information I need, but the `.strip()` call doesn't strip out the brackets and quotes.

['Darrell Etherington , Spin launches first city-sanctioned dockless bike sharing in Bay Area , http://feedproxy.google.com/~r/Techcrunch/~3/BF74UZWBinI/', 'Ryan Lawler , With $5.3 million in funding, CarDash wants to change how you get your car serviced , http://feedproxy.google.com/~r/Techcrunch/~3/pkamfdPAhhY/', 'Ron Miller , AlienVault plug-in searches for stolen passwords on Dark Web , http://feedproxy.google.com/~r/Techcrunch/~3/VbmdS0ODoSo/', 'Lucas Matney , Firefox for Windows gets native WebVR support, performance bumps in latest update , http://feedproxy.google.com/~r/Techcrunch/~3/j91jQJm-f2E/',...]
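For what it's worth, a short sketch of why `.strip()` appears to do nothing here (the sample string below is made up): `str.strip()` with no arguments only removes leading and trailing whitespace, and the `[` and `'` characters aren't in the strings at all. They come from printing the *list itself* (its repr), not from the strings inside it.

```python
# str.strip() with no arguments removes only leading/trailing
# whitespace; it never touches characters inside a string.
techdeets = ['Darrell Etherington , Spin launches... , http://example.com/1']

# Printing the list shows its repr: brackets and quotes appear.
print(techdeets)

# Printing an element shows the bare string: no brackets or quotes.
print(techdeets[0])
```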

2) Writing to CSV

import csv

savedfile = open('/test1.txt', 'w')
savedfile.write(str(techdeets) + "\n")
savedfile.close()

import pandas as pd
df = pd.read_csv('/test1.txt', encoding='cp1252')
df

Output: a dataframe with only 1 row and multiple columns.
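A hedged sketch of question 2 using the stdlib `csv` module (the rows and filename here are made up, not taken from the feed): `csv.writer` handles the quoting for you, and each `writerow`/`writerows` entry lands on its own line, which avoids the single-row problem that `str(techdeets)` produces.

```python
import csv

# Made-up sample rows standing in for the parsed feed entries.
rows = [
    ('Darrell Etherington', 'Spin launches dockless bike sharing', 'http://example.com/1'),
    ('Ryan Lawler', 'CarDash raises funding', 'http://example.com/2'),
]

# newline='' is what the csv docs recommend when opening the file.
with open('test1.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['author', 'title', 'link'])  # header row
    writer.writerows(rows)                        # one row per entry
```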

Nick Duddy
  • You can use a regex to eliminate anything that is not in `[a-zA-Z0-9_]` like so: `re.sub(r'\W', '', string)`, where `r'\W'` is the raw string for the (shorthand) complement of the character range above, `''` is the replacement (in this case, an empty string), and `string` is the arbitrary name for the string you want to operate on. – GH05T Aug 08 '17 at 14:10
  • `techdeets = [post.author + " , " + post.title + " , " + post.link for post in techart.entries]` replace with: `techdeets = [','.join([*post]) for post in techart.entries]` – GH05T Aug 08 '17 at 14:14
  • `savedfile = open('/test1.txt', 'a')` opens the file in **append** mode. – GH05T Aug 08 '17 at 14:16
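The regex idea from the comments can be sketched like so (the sample string is made up). Note that `\W` matches *every* character outside `[a-zA-Z0-9_]`, so substituting it away also removes spaces, commas and slashes; if you only want to drop the brackets and quotes, a character class targeting just those is safer:

```python
import re

s = "['Darrell Etherington , Spin launches...']"

# \W is the complement of [a-zA-Z0-9_]: this removes brackets and
# quotes, but also spaces, commas, dots and slashes.
print(re.sub(r'\W', '', s))

# Targeting only the unwanted characters keeps the rest intact.
print(re.sub(r"[\[\]']", '', s))
```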

1 Answer


You are almost there :-)

How about using pandas to create a dataframe first and then save it? Something like this, continuing from your code:

df = pd.DataFrame(columns=['author', 'title', 'link'])
for i, post in enumerate(techart.entries):
    df.loc[i] = post.author, post.title, post.link

then you can save it:

df.to_csv('myfilename.csv', index=False)

OR

you can also write into the dataframe straight from the feedparser entries:

>>> import feedparser
>>> import pandas as pd
>>>
>>> rssurls = 'http://feeds.feedburner.com/TechCrunch/'
>>> techart = feedparser.parse(rssurls)
>>>
>>> df = pd.DataFrame()
>>>
>>> df['author'] = [post.author for post in techart.entries]
>>> df['title'] = [post.title for post in techart.entries]
>>> df['link'] = [post.link for post in techart.entries]
Aziz Alto
  • I used the first solution and it created the dataframe. It hadn't occurred to me to use pandas in that way; I didn't even think it was possible. However, there is one problem: it's now putting the author, title and link in one column, but then replicating that column 3 times – Nick Duddy Aug 09 '17 at 18:48
  • Second example worked. Would be interested in finding out how to make the for loop work as it seems much more efficient, I think? – Nick Duddy Aug 09 '17 at 20:48
  • @NickDuddy that's true, my bad! I just updated the for loop to get the entries correct :) – Aziz Alto Aug 10 '17 at 14:43
  • However, in general I think it is not efficient to enter rows one by one, especially with a large dataframe; see the comments in this thread: https://stackoverflow.com/q/10715965/2839786 – Aziz Alto Aug 10 '17 at 15:06
  • Also, I see that pandas strips out all the characters I didn't want! This works, thanks! – Nick Duddy Aug 10 '17 at 21:42
  • Good to know :-) – Aziz Alto Aug 10 '17 at 22:17
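Following up on the efficiency point in the comments: a common alternative to assigning `df.loc[i]` row by row (my sketch, with made-up entries standing in for `techart.entries`) is to build the whole DataFrame in one constructor call from a list of dicts:

```python
import pandas as pd

# Made-up entries standing in for techart.entries.
entries = [
    {'author': 'Darrell Etherington', 'title': 'Spin launches...', 'link': 'http://example.com/1'},
    {'author': 'Ryan Lawler', 'title': 'CarDash raises...', 'link': 'http://example.com/2'},
]

# One constructor call instead of a row-by-row loop; the columns
# argument fixes the column order.
df = pd.DataFrame(entries, columns=['author', 'title', 'link'])
df.to_csv('feed.csv', index=False)
```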