I want to create a dataframe of Substack posts from all the newsletters I subscribe to. But feedparser with Substack's RSS feeds only seems to go back about 20 posts, even if a particular newsletter has hundreds of older posts.

Is there a way to use RSS to get all the old posts too? Or is there another method that returns the same data as the RSS feed and doesn't involve scraping with BeautifulSoup?

import feedparser
import pandas as pd

# Feed URLs of the newsletters to collect
rawrss = ['https://heathercoxrichardson.substack.com/feed', 'https://marcstein.substack.com/feed']

posts = []
for url in rawrss:
    # Parse the feed; only the entries currently in the feed are returned
    feed = feedparser.parse(url)
    for post in feed.entries:
        posts.append((post.title, post.link, post.summary, post.summary_detail, post.content, post.published))

# One row per post, one column per extracted field
df = pd.DataFrame(posts, columns=['title', 'link', 'summary', 'summary_detail', 'content', 'published'])
print(df)
user53526356
I've never used feedparser, but take a look at this post: https://stackoverflow.com/questions/1676223/feedparser-retrieve-old-messages-from-google-reader . According to it, the RSS feeds you're looking at may simply not store older entries. – rayad Jun 11 '22 at 19:24

1 Answer

There's an unofficial Substack API available for that. Here's a curl request that fetches the second page of the most recent posts (limit is the page size, offset is the number of posts to skip):

curl https://ava.substack.com/api/v1/posts\?limit\=50\&offset\=50

Note that this is an unofficial API, so it can change at any time.
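
To collect every post instead of a single page, the same request can be repeated with an increasing offset until nothing comes back. The following is a minimal sketch, not a tested client: it assumes the endpoint returns a JSON list of post objects and that an empty list marks the end, neither of which is documented. The helper names fetch_page and fetch_all_posts and the max_pages guard are made up for illustration.

```python
import json
import urllib.parse
import urllib.request


def fetch_page(base_url, limit, offset):
    """Fetch one page of posts (assumption: the response body is a JSON list)."""
    query = urllib.parse.urlencode({"limit": limit, "offset": offset})
    url = f"{base_url}/api/v1/posts?{query}"
    # A browser-like User-Agent is sent in case the default urllib one is rejected.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all_posts(base_url, limit=50, get_page=fetch_page, max_pages=1000):
    """Page through the endpoint until an empty page is returned.

    get_page is injectable so the loop can be exercised without network access.
    max_pages is a safety cap (hypothetical value) against endless loops.
    """
    posts = []
    offset = 0
    for _ in range(max_pages):
        batch = get_page(base_url, limit, offset)
        if not batch:
            break  # assumption: an empty list means there are no more posts
        posts.extend(batch)
        # Advance by what was actually received, in case the server caps the page size.
        offset += len(batch)
    return posts
```

The result of fetch_all_posts('https://ava.substack.com') could then be passed to pd.DataFrame(...) to get one row per post. The available columns depend on whatever fields the API returns, so check them before relying on names such as title or published.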

shime