
I'm trying to webscrape https://old.reddit.com/r/all/ and get the entries on the first page.

When I run my code, it works, but for post_text it only copies the last post on the Reddit page 25 times. I know this is because it's getting the last entry and then writing it each time through the loop.

import requests
import urllib.request
from bs4 import BeautifulSoup as soup

my_url = 'https://old.reddit.com/r/all/'

request = urllib.request.Request(my_url,headers={'User-Agent': 'your bot 0.1'})
response = urllib.request.urlopen(request)
page_html = response.read()

page_soup = soup(page_html, "html.parser")

posts = page_soup.findAll("div", {"class": "top-matter"})
post = posts[0]

authors = page_soup.findAll("p", {"class":"tagline"})
author = authors[0]

filename = "redditAll.csv"
f = open(filename, "w")
headers = "Title of the post, Author of the post\n"
f.write(headers)

for post in posts:
    post_text = post.p.a.text.replace(",", " -")

for author in authors:
    username = author.a.text

    f.write(post_text + "," + username + "\n")
f.close()

Changed this

for post in posts:
    post_text = post.p.a.text.replace(",", " -")

for author in authors:
    username = author.a.text

To that

for post, author in zip(posts, authors):
    post_text = post.p.a.text.replace(",", " -")
    username = author.a.text

GhostCat

3 Answers


You're running the two loops separately. In your code below, the first loop assigns a string to post_text on each pass but does nothing else with it, so when that loop finishes, post_text holds only the last post's title. The authors loop then writes that same string once for each author.

for post in posts:
    post_text = post.p.a.text.replace(",", " -")

for author in authors:
    username = author.a.text

    f.write(post_text + "," + username + "\n")

Assuming that there are an equal number of elements in posts and authors, you should be able to fix it with the following:

for i in range(len(posts)):
    post_text = posts[i].p.a.text.replace(",", " -")
    username = authors[i].a.text

    f.write(post_text + "," + username + "\n")
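
To see the difference without hitting Reddit, here is a minimal sketch using plain lists as stand-ins for posts and authors. The names titles, names, buggy_rows and fixed_rows are made up for illustration and are not part of the original script.

```python
# Stand-in data; in the real script these values come from BeautifulSoup.
titles = ["first post", "second post", "third post"]
names = ["alice", "bob", "carol"]

# Two separate loops: once the first loop ends, last_title holds
# only the final title, so every row repeats it.
buggy_rows = []
for title in titles:
    last_title = title
for name in names:
    buggy_rows.append(last_title + "," + name)

# One loop over a shared index keeps each title with its own author.
fixed_rows = []
for i in range(len(titles)):
    fixed_rows.append(titles[i] + "," + names[i])

print(buggy_rows)
print(fixed_rows)
```

The first list repeats "third post" on every row, which matches the symptom in the question; the second pairs each title with its author.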
LTheriault
  • Thanks, that makes so much sense. I knew the loop was incorrect, but I've been looking at it for so long I couldn't wrap my head around it. – GhostCat Apr 17 '20 at 16:00
  • I added your code in and I'm sure it'll work, but now I'm getting `Traceback (most recent call last): File "C:\Users\Me\Desktop\webscrape\reddit.py", line 34, in <module> f.write(post_text + "," + username + "\n") File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.752.0_x64__qbz5n2kfra8p0\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f60d' in position 195: character maps to <undefined>` – GhostCat Apr 17 '20 at 16:02
  • Someone else came across a similar issue with bs4, so hopefully there's an explanation for the particular issue there: https://stackoverflow.com/questions/27092833/unicodeencodeerror-charmap-codec-cant-encode-characters – LTheriault Apr 17 '20 at 16:03
  • Also, check out Keri's answer below. I personally think it's an even better way of handling the loop than what I gave. It's similar, but cleaner. – LTheriault Apr 17 '20 at 16:04
  • The new error is due to unicode characters in the post / author name. The top answer in LTheriault's link should fix it – Keri Apr 17 '20 at 16:06
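
For anyone hitting the same traceback, here is a rough sketch of one common way around it: pass an explicit encoding to open() instead of relying on the Windows default (cp1252 in the traceback above). The sample string and the temporary file path are assumptions for illustration only.

```python
import os
import tempfile

# Sample title containing the same emoji as in the traceback.
text = "great post \U0001f60d"

# cp1252 has no mapping for the emoji, so encoding fails.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# An explicit encoding="utf-8" avoids depending on the platform default.
path = os.path.join(tempfile.mkdtemp(), "redditAll.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write(text + "\n")

# Read it back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

print(cp1252_ok, round_trip == text + "\n")
```

If the CSV is opened later in a spreadsheet program, that program also has to read it as UTF-8; this sketch does not cover that.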

LTheriault is correct, but I'd consider this more idiomatic.

for post, author in zip(posts, authors):
    post_text = post.p.a.text.replace(",", " -")
    username = author.a.text

    f.write(post_text + "," + username + "\n")
Keri
  • I had a feeling I was missing a nicer way of doing it but couldn't figure out what I was overlooking. This is definitely the best way of doing it, IMO. – LTheriault Apr 17 '20 at 16:02
  • I had no idea the 'zip' thing was a thing. Thank you so much. – GhostCat Apr 17 '20 at 16:12
  • "zip() should only be used with unequal length inputs when you don’t care about trailing, unmatched values from the longer iterables. If those values are important, use itertools.zip_longest() instead." https://docs.python.org/3/library/functions.html#zip – Keri Apr 17 '20 at 16:14
  • If you didn't know about zip you might not know about enumerate: https://docs.python.org/3/library/functions.html#enumerate – Keri Apr 17 '20 at 16:16
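
A small sketch of the three built-ins mentioned in these comments, using throwaway lists of unequal length. The names titles, names and the fill value "[unknown]" are made up for illustration.

```python
from itertools import zip_longest

# Deliberately unequal lengths to show the difference.
titles = ["a", "b", "c"]
names = ["x", "y"]

# zip() stops at the shorter input, silently dropping "c".
pairs = list(zip(titles, names))

# zip_longest() keeps going and pads the shorter input with fillvalue.
padded = list(zip_longest(titles, names, fillvalue="[unknown]"))

# enumerate() yields (index, item) pairs; start=1 gives 1-based numbering.
numbered = list(enumerate(titles, start=1))

print(pairs)
print(padded)
print(numbered)
```

For the scraper, plain zip() is only safe if every post really has a matching tagline; if that is in doubt, zip_longest makes the mismatch visible instead of hiding it.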

The problem here is that you're writing to the file object within the scope of the second for loop, for author in authors, so you will indeed write the last value of post_text multiple times.

If you want to combine authors and posts, you can zip them and then iterate over the pairs (assuming they are the same length):


for post, author in zip(posts, authors):
    f.write(f'author: {author}, post: {post}\n')

I would also recommend writing to the file with a context manager, e.g.:

with open('filename.txt', 'w') as f:
   f.write('stuff')
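
Putting the pieces together, here is a hedged sketch of how the writing step could look with a context manager. It swaps the manual replace(",", " -") for the standard library's csv.writer, which quotes fields that contain commas; that swap is my suggestion, not something from the question. The sample titles, names and the temporary path are stand-ins.

```python
import csv
import os
import tempfile

# Stand-in data; in the real script these come from the scraped tags.
titles = ["Hello, world", "Second post"]
names = ["alice", "bob"]

path = os.path.join(tempfile.mkdtemp(), "redditAll.csv")

# The with block closes the file even if an exception is raised inside it.
# newline="" is what the csv docs recommend when opening files for csv.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title of the post", "Author of the post"])
    for title, name in zip(titles, names):
        writer.writerow([title, name])

# Read the file back to check that the comma in the title survived.
with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows)
```

With this approach there is no need to call f.close(), and titles keep their commas.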
NomadMonad
  • If you're not aware, this method of formatting text is called an f-string. It only works with Python 3.6 or later. `f.write(f'author: {author}, post: {post}')` – Keri Apr 17 '20 at 16:07