
I have a forum with 3 threads. I am trying to scrape the data in all three posts, so I need to follow the href link to each post and scrape the data. This is giving me an error and I'm not sure what I am doing wrong...

import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com').text

soup = BeautifulSoup(source, 'lxml')

#get the thread href (thread_link)
for threads in soup.find_all('p', class_= 'small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')# there are three threads and this gets all 3 links
    print (thread_link)

The rest of the code is where I am having the issue:

# request the individual thread links
for follow_link in thread_link:
    response = requests.get(follow_link)

    #parse thread link
    soup= BeautifulSoup(response, 'lxml')

    #print Data
    for p in soup.find_all('p'):
        print(p)
blake
  • Dear Blake - it would be helpful, to fully understand and get a grasp of this, if you would post the full code. That might help all the learning folks here (especially me) extend their insights and understanding. - thx in advance - yours zero – zero May 07 '20 at 23:22
  • @zero what do you mean? am I missing something? – blake May 07 '20 at 23:28
  • does it successfully navigate to the other links? what happens if you print the whole html document? – bherbruck May 07 '20 at 23:29
  • @TenaciousB nope, no luck with any of the links... tbh I have never done link navigation with BS4... most of the guides tell you how to get the href but not what you do ONCE you get it... I can print the href fine (top section of code), that's about it... I am pretty much overwriting each link with that loop, and that might be a bit of a problem, but something I can deal with later... what I need right now is for it to at least navigate into one of the links... the error I am getting is: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h? – blake May 07 '20 at 23:32
  • You might be missing the `.text` in `response = requests.get(follow_link)` – Darien Schettler May 07 '20 at 23:52

2 Answers


As to your schema error...

You're getting the schema error because you overwrite `thread_link` on every pass of the first loop, so once that loop finishes it holds a single string rather than a list of links. When you then iterate over it as if it were a list of links, you actually iterate over the characters of that string (starting with 'h'), and `requests.get('h')` raises the MissingSchema error.

See here: requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied
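To make that concrete, here is a minimal sketch of the direct fix, reusing the URL and selectors from your question - collect the hrefs into a list first, then loop over that list (treat it as an illustration, not tested code):

import requests
from bs4 import BeautifulSoup

source = requests.get('https://mainforum.com').text
soup = BeautifulSoup(source, 'lxml')

# collect every thread href in a list instead of overwriting one variable
thread_links = [threads.a.get('href')
                for threads in soup.find_all('p', class_='small')]

# each follow_link is now a full URL, not a single character
for follow_link in thread_links:
    response = requests.get(follow_link)
    thread_soup = BeautifulSoup(response.text, 'lxml')

    for p in thread_soup.find_all('p'):
        print(p.text)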


As to the general query and how to solve something like this...

If I were to do this, the flow would go as follows:

  1. Get the three hrefs (similar to what you've already done)
  2. Use a function that scrapes the thread hrefs individually and returns whatever you want them to return
  3. Save/append that returned information wherever you want.
  4. Repeat

Something like this, perhaps:

import csv
import time
from bs4 import BeautifulSoup
import requests

source = requests.get('https://mainforum.com')

soup = BeautifulSoup(source.content, 'lxml')

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)

    # parse the individual thread page
    soup = BeautifulSoup(response.content, 'lxml')

    # return the text of every <p> tag in the thread
    return [p.text for p in soup.find_all('p')]

# get the thread href (thread_link) and scrape each thread
for threads in soup.find_all('p', class_='small'):
    this_thread_info = {}
    this_thread_info["thread_name"] = threads.text
    this_thread_info["thread_link"] = threads.a.get('href')
    this_thread_info["thread_data"] = scrape_thread_link(this_thread_info["thread_link"])
    all_thread_info.append(this_thread_info)

print(all_thread_info)

There's quite a lot left unspecified in the original question, so I made some assumptions. Hopefully, though, you can see the gist.

Also note that I prefer to use the response's `.content` (the raw bytes) instead of `.text`, which lets BeautifulSoup handle the encoding detection itself.

Darien Schettler
  • Hey, thanks! It seems to be working; I made some adjustments. It works with one link and with three links. Would love your feedback on whether I am still okay? – blake May 08 '20 at 00:25

@Darien Schettler I made some changes/adjustments to the code; would love to hear if I messed up somewhere?

all_thread_info = []

def scrape_thread_link(href):
    response = requests.get(href)
    soup= BeautifulSoup(response.content, 'lxml')

    for Thread in soup.find_all(id= 'discussionReplies'):
        Thread_Name = Thread.find_all('div', class_='xg_user_generated')
        for Posts in Thread_Name:
            print(Posts.text)


for threads in soup.find_all('p', class_= 'small'):
    thread_name = threads.text
    thread_link = threads.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.append(thread_data)
blake
  • You're missing a return statement within the scrape_thread_link function. I think you want to create a list and append `Posts.text` to that list. You can then return that list. Then you would use `all_thread_info.extend(thread_data)` instead of `append` (see the sketch after these comments). Also, as a note, you shouldn't be naming variables with a capitalized first letter. See here for more info on naming conventions - https://realpython.com/python-pep8/ If this doesn't make sense then make another question and post the link to that question as a reply to this comment. I will post an answer on that new question. – Darien Schettler May 08 '20 at 20:43
  • Also, you shouldn't post additional questions as 'answers' to your original question. They'll end up getting removed by moderators. Just create a new question and you can reference it in a comment. – Darien Schettler May 08 '20 at 20:45
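
For reference, a rough sketch of what that comment describes - the `discussionReplies` id and `xg_user_generated` class come from the snippet above, and the variable names follow PEP 8; treat it as a starting point rather than tested code:

import requests
from bs4 import BeautifulSoup

def scrape_thread_link(href):
    # fetch and parse one thread page
    response = requests.get(href)
    thread_soup = BeautifulSoup(response.content, 'lxml')

    # collect the post text in a list and return it instead of printing
    posts = []
    for replies in thread_soup.find_all(id='discussionReplies'):
        for post in replies.find_all('div', class_='xg_user_generated'):
            posts.append(post.text)
    return posts

source = requests.get('https://mainforum.com')
soup = BeautifulSoup(source.content, 'lxml')

all_thread_info = []
for thread in soup.find_all('p', class_='small'):
    thread_link = thread.a.get('href')
    thread_data = scrape_thread_link(thread_link)
    all_thread_info.extend(thread_data)  # extend, because thread_data is a list

print(all_thread_info)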