
The script should find the addresses of the subpages that contain articles and collect the necessary data from them. The data should end up in a database, but I don't know how to make the script pull the content of each article from every page of the blog.

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://xxx/'

r = requests.get(url)
# Extract HTML
html = r.text
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")

# Get the text

text = soup.get_text()
# Create tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Create tokens
tokens = tokenizer.tokenize(text)

# Initialize new list
words = []

# Loop through list

for word in tokens:
    words.append(word.lower())

# Get English stopwords
sw = nltk.corpus.stopwords.words('english')

# Initialize new list
words_ns = []

for word in words:
    if word not in sw:
        words_ns.append(word)

# plotting
freqdist1 = nltk.FreqDist(words_ns)
freqdist1.plot(25)

print(soup.get_text())
tbone
  • Hi there tbone, thanks for this great question - it hits the point of a problem I also want to solve. Thanks for this great thread! – zero Jun 14 '20 at 10:19

1 Answer

You could do the whole thing with BeautifulSoup and requests. The text extraction code is by @nmgeek; the same question over there has other methods to choose from. I am guessing you can then handle the text with nltk. The method is nice because you can play with which selectors you add to the list. You can achieve something similar with a selector list passed to select, i.e. [item.text for item in soup.select('selector list goes here')].
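
For example, a minimal sketch of that selector-list idea (the HTML and the class names '.post-title' and '.post-content' here are made up for illustration):

from bs4 import BeautifulSoup

html = '<h1 class="post-title">Title</h1><div class="post-content"><p>First paragraph.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() accepts a comma-separated group of CSS selectors;
# the comprehension collects the text of every matching element
texts = [item.text for item in soup.select('.post-title, .post-content p')]
print(texts)  # ['Title', 'First paragraph.']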

Edit: The code below gets you all the links, but the website seems to block you after a while. Have a look at rotating IPs and/or User-Agents in the loop over all_links.
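
A minimal sketch of what rotating the User-Agent inside that loop could look like (the User-Agent strings are placeholders, and all_links here is a stand-in for the list built in the code further down; IP rotation via a proxy pool would slot in the same way):

import random
import requests

# placeholder pool of User-Agent strings - replace with real ones as needed
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)',
]

all_links = ['https://teonite.com/blog/']  # stand-in; use the list built below

with requests.Session() as s:
    for article in all_links:
        # pick a different User-Agent for each request
        headers = {'User-Agent': random.choice(user_agents)}
        r = s.get(article, headers=headers)
        # ... parse r.content with BeautifulSoup as in the code below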

If you have to resort to selenium, at least you have the list of all article links, which you can loop over and .get with selenium.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent' : 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

    all_links = [item for i in all_links for item in i]

    for article in all_links:
        #print(article)
        r = s.get(article, headers = headers)
        soup = bs(r.content, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()   # taken from https://stackoverflow.com/a/19760007/6241235 @nmgeek
        # here I think you need to consider IP rotation/User-Agent changing
        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break
        # do something with text
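
The question also mentions getting the data into a database. As a minimal sketch (the SQLite file name, table name and columns are mine, not part of the original code), the title and visible_text collected in the loop above could be saved like this:

import sqlite3

# assumption: a local SQLite file; table and column names are illustrative only
conn = sqlite3.connect('articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, body TEXT)')

def save_article(url, title, body):
    # INSERT OR REPLACE keeps the script re-runnable without duplicate rows
    conn.execute('INSERT OR REPLACE INTO articles (url, title, body) VALUES (?, ?, ?)',
                 (url, title, body))
    conn.commit()

# inside the loop above:
# save_article(article, soup.select_one('.post-title').text, visible_text)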

Adding in selenium seems to solve the bad-request problem of being blocked:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

all_links = [item for i in all_links for item in i]

d = webdriver.Chrome()

for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')
    [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
    visible_text = soup.getText()   # taken from https://stackoverflow.com/a/19760007/6241235 @nmgeek

    try:
        print(soup.select_one('.post-title').text)
    except:
        print(article)
        print(soup.select_one('h1').text)
        break #for debugging
    # do something with text
d.quit()
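
If launching a visible browser window for every article is too heavy, Chrome can usually be run headless instead (a sketch, assuming a selenium/ChromeDriver version that accepts the options argument):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
d = webdriver.Chrome(options=options)
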
QHarr
  • Thanks! It works for me but it doesn't go to the next pages – tbone May 27 '19 at 19:02
  • It does for me - if you mean it loops over the articles. If I put print(soup.select_one('.post-title').text) into the loop as the last line I see each post title. – QHarr May 27 '19 at 19:08
  • Or do you mean each blog has multiple pages? – QHarr May 27 '19 at 19:08
  • Yes, the blog has 7 pages – tbone May 27 '19 at 19:12
  • Hi, how can I add to this code a function that would give the 10 most common words with their counts, and the 10 most common words with their counts per author? This code is great but it's different from mine and I don't know how to define these functions – tbone May 28 '19 at 14:16
  • Bag of words maybe? Something optimised for this over the alternative of creating one long string and using split to generate an array - loop, adding to a dictionary with the count as value and the word as key. Don't know if there are limitations on keys. – QHarr May 28 '19 at 14:22 (a sketch follows after these comments)
  • Pretty sure there must be an existing answer related to this on SO or the Code Review site – QHarr May 28 '19 at 14:23
  • I tried to use the existing loops from the OP but it fails – tbone May 28 '19 at 17:19
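
Following up on the comment thread above about the 10 most common words (overall and per author), a minimal sketch with collections.Counter - how the author name would be scraped for the per-author split (e.g. from some author element on each article page) is an assumption, not something shown in the code above:

from collections import Counter, defaultdict

def top_words(words, n=10):
    # words is a list of lowercased, stopword-filtered tokens (like words_ns above)
    return Counter(words).most_common(n)  # list of (word, count) pairs

# per-author counts: map author name -> that author's tokens;
# filling this dict requires scraping an author name per article (assumption)
author_words = defaultdict(list)
# author_words[author].extend(article_tokens)  # call inside the article loop

top_per_author = {author: top_words(tokens) for author, tokens in author_words.items()}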