
The script should find the addresses of the subpages that contain articles and collect the necessary data from them. The data should end up in a database, but I don't know how to make the script pull the content of each article from every page of the blog.

import requests
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
import nltk
import matplotlib.pyplot as plt
import seaborn as sns

url = 'https://xxx/'

r = requests.get(url)
# Extract HTML
html = r.text
# Create a BeautifulSoup object from the HTML
soup = BeautifulSoup(html, "html5lib")

# Get the text

text = soup.get_text()
# Create tokenizer
tokenizer = RegexpTokenizer(r'\w+')

# Create tokens
tokens = tokenizer.tokenize(text)

# Initialize new list
words = []

# Loop through list

for word in tokens:
    words.append(word.lower())

# Get English stopwords
sw = nltk.corpus.stopwords.words('english')

# Initialize new list
words_ns = []

for word in words:
    if word not in sw:
        words_ns.append(word)

# plotting
freqdist1 = nltk.FreqDist(words_ns)
freqdist1.plot(25)

print(soup.get_text())
tbone
  • Hi there tbone, thanks for this great question - it hits the point of a problem I also want to solve. Thanks for this great thread! – zero Jun 14 '20 at 10:19

1 Answer

You could do the whole thing with BeautifulSoup and requests. The text extraction code is by @nmgeek; the same question over there has other methods to choose from. I am guessing you can then handle the text with nltk. The method is nice because you can play with which selectors you add to the list. You can achieve something similar with a selector list passed to select, i.e. [item.text for item in soup.select('selector list goes here')].
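
For example, a minimal sketch of that selector-list idea (the HTML and the class names '.post-title' and '.post-content' here are made up for illustration):

from bs4 import BeautifulSoup

html = '<h1 class="post-title">Title</h1><div class="post-content"><p>First paragraph.</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() accepts a comma-separated group of CSS selectors;
# the comprehension collects the text of every matching element
texts = [item.text for item in soup.select('.post-title, .post-content p')]
print(texts)  # ['Title', 'First paragraph.']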

Edit: The code below gets you all the links, but the website seems to block you after a while. Have a look at rotating IPs and/or User-Agents in the loop over all_links.
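
A minimal sketch of what rotating the User-Agent inside that loop could look like (the User-Agent strings are placeholders, and all_links here is a stand-in for the list built in the code further down; IP rotation via a proxy pool would slot in the same way):

import random
import requests

# placeholder pool of User-Agent strings - replace with real ones as needed
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (X11; Linux x86_64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15)',
]

all_links = ['https://teonite.com/blog/']  # stand-in; use the list built below

with requests.Session() as s:
    for article in all_links:
        # pick a different User-Agent for each request
        headers = {'User-Agent': random.choice(user_agents)}
        r = s.get(article, headers=headers)
        # ... parse r.content with BeautifulSoup as in the code below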

If you have to resort to selenium, at least you have the list of all article links, which you can loop over and .get with selenium.

import requests
from bs4 import BeautifulSoup as bs

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

headers = {
    'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent' : 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

    all_links = [item for i in all_links for item in i]

    for article in all_links:
        #print(article)
        r = s.get(article, headers = headers)
        soup = bs(r.content, 'lxml')
        [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
        visible_text = soup.getText()   # taken from https://stackoverflow.com/a/19760007/6241235 @nmgeek
        # here I think you need to consider IP rotation/User-Agent changing
        try:
            print(soup.select_one('.post-title').text)
        except:
            print(article)
            print(soup.select_one('h1').text)
            break
        # do something with text
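
The question also mentions getting the data into a database. As a minimal sketch (the SQLite file name, table name and columns are mine, not part of the original code), the title and visible_text collected in the loop above could be saved like this:

import sqlite3

# assumption: a local SQLite file; table and column names are illustrative only
conn = sqlite3.connect('articles.db')
conn.execute('CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, title TEXT, body TEXT)')

def save_article(url, title, body):
    # INSERT OR REPLACE keeps the script re-runnable without duplicate rows
    conn.execute('INSERT OR REPLACE INTO articles (url, title, body) VALUES (?, ?, ?)',
                 (url, title, body))
    conn.commit()

# inside the loop above:
# save_article(article, soup.select_one('.post-title').text, visible_text)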

Adding in selenium seems to solve the bad-request problem of being blocked:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []

with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])

    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)

all_links = [item for i in all_links for item in i]

d = webdriver.Chrome()

for article in all_links:
    d.get(article)
    soup = bs(d.page_source, 'lxml')
    [t.extract() for t in soup(['style', 'script', '[document]', 'head', 'title'])]
    visible_text = soup.getText()   # taken from https://stackoverflow.com/a/19760007/6241235 @nmgeek

    try:
        print(soup.select_one('.post-title').text)
    except:
        print(article)
        print(soup.select_one('h1').text)
        break #for debugging
    # do something with text
d.quit()
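
If launching a visible browser window for every article is too heavy, Chrome can usually be run headless instead (a sketch, assuming a selenium/ChromeDriver version that accepts the options argument):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
d = webdriver.Chrome(options=options)
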
QHarr
  • Thanks! It works for me but it doesn't go to the next pages – tbone May 27 '19 at 19:02
  • It does for me - if you mean it loops over the articles. If I put print(soup.select_one('.post-title').text) into the loop as the last line I see each post title. – QHarr May 27 '19 at 19:08
  • Or do you mean each blog has multiple pages? – QHarr May 27 '19 at 19:08
  • Yes, the blog has 7 pages – tbone May 27 '19 at 19:12
  • Hi, how can I add to this code a function that would give the 10 most common words with their counts, and the 10 most common words with their counts per author? This code is great but it's different from mine and I don't know how to define these functions – tbone May 28 '19 at 14:16
  • Bag of words maybe? Something optimised for this over the alternative of creating one long string and using split to generate an array - loop, adding to a dictionary with the count as value and the word as key. Don't know if there are limitations on keys. – QHarr May 28 '19 at 14:22 (a sketch follows after these comments)
  • Pretty sure there must be an existing answer related to this on SO or the Code Review site – QHarr May 28 '19 at 14:23
  • I tried to use the existing loops from the OP but it fails – tbone May 28 '19 at 17:19
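
Following up on the comment thread above about the 10 most common words (overall and per author), a minimal sketch with collections.Counter - how the author name would be scraped for the per-author split (e.g. from some author element on each article page) is an assumption, not something shown in the code above:

from collections import Counter, defaultdict

def top_words(words, n=10):
    # words is a list of lowercased, stopword-filtered tokens (like words_ns above)
    return Counter(words).most_common(n)  # list of (word, count) pairs

# per-author counts: map author name -> that author's tokens;
# filling this dict requires scraping an author name per article (assumption)
author_words = defaultdict(list)
# author_words[author].extend(article_tokens)  # call inside the article loop

top_per_author = {author: top_words(tokens) for author, tokens in author_words.items()}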