
I have the following code, which gives me the columns: Authors, Date, Blog name, Link, and blog category.

To enhance this further, I want to add the word count of the article description and of the author bio as two separate columns.

import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from bs4 import BeautifulSoup

# RANGE OF LISTING PAGES TO SCRAPE (np.arange(1, 2) covers page 1 only)
pages = np.arange(1, 2)

# ACCUMULATORS FOR THE SCRAPED FIELDS
author_and_dates_ = []
title_blognames_links_ = []

# LOOP TO RETRIEVE TITLE, BLOG NAMES, LINKS, AUTHORS AND DATE PUBLISHED
for page in pages:

    url = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    sleep(randint(2, 10))  # random pause between listing-page requests
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
     
    # Information on titles, blog names and their links
    for h4 in soup.select("h4"):
        for h2 in soup.select("h2"):
            title_blognames_links_.append((h4.get_text(strip=True), h4.a["href"], h2.get_text(strip=True)[11:]))
        
    # Information on authors and dates
    for tag in soup.find_all(class_="author"):
        author_and_dates_.append(tag.get_text(strip=True))

The updated columns I am trying to achieve are: Authors, Date, Blog name, Link, blog category, description count, and about count.
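For context, here is roughly how I picture assembling the final table from the two lists above. This is only a sketch: it assumes both lists end up with one entry per article, and the two count columns are exactly the part I still need to fill in.

import pandas as pd

# Unpack the (title, link, category) tuples into separate columns
titles, links, categories = zip(*title_blognames_links_)

df = pd.DataFrame({
    "Author and date": author_and_dates_,  # to be split into Authors / Date
    "Blog name": titles,
    "Link": links,
    "Blog category": categories,
})
df["Description count"] = None  # word count of the article body (missing)
df["About count"] = None        # word count of the author bio (missing)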

Example: For the 1st article: https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic

I am trying to get the word count of everything from "Happy" to "today" as my "description count" (and the same idea applies for the "about count").
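In other words, once I have that block of text, the count itself is just a whitespace split:

description = "Happy ... today"  # placeholder for the real scraped text
description_count = len(description.split())  # whitespace-separated word count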

What I've tried

I was able to extract each of the links into the `new` variable below and pull the required text with `div.cf > p`. However, the printed text comes from all the links together; how can I map each paragraph back to the link it came from?

new = []

# LOOP TO COLLECT THE ARTICLE LINKS FROM EACH LISTING PAGE
for page in pages:

    url = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    sleep(randint(2, 10))
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')

    for h4 in soup.select("h4"):
        new.append(h4.a["href"])  # keep only the href

for link in new:
    link_soup = BeautifulSoup(requests.get(link).content, 'html.parser')

    for p in link_soup.select("div.cf > p"):
        print(p.get_text(strip=True))

Even if I add `txt = p.get_text(strip=True)`, I just get the last article's author bio and not all the information.

    Okay. What specific issue are you having with counting the words in the HTML response? – OneCricketeer Sep 07 '21 at 20:51
  • None actually, I am just not able to get it. Even when I do, the count includes text from the `p` tags that I don't need – Swas Sep 07 '21 at 21:54
  • Have you tried to only get the `p` tags that you are interested in? Specifically, `soup.select("div.cf > p")`? – OneCricketeer Sep 07 '21 at 22:03
  • Yes, I had. I segregated the `cf` class using the `p` tag – Swas Sep 07 '21 at 22:14
  • Can you [edit] your post to include that code and what didn't work when you tried to count the words in those sections? – OneCricketeer Sep 07 '21 at 22:18
  • I have updated the iterations – Swas Sep 07 '21 at 22:35
  • There is no `p` with class `cf`, and your first loop goes over all the divs, not the paragraphs. Do you get the expected text with `soup.select("div.cf > p")` like I mentioned? From there, can you [count the words](https://stackoverflow.com/questions/18827198/python-count-number-of-words-in-a-list-strings)? – OneCricketeer Sep 07 '21 at 23:38
  • Also worth pointing out that `https://www.bartonassociates.com/blog/tag/Infographics/p2`, for example, is the page you're scraping, **not** the article linked in the question that includes the text you're referring to – OneCricketeer Sep 07 '21 at 23:41

1 Answer


> Example: For the 1st article

You're not getting the HTML for that article anywhere...

If I understand correctly, you have collected a list of links, but you want the content of the articles those links point to.

Therefore, you need to make a new request and parse the response for each of those links:

# each entry of title_blognames_links_ is a (title, link, category) tuple
for title, link, category in title_blognames_links_:
    link_soup = BeautifulSoup(requests.get(link).content, 'html.parser')

    for p in link_soup.select("div.cf > p"):
        txt = p.get_text(strip=True)

I suggest you define a helper function that accepts a link to an "article" you want to parse and returns the data you expect. Then explicitly test it using the link given in your post, and against other articles.
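For instance, here is a minimal sketch of such a function. `parse_article` is just an illustrative name, and the assumption that the last `div.cf > p` paragraph is the author bio (with everything before it being the description) comes from your own observation and should be verified against the page markup.

import requests
from bs4 import BeautifulSoup

def parse_article(link):
    """Fetch one article page and return (description_count, about_count)."""
    soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    paragraphs = [p.get_text(strip=True) for p in soup.select("div.cf > p")]
    # Assumption: the last paragraph is the author bio, the rest the description
    about = paragraphs[-1] if paragraphs else ""
    description = " ".join(paragraphs[:-1])
    return len(description.split()), len(about.split())

# Test it explicitly against the article linked in the question:
desc_count, about_count = parse_article(
    "https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic")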

From there, see, for example, [Python - Count number of words in a list of strings](https://stackoverflow.com/questions/18827198/python-count-number-of-words-in-a-list-strings).
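In short, for a list of paragraph strings:

texts = ["first paragraph here", "second paragraph"]
total_words = sum(len(t.split()) for t in texts)  # 5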


Worth mentioning that the site may throttle or block you for making lots of requests (requesting all articles in the list in quick succession), and there is little that can be done about that beyond pacing your requests.
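The usual mitigation is simply to pace yourself, the same way your listing loop already does with sleep. For example, reusing the hypothetical parse_article helper sketched above:

from time import sleep
from random import randint

counts = []
for title, link, category in title_blognames_links_:
    sleep(randint(2, 10))               # polite pause between article requests
    counts.append(parse_article(link))  # (description_count, about_count)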

OneCricketeer
  • Hey, just so that you are aware, I have made the changes in my actual code to show the updates on the script. – Swas Sep 08 '21 at 14:22