I have the following code, which gives me the columns Authors, Date, Blog name, Link, and Blog category.
To enhance this further, I want to add the word count of the article description and of the author bio as two separate columns.
import requests
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from bs4 import BeautifulSoup

# VARIABLE TO DEFINE A RANGE BASED ON NO. OF PAGES
pages = np.arange(1, 2)

# DEFINING CUSTOM VARIABLES
author_and_dates_ = []
title_blognames_links_ = []

# LOOP TO RETRIEVE TITLE, BLOG NAMES, LINKS, AUTHORS AND DATE PUBLISHED
for page in pages:
    page = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    sleep(randint(2, 10))
    soup = BeautifulSoup(requests.get(page).content, 'html.parser')

    # Information on titles, blog names and their links
    for h4 in soup.select("h4"):
        for h2 in soup.select("h2"):
            title_blognames_links_.append((h4.get_text(strip=True), h4.a["href"], h2.get_text(strip=True)[11:]))

    # Information on authors and dates
    for tag in soup.find_all(class_="author"):
        author_and_dates_.append(tag.get_text(strip=True))
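For context, assuming the two lists line up row for row, they could be combined into the current columns with pandas. The sample rows below are made up to show the shape; they are not real scraped values:

```python
import pandas as pd

# Hypothetical sample data in the same shape the loops above produce.
title_blognames_links_ = [("Can an NP Do That?", "https://example.com/post", "Infographics")]
author_and_dates_ = ["By Jane Doe on January 1, 2021"]

# One column per tuple element, plus the author/date string alongside.
df = pd.DataFrame(title_blognames_links_, columns=["Blog name", "Link", "Blog category"])
df["Author and date"] = author_and_dates_
```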
The updated columns I am trying to achieve are: Authors, Date, Blog name, Link, Blog category, Description count, About count.
Example: for the first article, https://www.bartonassociates.com/blog/updated-can-an-np-do-that-infographic,
I am trying to get the word count of everything from "Happy" to "today" as my "description count" (with the same idea for the "about count").
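Once the paragraphs of a page are joined into one string, the word count itself can be taken as len(text.split()). A minimal self-contained sketch, using made-up HTML in roughly the shape of the article page (the real pages would come from requests.get):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one article page's HTML.
html = "<div class='cf'><p>Happy reading the infographic today</p></div>"

soup = BeautifulSoup(html, "html.parser")
# Join the stripped text of every matching paragraph, then count words.
text = " ".join(p.get_text(strip=True) for p in soup.select("div.cf > p"))
word_count = len(text.split())
```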
Tried Solution
I was able to collect each of the article links in the 'new' variable and extract the required text for "div.cf > p". However, the printed text covers all the links together; how can I map each paragraph back to its respective link?
new = []

# LOOP TO COLLECT THE LINK FOR EACH ARTICLE
for page in pages:
    page = "https://www.bartonassociates.com/blog/tag/Infographics/p" + str(page)
    sleep(randint(2, 10))
    soup = BeautifulSoup(requests.get(page).content, 'html.parser')
    for h4 in soup.select("h4"):
        new.append(h4.a["href"])  # keep only the link; the title isn't needed here

for link in new:
    link_soup = BeautifulSoup(requests.get(link).content, 'html.parser')
    for p in link_soup.select("div.cf > p"):
        print(p.get_text(strip=True))
Even if I assign txt = p.get_text(strip=True), I end up with only the last article's author bio and not all the information.
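One way to keep each paragraph tied to the link it came from is to build a dictionary keyed by link instead of printing inside the loop. A minimal sketch, assuming the same "div.cf > p" selector; the `fetch` callable here is a hypothetical stand-in for `lambda link: requests.get(link).content`, so the sketch stays self-contained:

```python
from bs4 import BeautifulSoup

def paragraph_text(html, selector="div.cf > p"):
    # Join the stripped text of every matching paragraph on one page.
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.select(selector))

def word_counts_by_link(links, fetch):
    # Keyed by link, so each count stays tied to the article it came from.
    return {link: len(paragraph_text(fetch(link)).split()) for link in links}
```

The same dict-by-link pattern works for the raw text as well as the counts, so the results can later be merged with the other columns on the Link value.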