Suppose I have a directory of .html files -- each of which is structured identically, although each has different content within the tags. Each .html file is essentially a news article, from which I use BeautifulSoup to extract the date, author(s), article text, source, and word count.
The code I posted below is what I have developed to achieve this, and it seems to work fine for a single file.
However, I need to accomplish two things: first, I need the script to batch-process an entire directory of .html files instead of opening them one at a time. Second, I need to append all the extracted data to a pandas DataFrame (which I will eventually write to a .csv).
For context, I have roughly 3,000 .html files (news articles) to process.
Any help with this would be much appreciated! Thanks for your time.
import pandas as pd
from bs4 import BeautifulSoup

# Parse a single article; every file uses the same structure and selectors
with open("test.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

date = soup.select('span.display-date')[0].text.strip()
title = soup.select('h1.document-view__title')[0].text.strip()
author = soup.select('span.author')[0].text.strip()
source = soup.select('span.source')[0].text.strip()
wordcount = soup.select('span.word-count')[0].text.strip()

# The article body is split across <p> tags inside the body div;
# join them all so the full text is captured, not just the first paragraph
article_divs = soup.find_all('div', attrs={"class": "document-view__body document-view__body--ascii"})
article = '\n'.join(p.text.strip() for div in article_divs for p in div.find_all('p'))
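
For reference, this is roughly the shape I imagine the batch version taking: loop over every .html file in the directory, run the same extraction on each, collect one dict per article, and build the DataFrame at the end. This is only an untested sketch -- the directory name 'articles' and the column names are placeholders I made up:

import glob
import pandas as pd
from bs4 import BeautifulSoup

rows = []
# 'articles' is a placeholder for the directory holding the ~3,000 .html files
for path in glob.glob('articles/*.html'):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    article_divs = soup.find_all('div', attrs={"class": "document-view__body document-view__body--ascii"})
    rows.append({
        'file': path,
        'date': soup.select('span.display-date')[0].text.strip(),
        'title': soup.select('h1.document-view__title')[0].text.strip(),
        'author': soup.select('span.author')[0].text.strip(),
        'source': soup.select('span.source')[0].text.strip(),
        'wordcount': soup.select('span.word-count')[0].text.strip(),
        'article': '\n'.join(p.text.strip() for div in article_divs for p in div.find_all('p')),
    })

# Build the DataFrame once at the end rather than appending row by row
df = pd.DataFrame(rows)
df.to_csv('articles.csv', index=False)

Is collecting the rows in a list and constructing the DataFrame once at the end the right approach for ~3,000 files, or is there a better pattern?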