
Suppose I have a directory of .html files, each structured identically but with different content inside the tags. Each .html file is essentially a news article, from which I use BeautifulSoup to extract the date, author(s), article text, source, and word count.

The code I posted below is what I have developed to achieve this and seems to work fine.

However, I need to accomplish two things: first, I need the script to be able to batch process an entire directory of .html files instead of opening one at a time. Second, I need to append all the extracted data into a pandas data frame (that I will eventually write to a .csv).

For context, I have roughly 3,000 .html files (news articles) to process.

Any help with this would be much appreciated! Thanks for your time.

import pandas as pd
from bs4 import BeautifulSoup

# parse a single article
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# publication date and title
date = soup.select('span.display-date')[0].text.strip()

title = soup.select('h1.document-view__title')[0].text.strip()

# article body: print the first paragraph of each body div
article = soup.find_all('div', attrs={"class": "document-view__body document-view__body--ascii"})
for x in article:
    print(x.find('p').text)

# author, source, and word count
author = soup.select('span.author')[0].text.strip()

source = soup.select('span.source')[0].text.strip()

wordcount = soup.select('span.word-count')[0].text.strip()
youngguv

2 Answers


I can't guess exactly what you want to extract without example data, but the general approach is this:

import glob
import pandas as pd
from bs4 import BeautifulSoup

# parse every .html file in the current directory and collect one DataFrame per file
pandas_list = []
for filename in glob.glob('*.html'):
    with open(filename) as f:
        soup = BeautifulSoup(f, "html.parser")
    new_data_frame = process_soup(soup)
    pandas_list.append(new_data_frame)

# stack the per-file frames into one DataFrame
final_data_frame = pd.concat(pandas_list)
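Since you mention eventually writing the result to a .csv, one extra line at the end takes care of that (articles.csv is just a placeholder filename):

final_data_frame.to_csv('articles.csv', index=False)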

I'll leave the following to you as a homework assignment:

def process_soup(s):
    data = {'author': s.select('span.author')[0].text.strip(),
            'source': s.select('span.source')[0].text.strip()}
    return pd.DataFrame(data, index=[0])

Complete it with whatever else you want to extract.
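If it helps, a filled-in version might look like the sketch below. It reuses the selectors from your snippet; the 'text' column (body paragraphs joined into one string) is just one way to carry the article body along and is my assumption about how you want it stored.

def process_soup(s):
    # pull the article body paragraphs and join them into a single string
    body_divs = s.find_all('div', attrs={"class": "document-view__body document-view__body--ascii"})
    text = ' '.join(p.text.strip() for div in body_divs for p in div.find_all('p'))
    data = {'date': s.select('span.display-date')[0].text.strip(),
            'title': s.select('h1.document-view__title')[0].text.strip(),
            'author': s.select('span.author')[0].text.strip(),
            'source': s.select('span.source')[0].text.strip(),
            'wordcount': s.select('span.word-count')[0].text.strip(),
            'text': text}
    # one row per article
    return pd.DataFrame(data, index=[0])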

CJR
  • Hello, here is a sample of my data. https://pastebin.com/ZAyTFYfW -- I want to extract the author, article title, date, source, and article text. I am able to do that with the code I posted above; I am just not sure how to iterate over every file in my directory and then assign the author variable to something like df['author'] for the pandas data frame. – youngguv Aug 28 '19 at 00:01
  • This worked! I just had to replace 's' with 'soup' in your code. – youngguv Aug 28 '19 at 00:17

Use pool.map to incorporate multithreading into your logic.

In this example, a pool of 10 threads is created. You can increase the number based on your machine's specs.

Also note that I couldn't work out the structure of the article field, so it is left out here, but that is immaterial to the general approach.

from multiprocessing.dummy import Pool as ThreadPool
from bs4 import BeautifulSoup
import pandas as pd
from os import walk
from os.path import join

# a pool of 10 worker threads
pool = ThreadPool(10)

# update: collect every .html file in a directory instead of feeding filenames to the script by hand
htmls = []
for root, dirs, files in walk('./directory_containing_html_files'):
    for file in files:
        if file.endswith('.html'):
            htmls.append(join(root, file))

# htmls = [
#   'file1.html',
#   'file2.html',
#   'file3.html'
#    ...
#   ]

data_list = []

def crawl_html(html_file):
    # parse one file and append its row to the shared list
    with open(html_file) as f:
        soup = BeautifulSoup(f, 'html.parser')
    data_list.append({
        'date':      soup.select('span.display-date')[0].text.strip(),
        'title':     soup.select('h1.document-view__title')[0].text.strip(),
        'author':    soup.select('span.author')[0].text.strip(),
        'source':    soup.select('span.source')[0].text.strip(),
        'wordcount': soup.select('span.word-count')[0].text.strip(),
    })

# each thread handles one file at a time
pool.map(crawl_html, htmls)
pool.close()
pool.join()

df = pd.DataFrame(data_list, columns=['date', 'title', 'author', 'source', 'wordcount'])
print(df)
Samha'
  • In this solution, do I have to specify every single one of my .html filenames? – youngguv Aug 28 '19 at 00:58
  • You can give Python a directory and scan the files inside it; I will update my answer with an example – Samha' Aug 28 '19 at 00:59
  • This might seem like it processes multiple files at a time, but because of the GIL in Python, it will be processing them one at a time. You could instead try spawning separate processes and compare the performance of both. – Arjun Sankarlal Jun 04 '20 at 17:48
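For anyone who wants to try the suggestion in that last comment, here is a minimal sketch using a real process pool; it assumes the same placeholder directory name as the answer above. Because separate processes don't share data_list, crawl_html has to return its row instead of appending to a shared list.

import glob
from multiprocessing import Pool
from bs4 import BeautifulSoup
import pandas as pd

def crawl_html(html_file):
    # parse one file and return its row as a dict
    with open(html_file) as f:
        soup = BeautifulSoup(f, 'html.parser')
    return {
        'date':      soup.select('span.display-date')[0].text.strip(),
        'title':     soup.select('h1.document-view__title')[0].text.strip(),
        'author':    soup.select('span.author')[0].text.strip(),
        'source':    soup.select('span.source')[0].text.strip(),
        'wordcount': soup.select('span.word-count')[0].text.strip(),
    }

if __name__ == '__main__':
    # find all .html files under the directory (recursively)
    htmls = glob.glob('./directory_containing_html_files/**/*.html', recursive=True)
    # 10 worker processes; each one parses files independently
    with Pool(10) as pool:
        rows = pool.map(crawl_html, htmls)
    df = pd.DataFrame(rows)
    print(df)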