Rename HTMLs based on DIV

Question

Although I studied the earlier questions (How to rename a file using Python), it is still not clear to me how to rename all the HTML files in folder x based on the H1 inside the div in each HTML file.

<div id="page_header" class="page_header_email_alerts">
    <h1>
        <span itemprop="headline">Redhill Biopharma's (RDHL) CEO Dror Ben Asher on Q4 2014 Results - Earnings Call Transcript</span>
    </h1>
</div>

Does anyone have a suggestion? I put together a solution with bs4, but it does not loop through all my HTML files:

import os
from bs4 import BeautifulSoup
import textwrap

directory ='C:/Research syntheses - Meta analysis/SeekingAlpha/test/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory,filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(),'html.parser')
            headline = soup.find(itemprop='headline').text
os.rename(filename, headline+'.html')

It seems that you want a dynamic website which has different articles. However, it is not good practice to edit your `.html` files for a website while in production. Instead, make a web server that dynamically pulls the articles from a database and serves them based on the url entered. The html file would be rendered on the server using the database, and delivered to the user as an html file. This eliminates the need for writers to deal with html files, since you would only need to interact with the database to add articles. I would recommend Node.js. Sorry for the formatting, I am on mobile. — , Jan 30 '20 at 16:26

EyveG · Accepted Answer · 2020-01-30T21:14:01.517

Substantial edits based on comment:

It sounds like you have a folder full of HTML files and want them named based on the headline of the article in the file.

I would use the Beautiful Soup library to parse the HTML of each file, something like:

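import os
from bs4 import BeautifulSoup

# this assumes that the path to the folder is stored in the variable `directory`
for filename in os.listdir(directory):
    if filename.endswith(".html"):
        # os.listdir returns bare names, so build the full path first
        old_path = os.path.join(directory, filename)
        # read and parse the file; the with block closes it before the rename
        with open(old_path, 'r') as html:
            soup = BeautifulSoup(html.read(), 'html.parser')
        headline = soup.find(itemprop='headline').text.strip()
        # rename inside the same folder, once per file
        os.rename(old_path, os.path.join(directory, headline + '.html'))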
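
One thing the loop above does not handle: a headline can contain characters that are not allowed in file names, such as `:`, `/` or `?` on Windows, and `os.rename` will fail on those. Below is a minimal sketch of a clean-up step using only the standard library. The name `safe_filename`, the `max_length` default of 150 and the `'untitled'` fallback are my own choices for illustration, not anything from your setup.

```python
import re

# Characters that Windows does not allow in file names.
_INVALID_CHARS = r'[\\/:*?"<>|]'


def safe_filename(headline, max_length=150):
    """Turn a headline into a string usable as a file name (hypothetical helper)."""
    # collapse newlines, tabs and repeated spaces into single spaces
    name = re.sub(r'\s+', ' ', headline).strip()
    # replace every character Windows rejects with an underscore
    name = re.sub(_INVALID_CHARS, '_', name)
    # cut overly long names; Windows also rejects a trailing dot or space
    name = name[:max_length].rstrip(' .')
    # never return an empty name
    return name or 'untitled'
```

You would then build the new name with `safe_filename(headline) + '.html'` instead of `headline + '.html'`.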

There are some assumptions in this code too: that every file has a tag with itemprop="headline", and that only the headline carries that attribute. If either assumption doesn't hold, you'll need to use other methods from Beautiful Soup to find the headline. That means finding some feature that is always the same about the headline tag and searching by that.
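
As a rough sketch of what "searching by another feature" could look like: the function below first tries the itemprop attribute and, if that is missing, falls back to the first h1 inside the div with id "page_header" from your snippet. The name `find_headline` is made up for this example, and the fallback selector is only a guess that all your files share that div. It returns None when nothing matches, so the caller can skip the file instead of crashing.

```python
from bs4 import BeautifulSoup


def find_headline(html_text):
    """Return the headline text, or None if no candidate tag exists (hypothetical helper)."""
    soup = BeautifulSoup(html_text, 'html.parser')
    # first choice: the tag marked with itemprop="headline"
    tag = soup.find(itemprop='headline')
    if tag is None:
        # fallback (assumption): the first <h1> inside <div id="page_header">
        tag = soup.select_one('div#page_header h1')
    if tag is None:
        return None
    # normalise whitespace: line breaks and indentation become single spaces
    return ' '.join(tag.get_text().split())
```

In the loop you would call `headline = find_headline(html.read())` and `continue` to the next file when it returns None.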

For more details on how to use Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
