BeautifulSoup on multiple .html files

Question

I'm trying to extract the text between fixed tags with BeautifulSoup, following the model suggested in a linked post.

I have a lot of .html files in my folder and I want to save the results obtained with a BeautifulSoup script into another folder as individual .txt files. These .txt files should have the same names as the original files but contain only the extracted content. The script I wrote (see below) processes the files successfully but does not write the extracted text out to individual files.

import os
import glob
from bs4 import BeautifulSoup

dir_path = "C:My_folder\\tmp\\"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    my_data = (file_name)
    soup = BeautifulSoup(open(my_data, "r").read())
    for i in soup.select('font[color="#FF0000"]'):
        print(i.text)
        file_path = os.path.join(dir_path, file_name)
        text = open(file_path, mode='r').read()
        results = i.text
        results_dir = "C:\\My_folder\\tmp\\working"
        results_file = file_name[:-4] + 'txt'
        file_path = os.path.join(results_dir, results_file)
        open(file_path, mode='w', encoding='UTF-8').write(results)

Martijn Pieters · Accepted Answer · 2019-09-26T14:33:00.533

glob.glob() returns full paths, so file_name already includes the directory. You are also re-opening the output file for each font element you find, which replaces the contents of the file every time. Move the opening of the file outside the loop. You should also use files as context managers (the with statement) so that they are closed properly:

import glob
import os.path
from bs4 import BeautifulSoup

dir_path = r"C:\My_folder\tmp"
results_dir = r"C:\My_folder\tmp\working"

for file_name in glob.glob(os.path.join(dir_path, "*.html")):
    with open(file_name) as html_file:
        # Name the parser explicitly so bs4 does not have to guess one
        soup = BeautifulSoup(html_file, "html.parser")

    # file_name is a full path, so keep only its base filename for the output file
    results_file = os.path.splitext(os.path.basename(file_name))[0] + '.txt'
    with open(os.path.join(results_dir, results_file), 'w') as outfile:
        for i in soup.select('font[color="#FF0000"]'):
            print(i.text)
            outfile.write(i.text + '\n')
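
The overwrite problem this answer describes can be reproduced without BeautifulSoup. The following is a minimal sketch, assuming a hypothetical list of strings (snippets) standing in for the text of each matched font element and a temporary file in place of the real output file:

```python
import os
import tempfile

# Hypothetical stand-ins for the text of each matched <font> element.
snippets = ["first", "second", "third"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    # Opening with mode 'w' inside the loop truncates the file every time,
    # so only the last snippet survives.
    for text in snippets:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text + "\n")
    with open(path, encoding="utf-8") as f:
        reopened_each_time = f.read()

    # Opening once and writing inside the loop keeps every snippet.
    with open(path, "w", encoding="utf-8") as f:
        for text in snippets:
            f.write(text + "\n")
    with open(path, encoding="utf-8") as f:
        opened_once = f.read()

print(repr(reopened_each_time))  # 'third\n'
print(repr(opened_once))         # 'first\nsecond\nthird\n'
```

Opening in append mode ('a') inside the loop would also keep every snippet, but it would add to whatever an earlier run left in the file, so opening once outside the loop is usually the simpler choice.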

@meshfields thanks for pointing that out; I must’ve forgotten to join the base filename to it. — Martijn Pieters, Sep 26 '19 at 14:33
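
The base-filename point raised in the comment above comes down to how os.path.join treats an absolute second argument. Here is a minimal sketch, assuming a hypothetical input file named page.html; it uses the ntpath module (the implementation behind os.path on Windows) so that the Windows path rules apply on any operating system:

```python
import ntpath  # Windows path rules; os.path is this module when running on Windows

results_dir = r"C:\My_folder\tmp\working"
file_name = r"C:\My_folder\tmp\page.html"  # hypothetical full path, as glob returns it

# Joining a full path onto results_dir discards results_dir entirely,
# because an absolute component resets the join.
full = ntpath.splitext(file_name)[0] + ".txt"
wrong = ntpath.join(results_dir, full)

# Reducing the path to its base filename first keeps the output in results_dir.
base = ntpath.splitext(ntpath.basename(file_name))[0] + ".txt"
right = ntpath.join(results_dir, base)

print(wrong)  # C:\My_folder\tmp\page.txt
print(right)  # C:\My_folder\tmp\working\page.txt
```

Without the basename step, the .txt file silently lands next to the original .html file instead of in the results folder.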

score 1 · Answer 2 · answered Dec 24 '14 at 14:49

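import glob
import os
from bs4 import BeautifulSoup  # bs4 replaces the old Python 2-only BeautifulSoup 3 package

input_dir = "/home/infogrid/Desktop/Work/stack_over/input/"
# The output directory must already exist on the system.
output_dir = "/home/infogrid/Desktop/Work/stack_over/output/"

for file_name in glob.glob(os.path.join(input_dir, "*.html")):
    with open(file_name) as fp:
        soup = BeautifulSoup(fp, "html.parser")
    # Build the output path from the base filename, swapping .html for .txt
    base_name = os.path.splitext(os.path.basename(file_name))[0]
    results_file = os.path.join(output_dir, base_name + ".txt")
    # Collect the text of every font element whose color attribute is #FF0000
    tmp = [i.text for i in soup.find_all("font") if i.get("color") == "#FF0000"]
    with open(results_file, "w") as out:
        print("\n".join(tmp))
        out.write("\n".join(tmp))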

This works as well. I like that there is more than one way to perform the same task. Thanks for your suggestion. It's a shame there is no option on Stack Overflow to accept both solutions. — user3635159, Dec 24 '14 at 18:12
