
I have a directory of 46 downloaded HTML files, and I am attempting to iterate through each of them, read their contents, strip the HTML, and append only the text to a text file. However, I'm unsure where I'm going wrong, as nothing gets written to my text file.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
        markup = (path)
        soup = BeautifulSoup(markup)
        with open("example.txt", "a") as myfile:
                myfile.write(soup)
                f.close()

-----update-----

I've updated my code as below; however, the text file still doesn't get created.

import os
import glob
from bs4 import BeautifulSoup
path = "/"
for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

-----update 2-----

Ah, I caught that I had my directory incorrect, so now I have:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(markup)
    with open("example.txt", "a") as myfile:
        myfile.write(soup)
        myfile.close()

When this is executed, I get this error:

Traceback (most recent call last):
  File "C:\Users\Me\Downloads\bsoup.py", line 11, in <module>
    myfile.write(soup)
TypeError: must be str, not BeautifulSoup

I fixed this last error by changing

myfile.write(soup)

to

myfile.write(soup.get_text())
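The TypeError arose because file.write() expects a string, while soup is a BeautifulSoup object; get_text() returns the document's text with all markup stripped. A minimal illustration (the snippet and the explicit "html.parser" argument are mine, not from the question):

```python
from bs4 import BeautifulSoup

# Parse a tiny document; "html.parser" is Python's built-in parser,
# so no extra install (like lxml) is needed.
soup = BeautifulSoup("<html><body><p>Hello, world!</p></body></html>", "html.parser")

print(type(soup).__name__)  # BeautifulSoup -- not a str, so write(soup) fails
print(soup.get_text())      # Hello, world!
```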

-----update 3-----

It's working properly now; here's the working code:

import os
import glob
from bs4 import BeautifulSoup

path = "c:\\users\\me\\downloads\\"

for infile in glob.glob(os.path.join(path, "*.html")):
    markup = (infile)
    soup = BeautifulSoup(open(markup, "r").read())
    with open("example.txt", "a") as myfile:
        myfile.write(soup.get_text())
        myfile.close()
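For completeness, the loop above can be tightened: a with block already closes its file, so the explicit close() is redundant, and opening each HTML file in its own with block avoids leaking file handles. A sketch of the same logic, wrapped in a hypothetical helper function (collect_text is my name, not from the question):

```python
import glob
import os
from bs4 import BeautifulSoup

def collect_text(src_dir, out_path):
    """Append the tag-stripped text of every .html file in src_dir to out_path."""
    for infile in glob.glob(os.path.join(src_dir, "*.html")):
        # Read and parse each downloaded page.
        with open(infile, "r") as html_file:
            soup = BeautifulSoup(html_file.read(), "html.parser")
        # The with block closes the file -- no explicit close() needed.
        with open(out_path, "a") as myfile:
            myfile.write(soup.get_text())

collect_text("c:\\users\\me\\downloads\\", "example.txt")
```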
  • What exactly is `f`? Seems like you used to open the HTML files before (which you should do), but then changed the code. Also, you are not stripping the HTML. – Lev Levitsky Apr 26 '13 at 20:08
  • I meant to write 'myfile.close()' - sorry, I can't seem to figure this one out. Is my 'infile in glob.glob(os.path.join(path, "*.html"):' line correct? That will iterate through the directory, right? –  Apr 26 '13 at 20:17
  • That part seems correct except for the missing closing bracket. – Lev Levitsky Apr 26 '13 at 20:18
  • And soup = BeautifulSoup(markup) is what strips the HTML, I thought? –  Apr 26 '13 at 20:20
  • That should create a BeautifulSoup object, which contains parsed HTML tree and handy methods for accessing the data. But you are not creating it correctly, you need to open the file and give it the file object, as in the answer below. – Lev Levitsky Apr 26 '13 at 20:28
  • Ah, is specifying 'lxml' necessary? In other words, I didn't install lxml, should I? –  Apr 26 '13 at 20:38
  • No, it's not necessary – Lev Levitsky Apr 26 '13 at 20:55

2 Answers


Actually, you are not reading the HTML file; this should work:

soup=BeautifulSoup(open(webpage,'r').read(), 'lxml')
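A slightly fuller sketch of the same idea: BeautifulSoup also accepts an open file object directly, and a with block closes the handle afterwards. The sample file and the "html.parser" argument here are illustrative, not part of the answer:

```python
from bs4 import BeautifulSoup

# Create a small sample file so the snippet is self-contained.
with open("page.html", "w") as f:
    f.write("<html><body><h1>Title</h1><p>Body text.</p></body></html>")

# Pass the open file object straight to BeautifulSoup; no manual read() needed.
with open("page.html") as f:
    soup = BeautifulSoup(f, "html.parser")

print(soup.get_text())
```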
Moj

If you want to use lxml.html directly, here is a modified version of some code I've been using for a project. If you want to grab all the text, just don't filter by tag. There may be a way to do it without iterating, but I don't know one. It saves the data as unicode, so you will have to take that into account when opening the output file.

import os
import glob

import lxml.html

path = '/'

# Whatever tags you want to pull text from.
visible_text_tags = ['p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
                     'h5', 'h6', 'a', 'div', 'span']

for infile in glob.glob(os.path.join(path, "*.html")):
    doc = lxml.html.parse(infile)

    file_text = []

    for element in doc.iter(): # Iterate once through the entire document

        try:  # Grab tag name and text (+ tail text)
            tag = element.tag
            text = element.text
            tail = element.tail
        except AttributeError:  # skip nodes without these attributes
            continue

        words = None # text words split to list
        if tail: # combine text and tail
            text = text + " " + tail if text else tail
        if text: # lowercase and split to list
            words = text.lower().split()

        if tag in visible_text_tags:
            if words:
                file_text.append(' '.join(words))

    with open('example.txt', 'a') as myfile:
        myfile.write(' '.join(file_text).encode('utf8'))
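If lxml isn't available, a similar tag-filtered extraction can be sketched with the standard library's html.parser. TextCollector and the sample markup below are my own illustration (a rough equivalent of the visible-tag filtering above, keeping only text whose nearest enclosing tag is in the allow-list), not code from the answer:

```python
from html.parser import HTMLParser

# Tags whose text we keep, mirroring visible_text_tags above.
VISIBLE = {'p', 'li', 'td', 'h1', 'h2', 'h3', 'h4',
           'h5', 'h6', 'a', 'div', 'span'}

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.chunks = []  # collected, normalized text pieces

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when the nearest enclosing tag is "visible".
        if self.stack and self.stack[-1] in VISIBLE and data.strip():
            self.chunks.append(' '.join(data.lower().split()))

parser = TextCollector()
parser.feed("<div><h1>Header</h1><script>var x;</script><p>Some text.</p></div>")
print(' '.join(parser.chunks))  # header some text.
```

Note that the script contents are dropped because 'script' is not in the allow-list, much as the lxml version only appends text for tags in visible_text_tags.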
poof