Python - beautifulsoup, apply in every text file in folder and produce new text file

Question

I am using the following Python - Beautifulsoup code to remove html elements from a text file:

from bs4 import BeautifulSoup

with open("textFileWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read())

with open("strip_textFileWithHtml.txt", "w") as f: 
    f.write(soup.get_text().encode('utf-8'))

The question I have is how can I apply this code to every text file in a folder(directory), and for each text file produce a new text file which is processed and where the html elements etc. are removed, without having to call the function for each and every text file?

score 2 · Answer 1 · answered Feb 19 '15 at 14:06

The glob module lets you list all the files in a directory:

import glob
for path in glob.glob('*.txt'):
    with open(path) as markup:
        soup = BeautifulSoup(markup.read())

    with open("strip_" + path, "w") as f: 
        f.write(soup.get_text().encode('utf-8'))

If you want to also do that for every subfolder recursively, check out os.walk

score 1 · Accepted Answer · answered Feb 19 '15 at 14:01

1

I would leave that work to the OS, simply replace the hardcoded input file with input from external source, in argv array, and invoke the script inside a loop or with a regular expression that matches many files, like:

from bs4 import BeautifulSoup
import sys

for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(markup.read())

    with open("strip_" + fi, "w") as f: 
        f.write(soup.get_text().encode('utf-8'))

And run it like:

python script.py *.txt

answered Feb 19 '15 at 14:01

Birei

35,723
2
77
82

Thank you, seems to be working. If the .txt files are not in the same folders, would something like python script.py 'path-to-folder/*.txt' work? – adrCoder Feb 19 '15 at 14:17
Not now, because the prefix would be inserted at the beginning of the full path. You can solve it appending it at the end, like `with open(fi + "_strip", "w") as f:`. The alternative is to handle it inside the script, doing a partition of the path from the file, add the prefix and join it again. There is a standard `python` module to handle it. – Birei Feb 19 '15 at 14:23

Python - beautifulsoup, apply in every text file in folder and produce new text file

2 Answers2

Linked