I am using the following Python code with BeautifulSoup to remove HTML elements from a text file:

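from bs4 import BeautifulSoup

# Read the file and parse it; naming the parser avoids bs4's "no parser specified" warning.
with open("textFileWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read(), "html.parser")

# Write the extracted text as UTF-8 (on Python 3, write str rather than encoded bytes).
with open("strip_textFileWithHtml.txt", "w", encoding="utf-8") as f:
    f.write(soup.get_text())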

How can I apply this code to every text file in a folder (directory), so that each input file produces a new text file with the HTML elements removed, without having to run the code by hand for each and every file?

adrCoder

2 Answers

The glob module lets you list all the files in a directory that match a pattern:

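import glob
from bs4 import BeautifulSoup

# Process every .txt file in the current directory.
for path in glob.glob('*.txt'):
    with open(path) as markup:
        soup = BeautifulSoup(markup.read(), "html.parser")

    # Write the text-only version next to the original, as UTF-8 (Python 3).
    with open("strip_" + path, "w", encoding="utf-8") as f:
        f.write(soup.get_text())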

If you also want to do that for every subfolder recursively, check out os.walk.
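
A minimal sketch of the recursive variant, in case it helps. The helper name `find_text_files` and its `suffix` / `skip_prefix` parameters are made up for illustration, and skipping files that already start with "strip_" is my own assumption so that a second run does not re-process earlier output:

```python
import os


def find_text_files(root, suffix=".txt", skip_prefix="strip_"):
    """Yield the path of every matching file under root, subfolders included.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # Skip output files from an earlier run (assumed "strip_" prefix).
            if name.endswith(suffix) and not name.startswith(skip_prefix):
                yield os.path.join(dirpath, name)
```

Each path it yields can then go through the same open / BeautifulSoup / write steps as in the loop above, for example `for path in find_text_files("."):`.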

Arthur

I would leave that work to the shell: replace the hardcoded input file with file names taken from the command line (the sys.argv list), and invoke the script inside a shell loop or with a wildcard pattern that matches many files, like:

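from bs4 import BeautifulSoup
import sys

# Every command-line argument after the script name is treated as an input file.
for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(markup.read(), "html.parser")

    # Write the text-only version as UTF-8 (Python 3).
    with open("strip_" + fi, "w", encoding="utf-8") as f:
        f.write(soup.get_text())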

And run it like:

python script.py *.txt
Birei
  • Thank you, this seems to be working. If the .txt files are not in the same folder as the script, would something like python script.py 'path-to-folder/*.txt' work? – adrCoder Feb 19 '15 at 14:17
  • Not as written, because the prefix would be inserted at the beginning of the full path. You can work around it by appending a suffix at the end instead, like `with open(fi + "_strip", "w") as f:`. The alternative is to handle it inside the script: split the path into directory and file name, add the prefix to the file name, and join them again. There is a standard `python` module to handle it. – Birei Feb 19 '15 at 14:23
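
To make the split / prefix / join idea from the last comment concrete, here is a small sketch. It assumes the standard module meant there is `os.path` (the comment does not name it), and `stripped_path` is a hypothetical helper name:

```python
import os


def stripped_path(path, prefix="strip_"):
    """Return path with prefix added to the file name, not to the folder part.

    Hypothetical helper; assumes os.path is the module the comment refers to.
    """
    folder, filename = os.path.split(path)
    return os.path.join(folder, prefix + filename)
```

In the script this would replace `"strip_" + fi` with `stripped_path(fi)`. Also note that quoting the pattern stops the shell from expanding it, so the script would receive the literal string; pass `path-to-folder/*.txt` unquoted.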