Convert HTML code in .txt files into plain text

Question

I have a folder with several hundred .txt files that contain HTML code. All the file names and file paths are stored in a .csv file. I would like to convert the HTML code in each of the .txt files into plain text and save the file again.

I read that html2text is a Python script that would fit my needs.

Could you help me with how to proceed?

main.py

from csv import DictReader
import requests
from bs4 import BeautifulSoup
import html2text

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.content
        h = html2text.HTML2Text()

Test.csv

| FilePath,File | 
| -------- | 

| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile| 

| file:///C:/Users/UserUser/Desktop/Files/SecondFile.txt,SecondFile|

So what's the problem? Looks good to me. Are you getting any errors? — fsimonjetz, Jun 13 '21 at 11:39
Thank you for the comment. I am not sure how to save the converted text into a .txt file with the same name as the original one — kra, Jun 13 '21 at 13:44

Edo Akse · Accepted Answer · 2021-06-15T09:01:26.407

Updated answer:

After some discussion in the comments below, my original answer isn't going to cut it.

The structure of the file Test.csv is not something that DictReader from the csv module can parse. This is easily solved by writing a simple file parser.

The part below the two functions has not changed much. Instead of iterating over the results of DictReader from the csv module, we iterate over the results of the function readcsv.
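
To illustrate why DictReader struggles here, the following is a small check, assuming the pipe characters are literally present in the file as shown in the question. The sample string is copied from the question; nothing else about the real file is known.

```python
import csv
import io

# Sample lines copied from the question's Test.csv, pipes included.
# Whether the real file contains the pipes is an assumption.
sample = (
    "| FilePath,File | \n"
    "| -------- | \n"
    "\n"
    "| file:///C:/Users/UserUser/Desktop/Files/FirstFile.txt,FirstFile| \n"
)

reader = csv.DictReader(io.StringIO(sample))
fieldnames = reader.fieldnames  # header keys keep the pipes and spaces
rows = list(reader)             # the separator line is read as a data row

print(fieldnames)
print("FilePath" in fieldnames)
```

With this layout the header keys carry the pipe characters, so a lookup such as row['FilePath'] raises a KeyError, and the "| -------- |" line is treated as data.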

updated code:

import html2text


h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False

def cleanline(instring: str) -> list:
    """
    removes the offending crap and returns a list of strings
    """
    return instring.replace('|', '').replace('file:///', '').strip().split(',')


def readcsv(filename: str) -> list:
    """
    read the CSV file and create a list of dict items based on it.
    the result will be similar to what DictReader from the CSV module did, 
    but tailored to the specific file formatting that you are processing.
    """
    result = []
    with open(filename) as csv_infile:
        # get headers & clean the line
        header_list = cleanline(csv_infile.readline())
        
        # skip the line "| -------- |" by just reading the line and not processing it
        # note that this is not actually needed as the logic below
        # only handles lines that contain a comma character
        csv_infile.readline()

        # process the rest of the lines
        for line in csv_infile:
            # the check below is to check if it's an empty line or not
            # (by looking for the comma separator)
            if ',' in line:
                # basically I use the header_list to turn the current line
                # into a dict and add it to the result list
 
                # set/reset values
                line_list = cleanline(line)
                line_dict = {}
 
                # use the index to get the header from the headerline
                for index, item in enumerate(line_list):
                    line_dict[header_list[index]] = item
                result.append(line_dict)
    return result


for row in readcsv('Test.csv'):
    print(row)
    infilename = row['FilePath']
    # create a filename based on the File column
    outfilename = f"{row['File']}.txt"
    with open(infilename) as html_infile:
        # handle() expects a single string, so use read() rather than readlines()
        text = h.handle(html_infile.read())
    with open(outfilename, 'w') as html_outfile:
        html_outfile.write(text)
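
The loop above writes each result to a new file named after the File column, in the current working directory. Since the question asks to save under the original name, here is a minimal sketch of overwriting each file in place. The helper convert_in_place is hypothetical (it is not part of html2text), the UTF-8 default is an assumption, and the prefix stripping assumes Windows-style file:///C:/... paths like those in the question.

```python
from pathlib import Path
from typing import Callable


def convert_in_place(file_path: str, convert: Callable[[str], str],
                     encoding: str = "utf-8") -> Path:
    """Read file_path, pass its contents through convert, write the result back.

    convert is any str -> str function, for example the handle method of an
    html2text.HTML2Text instance. Hypothetical helper, for illustration only.
    """
    prefix = "file:///"
    file_path = file_path.strip()
    # The CSV stores paths as file:/// URIs, but open() needs a plain path.
    if file_path.startswith(prefix):
        file_path = file_path[len(prefix):]
    path = Path(file_path)
    html = path.read_text(encoding=encoding)
    # Overwrites the original file, so test on copies first.
    path.write_text(convert(html), encoding=encoding)
    return path
```

With the h object configured above, the loop body could then be reduced to convert_in_place(row['FilePath'], h.handle). Because this replaces the original HTML, it is worth trying on a copy of the folder first.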

Original answer:

You are missing the final part, taken from the docs.

Note the change in the assignment to the content variable from content = r.content to content = r.text.

I also added a print statement so you can see the difference between content and text.

from csv import DictReader
import requests
import html2text


with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        r = requests.get(row['FilePath'])
        content = r.text
        print(content)
        h = html2text.HTML2Text()
        h.ignore_links = True
        h.bypass_tables = False
        text = h.handle(content)
        print(text)
        # edit to save the converted text to a file
        # for the filename I'm using the url, with some stripping
        # you need to test this code though as I wrote it on mobile
        with open(row['FilePath'].replace('/', '_').replace(':', ''), 'w') as outfile:
            outfile.write(text)

The above answer was based on the false assumption that HTTP requests were being performed. Below is the adjusted answer after discussion in the comments.

from csv import DictReader
import html2text


h = html2text.HTML2Text()
h.ignore_links = True
h.bypass_tables = False

with open('Test.csv', 'r') as read_obj:
    csv_dict_reader = DictReader(read_obj)
    for row in csv_dict_reader:
        infilename = row['FilePath']
        outfilename = row['File']  # I'm using this as it seems this column is what this is meant for
        with open(infilename) as html_infile:
            # handle() expects a single string, so use read() rather than readlines()
            text = h.handle(html_infile.read())
        with open(outfilename, 'w') as html_outfile:
            html_outfile.write(text)

edited Jun 15 '21 at 09:01

answered Jun 13 '21 at 12:35

Edo Akse

Thank you! To finalize: How can I save the converted text into a new .txt file with the same name as the original .txt file? – kra Jun 13 '21 at 13:43
Let me refer you to another SO [answer here.](https://stackoverflow.com/a/5214587/9267296) Do take note that if you're dealing with different character sets, you might run into issues with missing characters. Research `python character encoding` if your resulting text has any �� characters... – Edo Akse Jun 13 '21 at 13:50
Thank you. Would it be possible to help me with a solution how to add this into the code above? Would I need to add? text_file = open(r, "w") text_file.write(text) text_file.close() – kra Jun 13 '21 at 14:15
That would work, but I would advise using `with`, just as you did for reading the CSV file. Gimme a sec to add an example to my answer. – Edo Akse Jun 13 '21 at 14:49
Thank you. One more question: Why is it necessary to add replace('/', '_').replace(':', '')? – kra Jun 13 '21 at 15:04
That's in case `row['FilePath']` contains the `http://` part. Most operating systems do not like filenames containing `:`, and the `/` is used in the file path. Try running it without the replace and see what happens – Edo Akse Jun 13 '21 at 15:18
Depending on what other columns the CSV file contains, you might want to use another value for the file name. I used the one I knew exists... Also, for testing, I advise to use a CSV with only a couple of rows, not one with a couple of thousand rows. – Edo Akse Jun 13 '21 at 15:30
Thank you. I am receiving the error r = requests.get(row['FilePath']) KeyError: 'FilePath' Process finished with exit code 1 – kra Jun 13 '21 at 15:32
That means the CSV file doesn't have a FilePath column. – Edo Akse Jun 13 '21 at 15:46
Note for next time you ask a question on stackoverflow, it helps if you provide a sample of the input data. For example for the CSV file you read from, provide the header row, with one or two rows of (sanitized) data... – Edo Akse Jun 13 '21 at 16:04
Thank you. You are right, sorry for that. I added the first few lines in the question above. Actually I included a column with the name FilePath... – kra Jun 13 '21 at 16:23
So what are the contents of the files that the CSV file points to? Since you were using the requests module, I assumed that the CSV file contained a list of URLs. The requests module is used for HTTP requests, but if the files the CSV file points to contain HTML, then you need to deal with them in the same way as you deal with the CSV file. There's no need to use the requests module if you're only reading local files... – Edo Akse Jun 13 '21 at 16:35
Sorry for the misunderstanding. The CSV file points to the .txt files that are stored locally. – kra Jun 13 '21 at 16:48
How would you change the code after knowing that the files are .txt files and stored locally? – kra Jun 13 '21 at 16:48
These files contain HTML? – Edo Akse Jun 13 '21 at 16:54
Yes, exactly. The .txt files contain HTML code. – kra Jun 13 '21 at 16:58
So basically I would like to convert the HTML code in the .txt file into plain text – kra Jun 13 '21 at 16:58
I've updated my answer. Please test it as I'm posting from mobile and autocorrect is a pain in the behind while trying to write code – Edo Akse Jun 13 '21 at 17:56
Thank you! I am still receiving the error infilename = row['FilePath'] KeyError: 'FilePath' – kra Jun 13 '21 at 18:00
So, the contents of the CSV file, are they **exactly** as you put in your post? Does it include the `|` as well in the file? – Edo Akse Jun 13 '21 at 19:01
No, they do not include the |. Otherwise they are exactly the same, yes – kra Jun 13 '21 at 19:51
I'll need to actually test this, as I don't know the behavior of `DictReader` in this case off the top of my head. I'll update the answer tomorrow... – Edo Akse Jun 13 '21 at 20:11
answer is updated. I basically made a simple file parser, as the file `Test.csv` is not something that `DictReader` can parse... – Edo Akse Jun 14 '21 at 09:20
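
On the character encoding caveat raised in the comments: the answer's open() calls rely on the platform default encoding. A minimal sketch of reading with explicit encodings follows. The function name read_text_with_fallback and the candidate encodings (UTF-8, then cp1252) are assumptions for illustration; the actual encoding of the files is not stated in the thread.

```python
def read_text_with_fallback(path, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn and return the first successful decode.

    Hypothetical helper; the candidate encodings are guesses, not facts
    about the files in the question.
    """
    for enc in encodings:
        try:
            with open(path, encoding=enc) as infile:
                return infile.read()
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and let undecodable
    # bytes become the U+FFFD replacement character.
    with open(path, encoding=encodings[0], errors="replace") as infile:
        return infile.read()
```

The returned string could then be passed to h.handle() in place of html_infile.read(). Passing the same explicit encoding to the open() call that writes the output keeps both sides consistent.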
