How do I extract text from multiple HTML files and output it into one CSV?

Question

I am trying to extract certain texts based on surrounding words/patterns and output the information to a file called sample.csv.

For example, I have a directory of files:

file1.html file2.html file3.html

Each file contains the following structure. For example, file1.html:

<strong>Hello world</strong>

<p><strong>Name:</strong> John Smith</p>
<p>Some text</p>

<p><strong>Location</strong></p>

<blockquote>
<p>122 Main Street &amp; City, ST 12345 &gt;</p>
</blockquote>
<p>Some text</p>

Based on the above HTML structure, I want to output it to a sample.csv file that looks like this:

filename,name,location
file1.html,John Smith,122 Main Street
file2.html,Mary Smith,123 North Road
file3.html,Kate Lee,90 Winter Lane

I have the following Python code:

import os
import csv
import re

csv_cont = []
directory = os.getcwd()
for root,dir,files in os.walk(directory):
    for file in files:

        if file.endswith(".html"):
            f = open(file, 'r')
            
            name = re.search('<p><strong>Name:</strong>(.*)</p>', f)
            
            location = re.search('<p><strong>Location</strong></p><blockquote><p>(.*)&amp;', f)

            tmp = []
            tmp.append(file)
            tmp.append(name)
            
            tmp.append(location)

            csv_cont.append(tmp)    
            f.close()


#Change name of test.csv to whatever you want
with open("sample.csv", 'w', newline='') as myfile:
     wr = csv.DictWriter(myfile, fieldnames = ["filename", "name", "location"], delimiter = ',')
     wr.writeheader()
     wr = csv.writer(myfile)
     wr.writerows(csv_cont)

I am getting the following error:

    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

What is the issue here?

You probably need `s = f.read()` and then do the searches against `s`. — Ouroborus, Dec 18 '20 at 00:10
I'm pretty sure you shouldn't be parsing html using regular expressions. https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags Use a proper parser instead. — NotAName, Dec 18 '20 at 00:16
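As a rough illustration of the parser suggestion above, here is a minimal sketch using the standard library's `html.parser`. It assumes every file follows the exact layout shown in the question (a `<strong>Name:</strong>` label followed by the name in the same paragraph, and a `<strong>Location</strong>` label followed by a `<blockquote>`). The class name `FieldParser` and the helper `parse_fields` are hypothetical names chosen for this example, not part of any library.

```python
from html.parser import HTMLParser


class FieldParser(HTMLParser):
    """Collects the name and location from the question's HTML layout."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True turns &amp; into &
        self.name = ""
        self.location = ""
        self._in_strong = False
        self._in_blockquote = False
        self._label = ""  # text of the most recent <strong> element

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True
            self._label = ""
        elif tag == "blockquote":
            self._in_blockquote = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False
        elif tag == "blockquote":
            self._in_blockquote = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_strong:
            self._label += text
        elif not text:
            return  # whitespace between tags
        elif self._label == "Name:" and not self.name:
            self.name = text
        elif (self._label == "Location" and self._in_blockquote
              and not self.location):
            # Keep only the street part before the "&" separator.
            self.location = text.split("&")[0].strip()


def parse_fields(text):
    """Hypothetical helper: return (name, location) for one document."""
    parser = FieldParser()
    parser.feed(text)
    parser.close()
    return parser.name, parser.location
```

Because the parser decodes entities and ignores whitespace between tags, this version does not depend on the exact number of newlines in the file. A third-party parser would also work; the standard library is used here only to keep the sketch dependency-free.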

Aaj Kaal · Accepted Answer · 2020-12-18T02:02:47.190


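You need to read the file's contents and run the search against that string, because `re.search` expects a string or bytes-like object, not a file object. Replace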

f = open(file, 'r')
name = re.search('<p><strong>Name:</strong>(.*)</p>', f)
            
location = re.search('<p><strong>Location</strong></p><blockquote><p>(.*)&amp;', f)

with

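f = open(os.path.join(root, file), 'r')  # join with root so files in subdirectories open too
file_content = f.read()  # re.search needs a string, not the file object
# group(1) returns the captured text; the name keeps the space after "</strong>"
name = re.search(r'<p><strong>Name:</strong>(.*)</p>', file_content).group(1)
# the pattern spells out the newlines between the tags exactly as they appear in the file
location = re.search(r'<p><strong>Location</strong></p>\n\n<blockquote>\n<p>(.*)&amp;', file_content).group(1)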

Corrected: use `file_content` instead of `f` in your search.

Use `.group(1)` to get the captured text from the match.

Output:

filename,name,location
file1.html, John Smith,122 Main Street
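For reference, a complete version of the script might look like the sketch below. It is not the original poster's code: the function names `extract_fields` and `write_csv` are made up for this example, the patterns use `\s*` so they do not depend on the exact whitespace between tags, and a field that is not found is written as an empty string instead of raising an error. It assumes the files are UTF-8 and follow the layout from the question.

```python
import csv
import html
import os
import re

# Patterns tolerate any whitespace between tags; (.*?) is non-greedy.
NAME_RE = re.compile(r"<p><strong>Name:</strong>\s*(.*?)\s*</p>")
LOCATION_RE = re.compile(
    r"<p><strong>Location</strong></p>\s*<blockquote>\s*<p>(.*?)\s*&amp;"
)


def extract_fields(text):
    """Return (name, location); a field is "" when its pattern is absent."""
    name = NAME_RE.search(text)
    location = LOCATION_RE.search(text)
    return (
        html.unescape(name.group(1)) if name else "",
        html.unescape(location.group(1)) if location else "",
    )


def write_csv(directory, out_path):
    """Walk directory and write one CSV row per .html file."""
    rows = []
    for root, _dirs, files in os.walk(directory):
        for filename in sorted(files):
            if filename.endswith(".html"):
                # Join with root so files in subdirectories open correctly.
                path = os.path.join(root, filename)
                with open(path, encoding="utf-8") as f:
                    name, location = extract_fields(f.read())
                rows.append([filename, name, location])
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "name", "location"])
        writer.writerows(rows)
```

Calling `write_csv(os.getcwd(), "sample.csv")` would mirror the question's setup. A single `csv.writer` handles both the header and the rows, so the `DictWriter` from the question is not needed, and the `with` blocks close each file without an explicit `f.close()`.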

edited Dec 18 '20 at 02:02

answered Dec 18 '20 at 00:37


I'm getting the following error: `f.close() AttributeError: 'str' object has no attribute 'close'` – hy9fesh Dec 18 '20 at 01:03
I am getting a CSV that looks like this: `filename,name,location file1.html,"",` The location isn't showing up at all. – hy9fesh Dec 18 '20 at 01:37
