Extracting content from offline HTML files

Question

I have a large number of offline HTML files, and I need to extract the name, address, etc. from each of them and create a CSV file.

I first tried to do it with a batch file, for example:

for /r %%i in (*) DO (
  findstr /o "name" %%i >> results.txt
  ECHO ; >> results.txt

  findstr /o "STREET" %%i >> results.txt
  ECHO ; >> results.txt

  etc

ECHO xxxendlinexxx >> results.txt                                       
)

It works, but it produces a long file that still needs a lot of work with regular expressions. I think there must be a better way to read the content of a tag in HTML.

I found the Python HTML parser:

from html.parser import HTMLParser

But I don't know how to use it on an offline file and for a specific tag (id="something"). I have googled and watched tutorials on YouTube, but I haven't found an easy, understandable solution.

Can you help? Ideally with an example showing:

How to open a file
How to find the content of a specific tag
How to save the content to another file

Thank you for your help.

If you don't provide an example html file with the required data and an example of what you want your csv to look like, how do you expect us to create a reasonable solution? — Compo, Nov 04 '16 at 15:12

Anton · Answer 1 · 2016-11-08T12:16:36.560

If you want to use html.parser, here is an example. Suppose you want to create a parser that collects the main titles (h1) whose id is in a given list:

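from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    # ids of the <h1> elements whose text should be collected
    target_id = ['article-1-b', 'article-2-a']

    def __init__(self):
        super().__init__()
        self.my_titles = []      # per-instance list, not shared between parsers
        self.COPY_DATA = False   # True right after a matching <h1> was opened

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so convert it to a dict
        if tag == 'h1' and dict(attrs).get('id') in self.target_id:
            self.COPY_DATA = True

    def handle_data(self, data):
        if self.COPY_DATA:
            self.my_titles.append(data)
            self.COPY_DATA = False


parser = MyHTMLParser()
with open('my_file.html') as f:
    parser.feed(f.read())

print(parser.my_titles)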
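
To get from a single file to the CSV asked for in the question, the same html.parser idea can be wrapped in a loop over a folder. The following is only a sketch under assumptions: the names FieldParser, extract_fields and html_dir_to_csv are made up for illustration, the ids 'name' and 'street' are guesses at what the real files use, the files are assumed to be UTF-8, and an element is considered finished at the first closing tag with the same name (so a div nested in a wanted div would cut it short). Matching on class instead of id would be a similar check on dict(attrs).get('class').

```python
import csv
from html.parser import HTMLParser
from pathlib import Path


class FieldParser(HTMLParser):
    """Collect the text of every element whose id is in wanted_ids."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.fields = {}        # id -> collected text
        self._open_tag = None   # tag name of the element being read
        self._open_id = None    # id of the element being read

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        element_id = dict(attrs).get('id')
        if self._open_tag is None and element_id in self.wanted_ids:
            self._open_tag = tag
            self._open_id = element_id
            self.fields[element_id] = ''

    def handle_data(self, data):
        # Text inside nested tags (for example <b>) is appended as well
        if self._open_tag is not None:
            self.fields[self._open_id] += data

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self.fields[self._open_id] = self.fields[self._open_id].strip()
            self._open_tag = None


def extract_fields(html_text, wanted_ids):
    """Return a dict {id: text} for one HTML document given as a string."""
    parser = FieldParser(wanted_ids)
    parser.feed(html_text)
    parser.close()
    return parser.fields


def html_dir_to_csv(source_dir, csv_path, wanted_ids):
    """Write one CSV row per *.html file found below source_dir."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.DictWriter(out, fieldnames=['file', *wanted_ids])
        writer.writeheader()
        for path in sorted(Path(source_dir).rglob('*.html')):
            text = path.read_text(encoding='utf-8', errors='replace')
            # ids missing from a file are left as empty cells by DictWriter
            writer.writerow({'file': path.name, **extract_fields(text, wanted_ids)})
```

A call such as html_dir_to_csv('pages', 'results.csv', ['name', 'street']) would then produce one row per file, with the folder and file names here being placeholders.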

edited Nov 08 '16 at 12:16

answered Nov 04 '16 at 15:20

Anton


Thank you, this helps. But I have one more question: I can add more tags, like h2 etc. But in some cases I must use "id" or "class" etc. to find the right tag. I found some code here on Stack Overflow; it works, but I'm not able to combine the two in one function. Do you understand what I mean? It is here: http://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div – Firejs Nov 07 '16 at 13:20
I'm not sure I understand, but I have edited the code. Have a look and tell me. – Anton Nov 08 '16 at 12:17

furas · Answer 2 · 2016-11-04T16:34:26.130

You can use the xml module instead of html.parser to work with XML or HTML. It is easier, although it only parses well-formed markup (every tag closed, attribute values quoted).

I use the xml.etree module, but there are others (doc: xml).

You can read from a file (ET.parse(filename)), but in the example I use a string.

You have to learn how to use XPath (e.g. './/div[@id="something"]') to find elements.

import xml.etree.ElementTree as ET

html_string = '''<html>
<body>
<div id="something">Hello</div>
<div id="something">World</div>
</body>
</html>'''

#tree = ET.parse(filename)
tree = ET.fromstring(html_string)

divs = tree.findall('.//div[@id="something"]')

# --- screen ---

for d in divs:
    print(d.text)

# --- file ---

with open('output.txt', 'w') as f:
    for d in divs:
        f.write(d.text + '\n')
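
Reading from files and writing a CSV with the same ElementTree approach could look roughly like the sketch below. The function names row_from_file and write_csv are hypothetical, as are the ids passed to them. It assumes every input file is well-formed; ET.parse raises ET.ParseError on typical sloppy HTML (unclosed tags, entities such as &nbsp;), in which case a tolerant parser like html.parser is the safer choice.

```python
import csv
import xml.etree.ElementTree as ET


def row_from_file(source, ids):
    """Return the text of the first element found for each id ('' if absent).

    source may be a file name or an open file object.
    """
    tree = ET.parse(source)
    row = []
    for element_id in ids:
        # '*' matches any tag name, so this works for div, h1, span, ...
        element = tree.find(f'.//*[@id="{element_id}"]')
        if element is None:
            row.append('')
        else:
            # itertext() also yields the text of nested tags such as <b>
            row.append(''.join(element.itertext()).strip())
    return row


def write_csv(csv_path, filenames, ids):
    """Write a header line and one row per input file."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['file', *ids])
        for filename in filenames:
            writer.writerow([filename, *row_from_file(filename, ids)])
```

Using csv.writer instead of writing lines by hand means that values containing commas or quotes are escaped correctly.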
