-2

I have a large number of offline HTML files and need to extract the name, address, etc. from each of them and build a CSV file.

I first tried to do it with a batch file, for example:

for /r %%i in (*) DO (
  findstr /o "name" %%i >> results.txt
  ECHO ; >> results.txt

  findstr /o "STREET" %%i >> results.txt
  ECHO ; >> results.txt

  etc

ECHO xxxendlinexxx >> results.txt                                       
)

It works, but it produces a long file that then needs a lot of work with regular expressions. I think there must be a better way to read the content of a tag in HTML.

I found the Python HTML parser:

from html.parser import HTMLParser

But I don't know how to use it on an offline file and for a specific tag (id="something"). I have googled and watched tutorials on YouTube, but I haven't found an easy, understandable solution.

Can you help? Ideally with an example of:

  1. How to open a file
  2. How to find the content of a specific tag
  3. How to save the content to another file

Thank you for your help.

Firejs
    If you don't provide an example html file with the required data and an example of what you want your csv to look like, how do you expect us to create a reasonable solution? – Compo Nov 04 '16 at 15:12

2 Answers

0

If you want to use html.parser, here is an example. Suppose you want a parser that collects the main titles (h1) that have specific ids:

from html.parser import HTMLParser

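class MyHTMLParser(HTMLParser):
    # ids of the <h1> elements whose text should be collected
    target_id = ['article-1-b', 'article-2-a']

    def __init__(self):
        super().__init__()
        self.my_titles = []      # collected titles, one list per parser instance
        self.COPY_DATA = False   # True while inside a matching <h1>

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, so convert it to a dict
        if tag == 'h1' and dict(attrs).get('id') in self.target_id:
            self.COPY_DATA = True

    def handle_data(self, data):
        if self.COPY_DATA:
            self.my_titles.append(data)
            self.COPY_DATA = False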


parser = MyHTMLParser()
with open('my_file.html') as f:
    parser.feed(f.read())

print(parser.my_titles)
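
The parser above is tied to h1 and to the id attribute. As a rough sketch of how the same idea could be extended to any tag matched by "id" or "class", something like the following might work. The names AttrTextParser and extract, and the sample HTML, are made up for illustration; the sketch also assumes you want all text inside the matching element, including text in nested tags.

```python
from html.parser import HTMLParser


class AttrTextParser(HTMLParser):
    """Collect the text of elements matching a tag and one attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0      # > 0 while inside a matching element
        self.parts = []     # text fragments of the current match
        self.results = []   # finished texts, one per matching element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # same tag name nested inside a match: remember the nesting level
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            found = dict(attrs).get(self.attr) or ''
            # "class" may hold several space-separated names
            if self.value in found.split():
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append(''.join(self.parts).strip())
                self.parts = []

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract(html, tag, attr, value):
    """Return the text of every <tag> whose attr contains value."""
    parser = AttrTextParser(tag, attr, value)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up sample input, only to show the calls
sample = '''<h1 id="article-1-b">First <b>title</b></h1>
<div class="box street">Main Street 5</div>
<div class="other">ignored</div>'''

print(extract(sample, 'h1', 'id', 'article-1-b'))
print(extract(sample, 'div', 'class', 'street'))
```

For a file on disk, the string passed to extract would come from f.read(), as in the code above.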
Anton
  • Thank you, this helps. But I have one more question: I can add more tags, like h2 etc., but in some cases I must use "id" or "class" to find the right tag. I found some code here on Stack Overflow that works, but I'm not able to combine the two in one function. Do you see what I mean? Here it is: http://stackoverflow.com/questions/3276040/how-can-i-use-the-python-htmlparser-library-to-extract-data-from-a-specific-div – Firejs Nov 07 '16 at 13:20
  • I'm not sure I understand, but I have edited the code. Take a look and tell me. – Anton Nov 08 '16 at 12:17
-1

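You can use the xml module instead of html.parser to work with XML or HTML. It is easier, as long as the HTML is well-formed XML (every tag closed, attribute values quoted); otherwise the XML parser raises an error.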

I use xml.etree, but there are other modules (see the documentation of the xml package).

You can read from a file with ET.parse(filename), but in the example I use a string.

You have to learn how to use XPath (e.g. './/div[@id="something"]') to find elements.

import xml.etree.ElementTree as ET

html_string = '''<html>
<body>
<div id="something">Hello</div>
<div id="something">World</div>
</body>
</html>'''

#tree = ET.parse(filename)
tree = ET.fromstring(html_string)

divs = tree.findall('.//div[@id="something"]')

# --- screen ---

for d in divs:
    print(d.text)

# --- file ---

with open('output.txt', 'w') as f:  # 'w' is required to write to the file
    for d in divs:
        f.write(d.text + '\n')
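
Putting the three steps from the question together (open each file, find the element, save the result), a minimal sketch with xml.etree and the csv module could look like this. The ids "name" and "street", the ';' delimiter, the function name files_to_csv and the demo files are assumptions for illustration, not taken from the real data. Files that are not well-formed XML are simply skipped here.

```python
import csv
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical CSV columns mapped to the XPath that locates each value.
FIELDS = {
    'name': './/div[@id="name"]',
    'street': './/div[@id="street"]',
}


def files_to_csv(folder, csv_path, fields=FIELDS):
    """Write one CSV row per *.html file found under folder."""
    with open(csv_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out, delimiter=';')
        writer.writerow(['file', *fields])
        for path in sorted(Path(folder).rglob('*.html')):
            try:
                root = ET.parse(str(path)).getroot()
            except ET.ParseError:
                # not well-formed XML; such a file needs a real HTML parser
                continue
            # findtext returns None when nothing matches, hence the "or ''"
            row = [(root.findtext(xpath) or '').strip()
                   for xpath in fields.values()]
            writer.writerow([path.name, *row])


# Demo with two throwaway files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'a.html').write_text(
        '<html><body><div id="name">Alice</div>'
        '<div id="street">Main Street 5</div></body></html>',
        encoding='utf-8')
    # Mismatched tags on purpose: this file is skipped.
    Path(tmp, 'b.html').write_text('<html><body><p>broken</body></html>',
                                   encoding='utf-8')
    csv_file = Path(tmp, 'results.csv')
    files_to_csv(tmp, csv_file)
    csv_text = csv_file.read_text(encoding='utf-8')

print(csv_text)
```

With real files you would call files_to_csv with your own folder and output path, and adjust FIELDS to the ids that actually occur in the pages.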
furas