Reading text from multiple HTML files and consolidating it into a different HTML file with a Python script

Question

I am writing a Python script in which a loop looks, across multiple directories, for HTML pages whose names contain the string '_CriteriaOutput.html'. Each directory contains many HTML files, 4-5 of which match that string. I want to read the files whose names end in '_CriteriaOutput.html' and consolidate their contents into a different HTML file. My code so far is below. It prints the full HTML source of each file, which is useless to me; I want only the text (if any is present in the HTML file).

import os

NightlyLogs = r'C:/Users/<user>/Desktop/Nightly_Logs/2015_07_16-0940'
# Changelist folders directly under the nightly log directory, in sorted order.
folders = sorted(fol for fol in os.listdir(NightlyLogs)
                 if os.path.isdir(os.path.join(NightlyLogs, fol)))
for folder in folders:
    HtmlLoc = os.path.join(NightlyLogs, folder)
    criteria_files = [name for name in os.listdir(HtmlLoc)
                      if name.endswith('_CriteriaOutput.html')]
    for one in criteria_files:
        HtmlFile = os.path.join(HtmlLoc, one)
        with open(HtmlFile, 'r') as open_file:
            # This prints the raw HTML source, not just the text.
            print(open_file.read())

NightlyLogs is a location that contains folders named after CLs (changelists), e.g. 876564 or 865664. Each HTML file, e.g. A_CriteriaOutput.html or B_CriteriaOutput.html, contains information for a specific series (say A, B, C, etc.), and each CL folder contains a similar set of _CriteriaOutput.html files holding information only for that CL. I want to make a table with the CLs as columns and A, B, C, D, E as rows, where each cell contains the info for that particular series. I have tried to be specific, but if you think some information is missing, please let me know. I'll try to provide as much info as I can. Thanks.

possible duplicate of [Strip HTML from strings in Python](http://stackoverflow.com/questions/753052/strip-html-from-strings-in-python) — SuperBiasedMan, Jul 20 '15 at 08:22
I couldn't find a complete answer to my question in the thread mentioned above. Besides, my question is different and is more about creating HTML tables. — Anurag Tiwary, Jul 20 '15 at 08:40
The information in there should help you get further though. It doesn't explain how to consolidate the information into a table but it has good information on how to read the information from the files. — SuperBiasedMan, Jul 20 '15 at 08:55
Thank you for pointing that out. You are actually right. Once I get any replies, I'll provide more information on that. — Anurag Tiwary, Jul 20 '15 at 09:08

adrianus · Answer 1 · 2015-07-20T11:06:28.577

So your question is

I want to make a Table with CL as column and A, B, C, D, E as row which will contain the info for that particular series.

Something like this?

    876564 | 865664 | ...
A |  ...   |  ...   | ...
B |  ...   |  ...   | ...

If I read your question correctly, changelist names (876564, ...) are folder names and A, B, ... are the part of the filename before _CriteriaOutput.html.
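
As a small illustration of that mapping, the sketch below splits a path into its series name and changelist. The helper name `series_and_changelist` and the sample paths are assumptions made for the example; only the `_CriteriaOutput.html` suffix comes from the question.

```python
import os

SUFFIX = '_CriteriaOutput.html'  # suffix taken from the question


def series_and_changelist(path):
    """Return (series, changelist) for a path like <CL folder>/<series>_CriteriaOutput.html."""
    folder, filename = os.path.split(path)
    if not filename.endswith(SUFFIX):
        raise ValueError('not a criteria output file: %s' % filename)
    series = filename[:-len(SUFFIX)]       # everything before the suffix
    changelist = os.path.basename(folder)  # name of the containing folder
    return series, changelist


# Hypothetical sample path, for illustration only.
print(series_and_changelist('Nightly_Logs/876564/A_CriteriaOutput.html'))
```

Slicing off the full suffix, rather than splitting on the first underscore, also copes with a series name that itself contains an underscore.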

I would first collect the data from all the files, in a similar way to what you did, and at the end you can print it in any way you want.

import os

def pretty_print(change_list):
    # Collect every changelist (folder) name seen for any series,
    # sorted so the column order is stable.
    change_names = sorted({name for category_list in change_list.values()
                           for name in category_list})
    # First row is the header: an empty corner cell, then the changelist names.
    list_of_lists = [[''] + change_names]
    for category, category_list in sorted(change_list.items()):
        # One row per series; '-' marks a changelist with no file for that series.
        row = [category] + [category_list.get(name, '-') for name in change_names]
        list_of_lists.append(row)

    for line in list_of_lists:
        print('\t'.join(line))

# Maps series name (A, B, ...) -> {changelist folder name: file content}.
change_list = {}
NightlyLogs = r'C:/Users/<user>/Desktop/Nightly_Logs/2015_07_16-0940'
folders = sorted(fol for fol in os.listdir(NightlyLogs)
                 if os.path.isdir(os.path.join(NightlyLogs, fol)))
for folder in folders:
    HtmlLoc = os.path.join(NightlyLogs, folder)
    criteria_files = [name for name in os.listdir(HtmlLoc)
                      if name.endswith('_CriteriaOutput.html')]
    for one in criteria_files:
        # The part of the filename before the first underscore is the series name.
        change_name = one.split('_')[0]
        if change_name not in change_list:
            change_list[change_name] = {}
        HtmlFile = os.path.join(HtmlLoc, one)
        with open(HtmlFile, 'r') as open_file:
            file_content = open_file.read()
        print(change_name, '|', folder, '|', file_content)
        change_list[change_name][folder] = file_content

print('\nTable of changes:')
pretty_print(change_list)

Output of some example data (first the files / folder names / content are printed while reading, and later with pretty_print() the table gets printed):

A | 876564 | foo
B | 876564 | foo B
A | 876565 | foobar
B | 876565 | foo
A | 876566 | bar
C | 876566 | bar C

Table of changes:
    876564  876565  876566
A   foo     foobar  bar
B   foo B   foo     -
C   -       -       bar C
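
The table above holds whatever each file contains, so for real HTML files the cells would include markup, and the result is tab-separated text rather than an HTML page. Here is a minimal sketch of one way to cover both points, assuming Python 3 and only the standard library: `html_to_text` reduces markup to its visible text using `html.parser`, and `render_table` turns a nested dict with the same layout as `change_list` into an HTML table. Both function names, the `TextExtractor` class and the sample data are hypothetical, not part of the original script.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Text nodes are joined with a space, then runs of whitespace are collapsed.
    return ' '.join(' '.join(parser.parts).split())


def render_table(change_list):
    """Render {series: {changelist: text}} as an HTML table string."""
    changelists = sorted({cl for row in change_list.values() for cl in row})
    header = ''.join('<th>%s</th>' % html.escape(cl) for cl in [''] + changelists)
    lines = ['<table border="1">', '<tr>%s</tr>' % header]
    for series, row in sorted(change_list.items()):
        cells = ['<th>%s</th>' % html.escape(series)]
        # '-' marks a changelist with no file for this series.
        cells += ['<td>%s</td>' % html.escape(row.get(cl, '-')) for cl in changelists]
        lines.append('<tr>%s</tr>' % ''.join(cells))
    lines.append('</table>')
    return '\n'.join(lines)


# Made-up sample data in the same shape as change_list.
sample = {
    'A': {'876564': html_to_text('<p>foo</p>'), '876565': html_to_text('<p>foobar</p>')},
    'B': {'876564': html_to_text('<p>foo <b>B</b></p>')},
}
print(render_table(sample))
```

In the collecting loop, storing `html_to_text(file_content)` instead of the raw content would keep only the text, and writing the string returned by `render_table` to a file opened with `open(path, 'w')` would produce the consolidated HTML page. A dedicated HTML library may handle malformed markup more gracefully than this sketch.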

Thanks for the response @adrianus. You understood it correctly but looks like it still needs some workaround. I'll try it out and let you know the output. I'll also try to make some changes and come up with something. Thanks once again. — Anurag Tiwary, Jul 20 '15 at 10:38
@AnuragTiwary You're welcome, just post it here if there's still a problem. Consider choosing an accepted answer if it helped you :-) — adrianus, Jul 22 '15 at 05:08
