
I have a folder full of HTML files, as follows:

aaa.html
bbb.html
ccc.html
....
......
.........
zzz.html

All these HTML files are created by a Python script, and hence follow the same template.

Now I want to link all these HTML files to each other. The HTML already contains placeholders for this, as follows:

<nav>
    <ul class="pager">
        <li class="previous"><a href="#">Previous</a></li>
        <li class="next"><a href="#">Next</a></li>
    </ul>
</nav>

I want to fill these placeholders with the filenames in the folder. For example, bbb.html will contain:

<nav>
    <ul class="pager">
        <li class="previous"><a href="aaa.html">Previous</a></li>
        <li class="next"><a href="ccc.html">Next</a></li>
    </ul>
</nav>

and the ccc.html file will contain:

<nav>
    <ul class="pager">
        <li class="previous"><a href="bbb.html">Previous</a></li>
        <li class="next"><a href="ddd.html">Next</a></li>
    </ul>
</nav>

And so on for the rest of the files. Can this task be done using Python? I don't even know where to start. Any hints or suggestions would be really helpful.

kingmakerking
  • Is the order of the HTML files truly alphabetical? If you have AAA.html and aaa.html, which comes first? – philshem Apr 07 '17 at 08:20
  • You can use `os.walk` to list the files in that directory and sort them with a custom sorting function. Then iterate over that list, reading each file with Beautiful Soup (the library often used for web scraping), and change those two placeholders to the previous and next elements in the list. – Tomasz Plaskota Apr 07 '17 at 08:23
  • @philshem The order really doesn't matter. It is just that each file has to be linked with two others, so any order would do. – kingmakerking Apr 07 '17 at 08:27

2 Answers


You can use the Beautiful Soup library (`bs4`) to modify HTML:

from bs4 import BeautifulSoup

file_names = ['aaa.html', 'bbb.html', ... , 'zzz.html']
# the loop skips the first and last files (not sure what to do with them?)

for ind in range(1, len(file_names) - 1):
    with open(file_names[ind], 'r+', encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
        # we suppose that there is only one link for previous and next;
        # the href belongs to the <a> inside each <li>, not to the <li> itself
        soup.find(class_='previous').a['href'] = file_names[ind - 1]
        soup.find(class_='next').a['href'] = file_names[ind + 1]
        # erase contents and replace with new html
        f.seek(0)
        f.truncate()
        f.write(soup.prettify())  # to get readable HTML (returns a str)

If the filenames aren't as consistent as in your example, and you want to generate the list from the files in the directory, you can use os.walk or glob.glob.
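
As a rough sketch of the `glob.glob` route, the file list and each file's neighbours could be built as below. The helper name `neighbours` is hypothetical, and the choice to leave the first file without a previous link and the last file without a next link (both `None`) is an assumption, not something the question specifies.

```python
import glob
import os


def neighbours(directory):
    """Map each .html file name in `directory` to (previous, next) names.

    Hypothetical helper: the first file gets None as its previous entry
    and the last file gets None as its next entry.
    """
    # glob.glob returns paths in arbitrary order, so sort explicitly
    paths = sorted(glob.glob(os.path.join(directory, '*.html')))
    names = [os.path.basename(p) for p in paths]
    result = {}
    for ind, name in enumerate(names):
        prev_name = names[ind - 1] if ind > 0 else None
        next_name = names[ind + 1] if ind < len(names) - 1 else None
        result[name] = (prev_name, next_name)
    return result
```

The returned names can then be fed to the Beautiful Soup loop above in place of the hard-coded `file_names` list.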

TrakJohnson

You can replace elements of your template by looping over the file list, with list wrapping. Here's an example for aaa.html, using the files aaa, bbb and ccc:

#f = ['aaa.html','bbb.html','ccc.html']
f = sorted(['aaa.html','bbb.html','ccc.html'])  # explicit sorting

t = """<nav>
    <ul class="pager">
        <li class="previous"><a href="#">Previous</a></li>
        <li class="next"><a href="#">Next</a></li>
    </ul>
</nav>"""  # sample aaa.html file

i = f.index('aaa.html')  # position of the file being edited
n = len(f)
# the modulo wraps the index around, so the list behaves like a ring
t = t.replace('<li class="previous"><a href="#">Previous','<li class="previous"><a href="'+f[(i - 1) % n]+'">Previous')
t = t.replace('<li class="next"><a href="#">Next','<li class="next"><a href="'+f[(i + 1) % n]+'">Next')

print(t)

To do the list wrapping, I index the list modulo its length (after zzz comes aaa).

This gives the following output for aaa.html:

<nav>
    <ul class="pager">
        <li class="previous"><a href="ccc.html">Previous</a></li>
        <li class="next"><a href="bbb.html">Next</a></li>
    </ul>
</nav>

To complete the code, you'd have to loop over *.html files (see glob.glob)
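
A minimal sketch of that complete loop might look like this, assuming every file contains the placeholders exactly as they appear in the template and is encoded as UTF-8. The function name `link_pages` and the two placeholder constants are hypothetical names introduced for illustration.

```python
import glob
import os

# Placeholder fragments, copied from the template in the question
PLACEHOLDER_PREV = '<li class="previous"><a href="#">Previous'
PLACEHOLDER_NEXT = '<li class="next"><a href="#">Next'


def link_pages(directory):
    """Fill the pager placeholders in every .html file in `directory`."""
    paths = sorted(glob.glob(os.path.join(directory, '*.html')))
    names = [os.path.basename(p) for p in paths]
    n = len(names)
    for i, path in enumerate(paths):
        prev_name = names[(i - 1) % n]  # wraps: before the first comes the last
        next_name = names[(i + 1) % n]  # wraps: after the last comes the first
        with open(path, encoding='utf-8') as fh:  # UTF-8 is an assumption
            html = fh.read()
        html = html.replace(
            PLACEHOLDER_PREV,
            '<li class="previous"><a href="' + prev_name + '">Previous')
        html = html.replace(
            PLACEHOLDER_NEXT,
            '<li class="next"><a href="' + next_name + '">Next')
        with open(path, 'w', encoding='utf-8') as fh:
            fh.write(html)
```

Calling `link_pages('.')` would rewrite the files in the current directory in place, so it is worth trying it on a copy of the folder first.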

philshem