Fetching and transforming an HTML document

Question

I want to write a script that downloads a webpage, transforms some of its markup (from HTML to TeX), and then writes the result to a file. I have the first two parts working, but I don't know how to write the output to a file.

I use regular expressions, and I have heard lots of bad things about using them on HTML. How could I do this with one of those HTML parsers instead?
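For reference, a parser-based version could look roughly like the sketch below. It assumes Python 3 and the standard-library html.parser module. The names TAG_TO_TEX, TexConverter and html_to_tex are hypothetical, and the table covers only a handful of the tags handled by the regexes in this question, so treat it as a starting point rather than a full replacement.

```python
from html.parser import HTMLParser

# Hypothetical tag-to-TeX table, mirroring a few of the regex rules in the
# question (opening text, closing text). Tags not listed here are dropped
# and only their text content is kept.
TAG_TO_TEX = {
    "h1": ("\\chapter{", "}\n"),
    "h2": ("\\section{", "}\n"),
    "i": ("\\emph{", "}"),
    "em": ("\\bold{", "}"),
    "p": ("\n", "\n"),
}


class TexConverter(HTMLParser):
    """Collects TeX fragments while the parser walks the HTML."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_TEX:
            self.parts.append(TAG_TO_TEX[tag][0])

    def handle_endtag(self, tag):
        if tag in TAG_TO_TEX:
            self.parts.append(TAG_TO_TEX[tag][1])

    def handle_data(self, data):
        # Escape "#" as the regex list in the question does.
        self.parts.append(data.replace("#", "\\#"))


def html_to_tex(html):
    """Return the TeX text produced from an HTML string."""
    converter = TexConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts)
```

Calling html_to_tex(htmltext) would then take the place of the loop over regexlist. Attribute-dependent rules (the span classes, pre blocks and images) would need extra logic in handle_starttag, which receives the attributes as a list of (name, value) pairs.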

I take a list of URLs, urllist, and a list of lists of regexes, regexlist (in each inner list the first item is the search pattern and the second item is the replacement). I loop over the URLs and save each page, after running it through all the regexes, as one entry in a dictionary. From there, I would like to write each entry of the dictionary to a different file.

By the way, if you want to show a solution in another language to show off its beauty, go ahead; I'm interested in seeing different languages :) And if you answer in Python, feel free to improve on my approach with a better solution.

In any case, here's my code. Any suggestions would be great (I'm completely new to this language, and to programming in general).

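import os
import re
import urllib

# open() does not expand "~", so expand it to the home directory here.
outputpath = os.path.expanduser("~/Desktop/foo/")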
webpath  = "http://learnyouahaskell.com/"
namelist = [ "introduction",
             "starting-out" ]
urllist = [webpath + name for name in namelist]

regexlist = [ [ r"^<!DOCTYPE([\s\S]+?)<h1" , "\\startweb\n\n<h1" ],
              [ r"</p>[\s\n]*?<div class[\s\S]+?</script>\n</body>\n</html>", "</p>\n\n\\stopweb"],
              [ r"<h1.*?>(.+?)</h1>" , r"\\chapter{\1}\n" ],
              [ r"<h2>(.+?)</h2>" , r"\\section{\1}\n" ],
              [ r"<i>(.+?)</i>" , r"\\emph{\1}" ],
              [ r"<em>(.+?)</em>" , r"\\bold{\1}" ],
              [ "<p>", "\n" ], [ "</p>", "\n" ],
              [ "<pre name=\"code\" class=\"haskell: (.+?)\">([\\s\\S]+?)\n?</pre>" , r"\n\\starthaskell[\1]\2\n\\stophaskell\n" ],
              [ r"\n\n\\haskell" , r"\n\\haskell" ],
              [ "<span class=\"fixed\">(.+?)</span>" , r"\\typehaskell{\1}"],
              [ "<span class=\"label function\">(.+?)</span>" , r"\\haskellfunction{\1}"],
              [ "<span class=\"(class label|label class)\">(.+?)</span>" , r"\\haskellclass{\2}"],
              [ "<span class=\"label type\">(.+?)</span>" , r"\\haskelltype{\1}"],
              [ "<img src=\"(http://s3.amazonaws.com/lyah/)(.+?)\\.png\" (alt|class)=\"(.+?)\" (class|alt)=\"(.+?)\" width=\"(\\d+)\" height=\"(\\d+)\">" , r"\n\\placeimage[\2]{url=\1\2.png,\3=\4,\5=\6,width=\7pt,height=\8pt}\n" ],
              [ "<a href=\"(.+?)\">(.+?)</a>" , r"\\url[\1]{\2}" ],
              [ "<a.*?></a>", "" ],
              [ "#" , r"\\#" ],
              [ "&amp;" , "&" ],
              [ "&hellip;" , r"\\dots" ],
              [ "&gt;" , ">" ],
              [ "&lt;" , "<" ]
            ]

finaldoc = {}
for i in range(len(namelist)):
  htmlfile = urllib.urlopen(urllist[i])
  htmltext = htmlfile.read()
  for regex in regexlist:
    searchpattern  = regex[0]
    replacepattern = regex[1]
    htmltext = re.sub(searchpattern, replacepattern, htmltext)
  finaldoc[namelist[i]] = htmltext
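
The missing last step, writing each dictionary entry to its own file, could be sketched as follows. This assumes Python 3 (the code above targets Python 2), and both the helper name write_documents and the ".tex" extension are hypothetical choices rather than anything fixed by the question.

```python
import os


def write_documents(docs, output_dir, extension=".tex"):
    """Write each value of docs to output_dir/<key><extension>.

    Returns the list of paths written, in sorted key order.
    """
    # open() does not expand "~", so expand it to the home directory first.
    output_dir = os.path.expanduser(output_dir)
    # Create the target directory if it does not exist yet.
    os.makedirs(output_dir, exist_ok=True)

    paths = []
    for name, text in sorted(docs.items()):
        path = os.path.join(output_dir, name + extension)
        # The with-block closes the file even if the write fails.
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)
    return paths
```

With the variables from the question, the call would be write_documents(finaldoc, outputpath), which would produce files such as introduction.tex and starting-out.tex inside the output directory.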

Watch out, "You can't parse html with regex" people are coming - akalikin, Sep 21 '15 at 12:00
I know. But for me this works (I'm not parsing tons of webpages, just a few, with a finite number of tags, etc.). In case it's not okay, I hope those people offer an alternative :) - Manuel, Sep 21 '15 at 12:02
Might be better suited for [Code Review](http://codereview.stackexchange.com) - 301_Moved_Permanently, Sep 21 '15 at 12:03
You can probably solve this easily with [*pandoc*](https://hackage.haskell.org/package/pandoc-1.15.0.6), which is both a document conversion application and a Haskell library. Other libraries such as [*tagsoup*](https://hackage.haskell.org/package/tagsoup) can help if you need to adjust the HTML before the conversion. - duplode, Sep 21 '15 at 15:01
For printing the result to a file, see this SO question: http://stackoverflow.com/q/5214578/866915 - ErikR, Sep 21 '15 at 15:36
