Parsing an HTML page with Python

Question

I would like to parse the source code of a web page, along these lines:

if something == "<BODY>":
    while something != "</BODY>":
        if something == "https":
            put the word on a list

The thing is, I don't know how to do the parsing (I mean, which function to use to read through the source code). I already have the source code in an object, e.g. MyObj.

What is the best way to do this?

score 3 · Accepted Answer · answered Oct 07 '12 at 02:49

Use an HTML parsing library to parse the HTML. Two popular, well-regarded ones are Beautiful Soup and lxml.
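As a rough sketch of what the question's pseudocode could look like with Beautiful Soup: the snippet below collects every `https` link found inside `<body>`. It assumes the `bs4` package is installed; the function name `https_links_in_body` and the sample page are made up for illustration, and "the word" from the question is interpreted here as the `href` of an `<a>` tag.

```python
from bs4 import BeautifulSoup

def https_links_in_body(html):
    """Return the href of every <a> inside <body> that starts with 'https'."""
    # "html.parser" selects Python's built-in parser as the backend.
    soup = BeautifulSoup(html, "html.parser")
    if soup.body is None:  # the document has no <body> element
        return []
    return [
        a["href"]
        for a in soup.body.find_all("a", href=True)  # only <a> tags that have an href
        if a["href"].startswith("https")
    ]

# Hypothetical stand-in for the source code held in MyObj.
sample_html = """<html><head><title>t</title></head>
<BODY>
<a href="https://example.com/a">secure</a>
<a href="http://example.com/b">plain</a>
</BODY></html>"""

print(https_links_in_body(sample_html))
```

Tag names are matched case-insensitively here, so `<BODY>` and `<body>` are treated the same.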

Jeff

Is there a way to parse an HTML file without using these two libraries? With regex? – george mano Oct 07 '12 at 03:08
@georgemano: Regex is not the right tool to *parse* HTML. – Blender Oct 07 '12 at 03:11
Is there any way to parse it other than using external libraries like `beautifulsoup` and `lxml`? – george mano Oct 07 '12 at 03:16
Not that I'm aware of. Here's an infamous answer on why you should not try to roll your own parser or regexes for this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jeff Oct 07 '12 at 13:00
See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects or http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all – Jeff Oct 07 '12 at 23:35
`xml.dom.minidom` is included in Python – JVE999 Jul 31 '14 at 23:48
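
For readers who want to stay within the standard library, here is a minimal sketch. Note that it uses `html.parser` rather than `xml.dom.minidom`: minidom is an XML parser and expects well-formed markup, which real-world HTML often is not. The class name `HttpsLinkCollector`, the choice of `href`/`src` attributes, and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser

class HttpsLinkCollector(HTMLParser):
    """Collect href/src values starting with 'https' that occur inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase, so <BODY> arrives as "body".
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            for name, value in attrs:
                if name in ("href", "src") and value and value.startswith("https"):
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

# Hypothetical stand-in for the source code held in MyObj.
page = """<html><head><link href="https://example.com/style.css"></head>
<BODY>
<a href="https://example.com/a">A</a>
<a href="http://example.com/b">B</a>
<img src="https://example.com/img.png">
</BODY></html>"""

collector = HttpsLinkCollector()
collector.feed(page)
collector.close()
print(collector.links)
# ['https://example.com/a', 'https://example.com/img.png']
```

The stylesheet link in `<head>` is skipped because it appears before `<body>` opens, which mirrors the if/while structure in the question.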

score 2 · Answer 2 · answered Oct 07 '12 at 02:49

Beautiful Soup is the best HTML parsing library I've used; take a look at it.

wong2

Is there a way to parse an HTML file without using this library? With regex or something? – george mano Oct 07 '12 at 03:08