
I would like to parse a website's source code along these lines:

if something == "<BODY>":
    while something != "</BODY>":
        if something == "https":
            put the word on a list

The problem is that I don't know how to do the parsing, that is, which function to use to read through the source code. I have the source code in an object, e.g. MyObj.

What is the best way to do this?

george mano

2 Answers


Use an HTML parsing library rather than scanning the source by hand. Two popular, well-regarded options are Beautiful Soup and lxml.
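As a rough sketch of what this could look like with Beautiful Soup (assuming the `bs4` package is installed, and that "https words" means `href` values on `<a>` tags inside the body; the function name `https_links_with_bs4` is made up for illustration):

```python
from bs4 import BeautifulSoup


def https_links_with_bs4(html):
    """Return href values starting with "https" found inside <body>."""
    # "html.parser" is Python's built-in backend, so lxml is not required.
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        # No <body> element in the document: nothing to collect.
        return []
    # href=True keeps only <a> tags that actually carry an href attribute.
    return [
        a["href"]
        for a in body.find_all("a", href=True)
        if a["href"].startswith("https")
    ]
```

If the page source is held as a string in `MyObj`, calling `https_links_with_bs4(MyObj)` should give the list directly; other attributes such as `src` would need their own `find_all` call.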

Jeff
  • Is there a way to parse an HTML file without using these two libraries? With regex? – george mano Oct 07 '12 at 03:08
  • @georgemano: Regex is not the right tool to *parse* HTML. – Blender Oct 07 '12 at 03:11
  • Is there any way to parse it other than using external libraries like `beautifulsoup` and `lxml`? – george mano Oct 07 '12 at 03:16
  • Not that I'm aware of. Here's an infamous answer on why you should not try to roll your own parser or regexes for this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jeff Oct 07 '12 at 13:00
  • See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#kinds-of-objects or http://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all – Jeff Oct 07 '12 at 23:35
  • `xml.dom.minidom` is included in Python – JVE999 Jul 31 '14 at 23:48
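
On the question of avoiding external libraries: besides `xml.dom.minidom`, Python's standard library also ships `html.parser`, which tolerates ordinary non-XML HTML. The following is only a minimal sketch of the loop from the question, assuming the goal is to collect `href` and `src` values starting with "https" between `<BODY>` and `</BODY>`; the names `HttpsLinkCollector` and `collect_https_links` are invented for this example.

```python
from html.parser import HTMLParser


class HttpsLinkCollector(HTMLParser):
    """Collect https URLs from href/src attributes inside <body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <BODY> arrives as "body".
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            for name, value in attrs:
                # value is None for attributes written without a value.
                if name in ("href", "src") and value and value.startswith("https"):
                    self.links.append(value)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


def collect_https_links(html):
    parser = HttpsLinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

With the page source in `MyObj` as a string, `collect_https_links(MyObj)` would return the list. This event-based style is more verbose than Beautiful Soup, but it needs nothing beyond the standard library.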

Beautiful Soup is the best HTML parsing library I've used; take a look at it.

wong2