
I'm trying to learn Python by working on a fun project - a Facebook message analyzer. I've downloaded my data off Facebook, which includes a set of html files. One of these - messages.htm - contains all of my messages. My goal is to take this html file and parse it out to output fun data like most common word, # of messages, etc.

The problem is that my messages.htm file is 270MB. I can inspect it fine in vim, but there are interesting patterns in the file and I'd like to see how the HTML is actually rendered in a browser, so I can compare the code with the visuals and get a better sense of what's going on. But when I try to open this file in Firefox, FF crashes. I can open it in Chrome, but it just starts loading all the messages, and after ~10 minutes it still hasn't fully loaded a single message thread, no matter how tiny the scroll bar gets. So this isn't feasible.

Is it even possible to fully render such a large and long HTML file?

ShaneOH
  • 270MB of *source code* will possibly result in several GBs of data structures in RAM. Browsers should not even try. – Álvaro González Jul 06 '15 at 09:01
  • I see - should this dissuade me from parsing it? I did ask an earlier question about it, and I figure it's possible with iterative parsing (http://stackoverflow.com/questions/31225193/parsing-very-large-html-file-with-python-elementtree). I would guess that I have somewhere in the neighborhood of 800k-1M total messages, so this is a lot of data to work with, but surely a feasible overall task? – ShaneOH Jul 06 '15 at 09:06
  • Is it possible for you to filter your file for the individual messages and write them into a database or JSON store? This could make this large amount of data a bit easier to handle. – MiBrock Jul 06 '15 at 09:06
  • @MiBrock, I'm sure it is - this is my first foray into Python so I don't have a set approach yet. Totally open to any suggestions like that; I'd greatly appreciate being pointed in the right direction! – ShaneOH Jul 06 '15 at 09:10
  • Just parsing it to extract data *is* feasible, as long as you don't use a regular in-memory HTML parser (a 270 MB text file itself is not a great deal). Perhaps you could use an XML pull parser *if* the HTML is not invalid. – Álvaro González Jul 06 '15 at 10:18
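A minimal sketch of the streaming approach discussed in the comments above, using Python's built-in html.parser (it is event-driven, so the 270MB file never has to be held in memory as a DOM tree). The structural assumption here, that each message body sits in its own <p> element, is only a guess about the export format; the handlers would need adjusting to whatever messages.htm actually contains:

    from collections import Counter
    from html.parser import HTMLParser

    class MessageStats(HTMLParser):
        """Streams through the HTML, tallying message counts and word frequencies."""
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.in_body = False        # currently inside a <p> (assumed message body)
            self.message_count = 0
            self.word_counts = Counter()

        def handle_starttag(self, tag, attrs):
            if tag == "p":              # assumption: one <p> per message body
                self.in_body = True
                self.message_count += 1

        def handle_endtag(self, tag):
            if tag == "p":
                self.in_body = False

        def handle_data(self, data):
            if self.in_body:
                self.word_counts.update(data.lower().split())

    parser = MessageStats()
    with open("messages.htm", encoding="utf-8") as f:
        # Feed the file in 1 MB chunks so memory use stays flat.
        for chunk in iter(lambda: f.read(1 << 20), ""):
            parser.feed(chunk)
    parser.close()

    print("messages:", parser.message_count)
    print("ten most common words:", parser.word_counts.most_common(10))

If the export wraps each message in something like a div with a distinguishing class, the attrs argument of handle_starttag can be checked to scope the counting more precisely.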

1 Answer


You can use lynx, which is a text-based browser, to view a large HTML file. I have a 139MB HTML file and I was able to view it very easily using lynx. lynx divides the entire document into pages and is able to load any given page very quickly. It also supports hyperlinking, so navigating within the HTML document (which was my use case) worked like a charm.
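
For the file from the question, that would just be (assuming lynx is installed and messages.htm is in the current directory):

    lynx messages.htm

If you only want the rendered text rather than interactive browsing, lynx -dump messages.htm > messages.txt writes the rendered output to a plain-text file that is easy to search or open in vim.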

ignite