
I have tens of thousands of HTML documents saved on my computer, and I need to parse them all with BeautifulSoup, looking for the same consistent set of tags in each document.

Currently I iterate through my folder of HTML files, open each file, parse it, then close it. But the time it takes to open/parse/close each one is too long. I tried saving several HTML documents in one text document and "redoing" the opening and closing HTML tags, but I'm not totally sure how parsing works, so I wasn't sure I could rearrange the document without messing up the parsing process.
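For reference, here is a minimal sketch of the loop I described; the folder name and the choice of parser are just placeholders:

```python
import os
from bs4 import BeautifulSoup

html_folder = "html_folder"  # hypothetical path to the saved documents

for name in os.listdir(html_folder):
    if not name.endswith(".html"):
        continue
    # Open, parse, and implicitly close one document at a time.
    with open(os.path.join(html_folder, name), encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    # ... pull the ~100 items out of soup here ...
```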

Is there any sort of standardized method for doing this? If I could combine as many HTML documents into one text file as possible, I think this portion of the process would go much faster.

EDIT: There are only around 100 individual 'items' that I am looking for in each HTML document, so I can only parse about 100 at a time. It's not that I'm trying to parse through my documents any quicker; instead I want to save as many HTML documents into one text file as possible, in the hope of being able to parse 1000 items at a time, or many more if possible.
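To make that concrete, here is a rough sketch of what I have in mind; the separator string and file names are made up, and each chunk is parsed separately so the tag structure of every original document stays intact:

```python
import glob
from bs4 import BeautifulSoup

SEP = "\n<!-- DOC-BREAK -->\n"  # hypothetical marker between documents

# Combine many saved documents into one text file once.
with open("combined.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("html_folder/*.html"):
        with open(path, encoding="utf-8") as f:
            out.write(f.read())
        out.write(SEP)

# Later, split on the marker and hand each original document to BeautifulSoup.
with open("combined.txt", encoding="utf-8") as f:
    chunks = f.read().split(SEP)
for chunk in chunks:
    if not chunk.strip():
        continue
    soup = BeautifulSoup(chunk, "html.parser")
    # ... parse the ~100 items of this one document ...
```

Splitting on a marker and parsing per chunk avoids having to guess how BeautifulSoup would treat several `<html>` roots concatenated into one string.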

  • Can you clarify what you are asking? If you are complaining that the file handling (open, read, close) is taking too long - you will still have to open each file to combine them, so the time cost will be paid somewhere. But this is not very clear. – PyNEwbie Jan 08 '17 at 22:20
  • Opening and closing is not expensive, so you won't save much time concatenating the files. You may benefit from running multiple processes to do the reading and parsing (the `multiprocessing` module, for instance - but there are good and bad ways to do that!). But eventually you will be limited by the speed of your disk. See the sketch after these comments. – tdelaney Jan 08 '17 at 22:23
  • As far as improving the parsing speed, please see http://stackoverflow.com/questions/25539330/speeding-up-beautifulsoup. It sounds like a good use case for `SoupStrainer` (also covered in the sketch below). – alecxe Jan 08 '17 at 22:24
  • check out my edit! –  Jan 08 '17 at 22:59
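A minimal sketch combining the two comment suggestions above (a `multiprocessing` pool for parallel parsing, plus `SoupStrainer` so BeautifulSoup only builds the tags of interest); the folder, glob pattern, and target tag are assumptions, not from the question:

```python
import glob
from multiprocessing import Pool
from bs4 import BeautifulSoup, SoupStrainer

# Only build the parts of the tree we care about (hypothetical target tag).
only_items = SoupStrainer("div", class_="item")

def parse_file(path):
    """Parse one saved HTML file and return the text of its items."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser", parse_only=only_items)
    return [tag.get_text(strip=True) for tag in soup.find_all("div", class_="item")]

if __name__ == "__main__":
    paths = glob.glob("html_folder/*.html")  # hypothetical folder of documents
    with Pool() as pool:                     # one worker per CPU core by default
        results = pool.map(parse_file, paths)
    print(sum(len(r) for r in results), "items parsed")
```

As far as I know, `parse_only` is ignored by the html5lib parser, so the `SoupStrainer` part only pays off with `html.parser` or `lxml`.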

0 Answers