3

I want to extract the introduction of a Wikipedia article, ignoring everything else (tables, images, and the other sections). I looked at the HTML source of several articles, but I don't see any special tag that wraps this part.

Can anyone give me a quick solution to this? I'm writing Python scripts.

Thanks.

pyfunc
green-i

2 Answers

3
  1. You may want to check mwlib to parse the Wikipedia source (the wiki markup).
  2. Alternatively, use the wikidump lib.
  3. Scrape the rendered HTML with BeautifulSoup.
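
As a rough illustration of the HTML-scraping route (option 3), here is a minimal sketch that uses the standard library's html.parser instead of BeautifulSoup, so it runs without extra dependencies. It rests on an assumption that is not from this answer: that the introduction is every <p> element before the first <h2> heading and outside any table. The names IntroExtractor and extract_intro are hypothetical, and real Wikipedia markup may need extra handling.

```python
from html.parser import HTMLParser


class IntroExtractor(HTMLParser):
    """Collect the text of <p> elements that occur before the first <h2>.

    Assumption (not from the original post): the introduction is every
    paragraph that precedes the first section heading, outside any table.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []    # finished paragraph strings
        self._buf = None        # text chunks of the <p> being read, else None
        self._table_depth = 0   # > 0 while inside (possibly nested) tables
        self._done = False      # set once the first <h2> is reached

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "h2":
            # First section heading: the introduction is over.
            self._done = True
            self._buf = None
        elif tag == "table":
            self._table_depth += 1
        elif tag == "p" and self._table_depth == 0:
            self._buf = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buf is not None:
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._buf is not None and self._table_depth == 0:
            self._buf.append(data)


def extract_intro(html):
    """Return the introduction paragraphs of an HTML page as plain strings."""
    parser = IntroExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


if __name__ == "__main__":
    # Made-up sample page, only for demonstration.
    sample = (
        "<h1>Example</h1>"
        "<table><tr><td><p>Infobox text</p></td></tr></table>"
        "<p><b>Example</b> is a made-up\narticle.</p>"
        "<p>It has two intro paragraphs.</p>"
        "<h2>History</h2><p>Later section.</p>"
    )
    for paragraph in extract_intro(sample):
        print(paragraph)
```

The same idea carries over to BeautifulSoup: drop the tables, then take the paragraphs that come before the first section heading.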

Ah, there are already questions on SO on this topic:

  1. Parsing a Wikipedia dump
  2. How to parse/extract data from a mediawiki marked-up article via python
Community
pyfunc
0

I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first run of consecutive <p>...</p> blocks after the <!-- bodytext --> comment marker. That last step would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

Use the S (DOTALL) flag, re.S in Python, so that . also matches newlines.
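
For what it's worth, a minimal Python sketch of those steps might look like the following. It assumes the page still contains the <!-- bodytext --> marker (worth checking against the actual HTML you receive) and that tables are not nested. The helper name intro_from_html is hypothetical. One adjustment to the pattern: the repeated <p> group is made non-capturing and wrapped in an outer group, because a repeated capturing group only keeps its last repetition.

```python
import re
from html import unescape

# Same pattern as above, with the repetition wrapped in an outer capture
# group so that group(1) holds every consecutive <p>...</p> block.
INTRO_RE = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

# Naive table stripper: assumes tables are not nested.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.S | re.I)

# Matches any remaining tag so it can be dropped from the result.
TAG_RE = re.compile(r"<[^>]+>")


def intro_from_html(html):
    """Return the intro as plain text, or None if the pattern does not match."""
    html = TABLE_RE.sub("", html)       # 1. strip out all the tables
    match = INTRO_RE.search(html)       # 2. first run of <p> blocks after the marker
    if match is None:
        return None
    text = TAG_RE.sub("", match.group(1))   # 3. drop the remaining tags
    return " ".join(unescape(text).split())  # decode entities, tidy whitespace
```

Calling intro_from_html(page_source) on a page without the marker returns None, which is a cheap signal that the markup differs from what this approach expects.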

glenn mcdonald
  • No, but they're an expedient way when the needs are simple. But if you want to provide a specific HTML-library answer that's more helpful than my regex one, go right ahead. – glenn mcdonald Nov 28 '10 at 18:01
  • Well, what's wrong with Beautiful Soup? That would be more expedient than implementing your own ad-hoc parser that's incomplete and riddled with bugs. – Nathan Davis Nov 29 '10 at 04:33
  • I mean, provide an answer that shows how to use Beautiful Soup to get the intro sections out of Wikipedia pages, like the questioner wants. If you're right that it's more expedient, then that should be simple and your answer should be clearly better than mine. – glenn mcdonald Nov 29 '10 at 06:58