Extracting the introduction of a Wikipedia article with Python

Question

I want to extract the introduction of a Wikipedia article (ignoring everything else, including tables, images, and other sections). I looked at the HTML source of the articles, but I don't see any special tag that wraps this part.

Can anyone suggest a quick solution? I'm writing Python scripts.

Thanks.

You probably want to be parsing the wiki markup, not the HTML, for this particular operation. — Nick Bastin, Nov 28 '10 at 02:40
Can you give more details? I'm not familiar with accessing wiki markup. How should I get it? Thanks. — green-i, Nov 28 '10 at 02:44
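
To make the wiki-markup suggestion a bit more concrete, here is a minimal sketch (not something posted in the thread). It assumes you already have the article's raw markup as a string; MediaWiki serves it when a page is requested with action=raw, for example https://en.wikipedia.org/w/index.php?title=Python_(programming_language)&action=raw. It also assumes the introduction is everything before the first level-2 heading such as "== History ==". The names lead_wikitext and SAMPLE_WIKITEXT are made up for illustration.

```python
import re

# Assumption: the first level-2 heading ("== Title ==" on its own line)
# marks the end of the introduction.
_FIRST_HEADING = re.compile(r"^==[^=].*?==\s*$", re.MULTILINE)


def lead_wikitext(wikitext):
    """Return the markup that precedes the first level-2 section heading."""
    match = _FIRST_HEADING.search(wikitext)
    lead = wikitext[:match.start()] if match else wikitext
    return lead.strip()


# Hypothetical sample markup, standing in for a downloaded article.
SAMPLE_WIKITEXT = (
    "'''Python''' is a programming language.\n"
    "\n"
    "It was created by Guido van Rossum.\n"
    "\n"
    "== History ==\n"
    "Python was conceived in the late 1980s.\n"
)

lead = lead_wikitext(SAMPLE_WIKITEXT)
```

The result is still wiki markup (bold quotes, links, templates, possibly an infobox), so a markup parser such as mwlib would still be needed to turn it into plain text.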

score 3 · Answer 1 · edited May 23 '17 at 12:22


You may want to check mwlib to parse the Wikipedia source.
Alternatively, use the wikidump lib.
Or do HTML screen scraping with BeautifulSoup.
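
As a rough illustration of the screen-scraping option, here is a sketch that uses the standard library's html.parser in place of BeautifulSoup, so it runs without extra installs. It assumes the introduction is the set of <p> elements that appear before the first <h2> and outside any <table>; real Wikipedia markup may differ from that. LeadParagraphs, lead_paragraphs and SAMPLE_HTML are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class LeadParagraphs(HTMLParser):
    """Collect the text of <p> elements seen before the first <h2>,
    ignoring anything nested inside a <table>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []    # finished paragraph strings
        self._buffer = None     # text chunks of the open <p>, else None
        self._table_depth = 0   # > 0 while inside one or more tables
        self._done = False      # set once the first <h2> is reached

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif tag == "h2":
            self._done = True
        elif tag == "p" and self._table_depth == 0:
            self._buffer = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and not self._done:
            self._buffer.append(data)


def lead_paragraphs(html):
    """Return the intro paragraphs of an HTML page as a list of strings."""
    parser = LeadParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page fragment, standing in for a downloaded article.
SAMPLE_HTML = """
<h1>Example</h1>
<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>
<p><b>Example</b> is the first paragraph.</p>
<p>Second intro paragraph.</p>
<h2>History</h2>
<p>Not part of the introduction.</p>
"""

intro = lead_paragraphs(SAMPLE_HTML)
```

The same walk (skip tables, collect <p> elements, stop at the first <h2>) should carry over to BeautifulSoup if you prefer that library.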

Ah, there is a question already on SO on this topic:


answered Nov 28 '10 at 02:48

pyfunc


score 0 · Accepted Answer · answered Nov 28 '10 at 03:04


I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last step would be this regex:

/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/

Use it with the S option (re.S, also spelled re.DOTALL, in Python) so that . matches newlines as well.
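
For what it's worth, a Python version of those steps might look like the sketch below. A few assumptions: the <!-- bodytext --> marker comes from this answer and may not be present in pages served today; the table-stripping regex does not handle nested tables; and an extra outer group is added so that the whole run of paragraphs is captured rather than only the last one. intro_html, strip_tags and SAMPLE_PAGE are made-up names.

```python
import re

# Non-greedy table removal; nested tables would not be handled correctly.
TABLE = re.compile(r"<table\b.*?</table>", re.S | re.I)
# The answer's regex, with an outer group around the repeated <p> blocks.
INTRO = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)
# Crude tag stripper, good enough for simple inline markup.
TAG = re.compile(r"<[^>]+>")


def intro_html(page):
    """Return the first run of consecutive <p> blocks after the marker,
    or None if the marker or the paragraphs are not found."""
    without_tables = TABLE.sub("", page)
    match = INTRO.search(without_tables)
    return match.group(1).strip() if match else None


def strip_tags(fragment):
    """Remove HTML tags from a fragment, leaving the text."""
    return TAG.sub("", fragment)


# Hypothetical page, standing in for a downloaded article.
SAMPLE_PAGE = (
    "<html><body><!-- bodytext -->\n"
    '<table class="infobox"><tr><td><p>Infobox text</p></td></tr></table>\n'
    "<p><b>Example</b> is the first paragraph.</p>\n"
    "<p>Second intro paragraph.</p>\n"
    "<h2>History</h2>\n"
    "<p>Not part of the introduction.</p>\n"
    "</body></html>"
)

intro_text = strip_tags(intro_html(SAMPLE_PAGE))
```

Note that the pattern only matches bare <p> tags, as in the original regex; paragraphs with attributes would need <p\b[^>]*> instead.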

answered Nov 28 '10 at 03:04

glenn mcdonald


No, but they're an expedient way when the needs are simple. But if you want to provide a specific HTML-library answer that's more helpful than my regex one, go right ahead. – glenn mcdonald Nov 28 '10 at 18:01
Well, what's wrong with Beautiful Soup? That would be more expedient than implementing your own ad-hoc parser that's incomplete and riddled with bugs. – Nathan Davis Nov 29 '10 at 04:33
I mean, provide an answer that shows how to use Beautiful Soup to get the intro sections out of Wikipedia pages, like the questioner wants. If you're right that it's more expedient, then that should be simple and your answer should be clearly better than mine. – glenn mcdonald Nov 29 '10 at 06:58
