
With the help of joksnet's programs here, I've managed to get the plaintext Wikipedia articles I'm looking for.

The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:

==Biography==

===Early life and education===
blah blah blah

What I'd really like to do is feed the retrieved text to a function and wrap all the top-level section headings in bold HTML tags and the second-level headings in italics, like this:

<b>Biography</b>

<i>Early life and education</i>
blah blah blah

But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions? Any suggestions greatly appreciated.

PS Sorry if "parsing" is too strong a word for what I'm trying to do here.

Alex S
  • Why make yet another parser? Can't you just [get the HTML from the API](https://www.mediawiki.org/wiki/API:Parsing_wikitext) and alter/style the h2 and h3 tags with JavaScript/CSS? – Nemo Nov 02 '15 at 11:04

3 Answers


I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content

which returns the raw wikitext and

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse

which returns the parsed HTML.
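
As an illustration only (not part of svick's answer), fetching the parsed HTML from Python with the requests library might look roughly like this; the exact JSON layout (pages keyed by page id, the HTML under the revision's "*" key) is an assumption based on the older API response format:

import requests

params = {
    "action": "query",
    "prop": "revisions",
    "titles": "Albert Einstein",
    "rvprop": "content",
    "rvparse": "1",   # ask MediaWiki to parse the wikitext into HTML
    "format": "json",
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
page = next(iter(resp.json()["query"]["pages"].values()))
html = page["revisions"][0]["*"]   # parsed HTML of the latest revision
print(html[:200])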

svick
  • Thanks, I may end up doing that, but I've already tried recovering HTML and it gave me unicode encoding errors I didn't know how to fix. Plus converting the HTML to plaintext wasn't that straightforward either. Maybe I'll try making a parser that just goes through the text and replaces every first === with <i> and every second one with </i>, then goes through it again and replaces every == with <b> and every second one with </b>. Problem is, if the count ever gets off it'll break, but I guess it should work in most situations... – Alex S May 28 '13 at 14:45

You can use regexes and scraping modules like Scrapy and BeautifulSoup to parse and scrape wiki pages. Now that you've clarified your question, I suggest you use the py-wikimarkup module hosted on GitHub: https://github.com/dcramer/py-wikimarkup/ . I hope that helps.
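
If you only need those two heading levels, a bare-bones regex version (my own sketch, assuming the headings sit on their own lines) could look like:

import re

def wrap_headings(text):
    # Handle '===Heading===' lines first so the '==' pattern below
    # doesn't also match them.
    text = re.sub(r'^===\s*(.+?)\s*===\s*$', r'<i>\1</i>', text, flags=re.MULTILINE)
    text = re.sub(r'^==\s*(.+?)\s*==\s*$', r'<b>\1</b>', text, flags=re.MULTILINE)
    return text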

Rapture
  • Thanks, but I don't think that's quite what I'm looking for. I have already gotten the pages in almost exactly the format I want them. I just want to replace the `==Heading 1==` and `===Heading 2===` markup with `<b>Heading 1</b>` and `<i>Heading 2</i>`. I don't think BeautifulSoup or Scrapy can help me with that. – Alex S May 28 '13 at 05:49

I ended up doing this:

def parseWikiTitles(x):
    # Replace alternating occurrences of '===' with <i> and </i>.
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===', '<i>', 1)
            counter = 2
        else:
            x = x.replace('===', '</i>', 1)
            counter = 1

    # Then do the same for '==' with <b> and </b>.
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==', '<b>', 1)
            counter = 2
        else:
            x = x.replace('==', '</b>', 1)
            counter = 1

    # Strip the spaces just inside the tags, so '== title ==' ends up as
    # '<b>title</b>' rather than '<b> title </b>'.
    x = x.replace('<b> ', '<b>', 50)
    x = x.replace(' </b>', '</b>', 50)
    x = x.replace('<i> ', '<i>', 50)
    x = x.replace(' </i>', '</i>', 50)

    return x

I pass the string of text with wiki titles to that function and it returns the same text with the == and === markers replaced with bold and italic HTML tags. The last block removes the spaces just inside the headings, so for example == title == gets converted to <b>title</b> instead of <b> title </b>.
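
For example, running it on the snippet from the question (my own quick check, not part of the original answer):

sample = "==Biography==\n\n===Early life and education===\nblah blah blah"
print(parseWikiTitles(sample))
# <b>Biography</b>
#
# <i>Early life and education</i>
# blah blah blah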

Has worked without problem so far.

Thanks for the help guys, Alex

Alex S