
I am trying to develop an Infobox parser in Python that supports all the languages of Wikipedia. The parser will fetch the infobox data and return it in a dictionary.

The keys of the dictionary will be the properties being described (e.g. Population, City name, etc.).

The problem is that Wikipedia has slightly different page contents for each language, and, more importantly, the structure of the API response can also differ between languages.

For example, the API response for 'Paris' in English contains this Infobox:

{{Infobox French commune |name = Paris |commune status = [[Communes of France|Commune]] and [[Departments of France|department]] |image = <imagemap> File:Paris montage.jpg|275px|alt=Paris montage

and in Greek, the corresponding part for 'Παρίσι' is:

[...] {{Πόλη (Γαλλία) | Πόλη = Παρίσι | Έμβλημα =Blason paris 75.svg | Σημαία =Mairie De Paris (SVG).svg | Πλάτος Σημαίας =120px | Εικόνα =Paris - Eiffelturm und Marsfeld2.jpg [...]

In the second example, the word 'Infobox' does not appear after the {{. Also, in the API response, name = Paris is not an exact translation of Πόλη = Παρίσι (Πόλη means city, not name).

Because of such differences between the responses, my code fails.

Here is the code:

import urllib  # Python 2 stdlib (urllib.quote / urllib.urlopen are used below)

import guess_language
from lxml import etree


class WikipediaInfobox():
    # Class to get and parse the Wikipedia Infobox data

    infoboxArrayUnprocessed = []    # Maintains the order in which the data is displayed.
    infoboxDictUnprocessed = {}     # Still contains brackets and wikitext markup. Will be processed further later...
    language = "en"

    def getInfoboxDict(self, infoboxRaw):  # Get the Infobox in dict and array form (unprocessed)
        if infoboxRaw.strip() == "":
            return {}
        boxLines = [line.strip().replace("  ", " ") for line in infoboxRaw.splitlines()]
        wikiObjectType = boxLines[0]
        infoboxData = [line[1:] for line in boxLines[1:]]  # Drop the leading "|" of each parameter line
        toReturn = {"wiki_type": wikiObjectType}
        for i in infoboxData:
            parts = i.split("=", 1)  # Split on the first "=" only, so values may themselves contain "="
            key = parts[0].strip()
            value = parts[1].strip() if len(parts) > 1 else ""
            self.infoboxArrayUnprocessed.append({key: value})
            toReturn[key] = value
        self.infoboxDictUnprocessed = toReturn
        return toReturn

    def getInfoboxRaw(self, pageTitle, followRedirect=False, resetOld=True):  # Get Infobox as raw text
        if resetOld:
            self.infoboxDictUnprocessed = {}
            self.infoboxArrayUnprocessed = []

        params = {"format": "xml", "action": "query", "prop": "revisions", "rvprop": "timestamp|user|comment|content"}
        params["titles"] = "%s" % urllib.quote(pageTitle.encode("utf8"))
        qs = "&".join("%s=%s" % (k, v) for k, v in params.items())
        url = "http://" + self.language + ".wikipedia.org/w/api.php?%s" % qs
        tree = etree.parse(urllib.urlopen(url))
        revs = tree.xpath('//rev')
        if len(revs) == 0:
            return ""
        content = revs[-1].text or ""

        if "#REDIRECT" in content and followRedirect:
            redirectPage = content[content.find("[[") + 2:content.find("]]")]
            return self.getInfoboxRaw(redirectPage, followRedirect, resetOld)
        elif "#REDIRECT" in content and not followRedirect:
            return ""

        infoboxRaw = ""
        if "{{Infobox" in content:    # -> No multi-language support:
            infoboxRaw = content.split("{{Infobox")[1].split("}}")[0]
        return infoboxRaw

    def __init__(self, pageTitle="", followRedirect=False):  # Constructor
        if pageTitle != "":
            self.language = guess_language.guessLanguage(pageTitle)
            if self.language == "UNKNOWN":
                self.language = "en"
            infoboxRaw = self.getInfoboxRaw(pageTitle, followRedirect)
            self.getInfoboxDict(infoboxRaw)  # Now the parsed data is in self.infoboxDictUnprocessed

Some parts of this code were found on this blog...

I don't want to reinvent the wheel, so maybe someone has a nice solution for multi-language support and neat parsing of the Infobox section of Wikipedia.

I have seen many alternatives, like DBpedia and some other parsers that MediaWiki recommends, but I haven't found anything that suits my needs yet. I also want to avoid scraping the page with BeautifulSoup, because it can fail in some cases, but if it is necessary, it will do.

If something isn't clear enough, please ask. I want to help as much as I can.

ant0nisk
  • Have you looked at Wikimedia's project [Wikidata](https://wikidata.org)? It can be [queried with SPARQL](https://query.wikidata.org/) so that you do not need to scrape (a minimal query sketch follows after these comments). If you really want to scrape, I suggest you start by looking at the code used in [harvest_template.py](https://www.mediawiki.org/wiki/Manual:Pywikibot/harvest_template.py), which has code for getting parameters in templates on Wikipedia. – Ainali Dec 05 '15 at 06:16
  • I didn't know about Wikidata. I believe it is amazing and it will be perfect for my project! Thanks a lot! – ant0nisk Dec 05 '15 at 11:06
  • You are reinventing wheels, yes :) See http://stackoverflow.com/q/33862336/323407 – Tgr Dec 06 '15 at 00:53
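
The Wikidata suggestion above sidesteps the language problem, because Wikidata items and properties use language-independent IDs. Here is a minimal sketch of such a query using the requests package; the choice of Q90 (Paris) and P1082 (population), and the requests-based call, are only illustrative assumptions:

import requests

# Wikidata Query Service endpoint; it accepts a SPARQL string in the "query" parameter.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Ask for the population (P1082) of Paris (Q90); the IDs are the same in every language.
query = """
SELECT ?population WHERE {
  wd:Q90 wdt:P1082 ?population .
}
"""

response = requests.get(WDQS_ENDPOINT,
                        params={"query": query, "format": "json"},
                        headers={"User-Agent": "infobox-example/0.1"})
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["population"]["value"])

Because the query runs against item and property IDs rather than localized infobox labels, the same call works no matter which language edition the article comes from.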

1 Answer


Wikidata is definitely the first choice these days if you want to get structured data. However, if in the future you need to parse data from Wikipedia articles, especially as you are using Python, I can recommend mwparserfromhell, a Python library aimed at parsing wikitext that can extract templates and their attributes. That won't directly fix your issue, as the templates in different languages will be different, but it might be useful if you continue trying to parse wikitext.
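
As a rough illustration, here is a minimal sketch of pulling a template's name and parameters out of wikitext with mwparserfromhell (assuming it is installed via pip; the sample string reuses the Greek snippet from the question):

# -*- coding: utf-8 -*-
import mwparserfromhell

# Raw wikitext, e.g. what getInfoboxRaw() fetches from the API.
wikitext = u"{{Πόλη (Γαλλία) | Πόλη = Παρίσι | Έμβλημα = Blason paris 75.svg }}"

parsed = mwparserfromhell.parse(wikitext)
for template in parsed.filter_templates():
    # The template name is returned as-is for any language, e.g. u"Πόλη (Γαλλία)"
    print(template.name.strip_code().strip())
    # Build a dict of parameter name -> value, with wikitext markup stripped.
    infobox = {param.name.strip_code().strip(): param.value.strip_code().strip()
               for param in template.params}
    print(infobox)

The point is that filter_templates() finds the template regardless of whether its name starts with "Infobox", which is exactly where the hard-coded split in getInfoboxRaw() fails.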

Sylvain