
I'm trying to parse some wikitext. Here's an example of the text I need to parse:

== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...

The structure here is not that complicated:
  • title: I believe there's at least one title in the whole document
  • subtopics: optional
  • elements: there has to be at least one per topic/subtopic
  • sub-elements: optional, and can be repeated

If sub-elements are repeated, I intend to join them using \ln.

What I want is to parse this into dictionaries with the following structure:

{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"}

Do you know a Pythonic way, or have any ideas, to parse this into what I want? I would really appreciate your time.

PS. Here's the complete file I'm trying to parse and extract the quotes from: Woody Allen

gglasses
  • possible duplicate of [Parsing a Wikipedia dump](http://stackoverflow.com/questions/3463447/parsing-a-wikipedia-dump) – Mel Aug 21 '15 at 08:41
  • There does not seem to be a list that matches your format on Woody Allen’s Wikipedia page… – poke Oct 11 '15 at 11:23
  • @poke, because that's the format of a Wikiquote page, see my answer. – Nemo Oct 11 '15 at 11:38

1 Answer


You said "quotes" but you linked Wikipedia. Did you mean Wikiquote?

Anyway, you must not parse wikitext yourself. The parse API already does what you are after, and you can access it with a Python client.
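As a minimal sketch of what such a request can look like, here is a version that uses only the Python standard library instead of a dedicated client. The helper names `parse_api_url` and `fetch` and the `User-Agent` string are made up for illustration; the endpoint and the `action`, `page`, `prop` and `section` parameters are the ones that appear in this thread, and `format=json` is the usual way to ask the API for JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint of the English Wikiquote API, as used in the URL in this answer.
API_URL = "https://en.wikiquote.org/w/api.php"


def parse_api_url(page, **params):
    """Build a parse-API URL for `page` (hypothetical helper).

    Extra keyword arguments are passed through as query parameters,
    e.g. prop="sections" or prop="text", section=2.
    """
    query = {"action": "parse", "page": page, "format": "json"}
    query.update(params)
    return API_URL + "?" + urlencode(query)


def fetch(url):
    """Send the GET request and decode the JSON body (needs network access)."""
    # The User-Agent value is a placeholder; use one that identifies your script.
    request = Request(url, headers={"User-Agent": "wikiquote-example/0.1"})
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call fetch(url) yourself to hit the network.
    print(parse_api_url("Woody_Allen", prop="sections"))
```

A dedicated client library would handle things like rate limiting and continuation for you, so treat this only as a way to see which parameters go into the request.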

For instance, here is the list of sections (i.e. quoted works) in his Wikiquote article, as returned by https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections :

{
    "parse": {
        "title": "Woody Allen",
        "pageid": 80,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes",
                "number": "1",
                "index": "1",
                "fromtitle": "Woody_Allen",
                "byteoffset": 657,
                "anchor": "Quotes"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Getting Even</i> (1971)",
                "number": "1.1",
                "index": "2",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11322,
                "anchor": "Getting_Even_.281971.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "<i>My Philosophy</i>",
                "number": "1.1.1",
                "index": "3",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11471,
                "anchor": "My_Philosophy"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
                "number": "1.2",
                "index": "4",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11814,
                "anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Sleeper</i> (1973)",
                "number": "1.3",
                "index": "5",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12364,
                "anchor": "Sleeper_.281973.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Love and Death</i> (1975)",
                "number": "1.4",
                "index": "6",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12858,
                "anchor": "Love_and_Death_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Without Feathers</i> (1975)",
                "number": "1.5",
                "index": "7",
                "fromtitle": "Woody_Allen",
                "byteoffset": 14090,
                "anchor": "Without_Feathers_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Annie Hall</i> (1977)",
                "number": "1.6",
                "index": "8",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16485,
                "anchor": "Annie_Hall_.281977.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Side Effects</i> (1980)",
                "number": "1.7",
                "index": "9",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16899,
                "anchor": "Side_Effects_.281980.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "My Apology",
                "number": "1.7.1",
                "index": "10",
                "fromtitle": "Woody_Allen",
                "byteoffset": 17529,
                "anchor": "My_Apology"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Manhattan Murder Mystery</i> (1993)",
                "number": "1.8",
                "index": "11",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18579,
                "anchor": "Manhattan_Murder_Mystery_.281993.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Don't Drink the Water</i> (1994)",
                "number": "1.9",
                "index": "12",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18960,
                "anchor": "Don.27t_Drink_the_Water_.281994.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Deconstructing Harry</i> (1997)",
                "number": "1.10",
                "index": "13",
                "fromtitle": "Woody_Allen",
                "byteoffset": 19228,
                "anchor": "Deconstructing_Harry_.281997.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Standup Comic</i> (1999)",
                "number": "1.11",
                "index": "14",
                "fromtitle": "Woody_Allen",
                "byteoffset": 21289,
                "anchor": "Standup_Comic_.281999.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Mere Anarchy</i> (2007)",
                "number": "1.12",
                "index": "15",
                "fromtitle": "Woody_Allen",
                "byteoffset": 22463,
                "anchor": "Mere_Anarchy_.282007.29"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Attributed",
                "number": "2",
                "index": "16",
                "fromtitle": "Woody_Allen",
                "byteoffset": 24181,
                "anchor": "Attributed"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Others",
                "number": "3",
                "index": "17",
                "fromtitle": "Woody_Allen",
                "byteoffset": 25045,
                "anchor": "Others"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes about Allen",
                "number": "4",
                "index": "18",
                "fromtitle": "Woody_Allen",
                "byteoffset": 27525,
                "anchor": "Quotes_about_Allen"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "5",
                "index": "19",
                "fromtitle": "Woody_Allen",
                "byteoffset": 29106,
                "anchor": "External_links"
            }
        ]
    }
}
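To get from this flat list to the "title" / "subtopic" pairs asked for in the question, one option is to rebuild the heading hierarchy from `toclevel`. The following is only a sketch: `section_paths` is a hypothetical helper, the regex is a crude tag stripper that is good enough for the `<i>…</i>` markup seen above, and the sample data is a shortened copy of the response.

```python
import re

# Crude tag stripper; assumed sufficient for simple markup like <i>...</i>.
TAG_RE = re.compile(r"<[^>]+>")


def section_paths(sections):
    """Map each section's "index" to its headings, outermost first."""
    stack = []   # headings of the sections that are currently "open"
    paths = {}
    for section in sections:
        level = section["toclevel"]
        heading = TAG_RE.sub("", section["line"])
        del stack[level - 1:]   # close sections at this depth or deeper
        stack.append(heading)
        paths[section["index"]] = list(stack)
    return paths


# Shortened copy of the "sections" list from the response above.
sample = [
    {"toclevel": 1, "line": "Quotes", "index": "1"},
    {"toclevel": 2, "line": "<i>Getting Even</i> (1971)", "index": "2"},
    {"toclevel": 3, "line": "<i>My Philosophy</i>", "index": "3"},
    {"toclevel": 2, "line": "<i>Sleeper</i> (1973)", "index": "5"},
    {"toclevel": 1, "line": "Attributed", "index": "16"},
]

if __name__ == "__main__":
    for index, path in section_paths(sample).items():
        print(index, path)
```

The first element of each path can serve as the "title" and the rest as the "subtopic"; the `index` key is what you would pass as the `section` parameter when requesting that section's content.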
Nemo
  • This will not give you the actual section text, and if you use the parse API for that, you get HTML, which you need to parse too. So you’re just moving the “I need to parse this” problem from wikitext to HTML. – poke Oct 11 '15 at 11:31
  • @poke, the OP never said they need plain text. As for the content, I included only section titles for brevity but I linked the docs which explain how to get the text contained in them with `section` and `prop=text` parameters. – Nemo Oct 11 '15 at 11:40
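
To illustrate the HTML route discussed in these comments: if a section is requested with the `section` and `prop=text` parameters, the quotes come back as nested HTML lists, and the standard library's `html.parser` is enough to turn them into the dictionaries from the question. This is a rough sketch under several assumptions: the HTML string has already been pulled out of the API response, quotes are top-level `<li>` items with their dependent lines in nested `<ul>` lists, and the `\ln` in the question means a newline. The names `QuoteListParser` and `quotes_from_html` are made up.

```python
from html.parser import HTMLParser


class QuoteListParser(HTMLParser):
    """Collect top-level <li> text and the text of <li> items nested under it."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current <ul> nesting depth
        self.entries = []   # one {"main": [...], "subs": [[...], ...]} per quote

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            if self.depth == 1:
                self.entries.append({"main": [], "subs": []})
            elif self.depth >= 2 and self.entries:
                self.entries[-1]["subs"].append([])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if not self.entries or self.depth == 0:
            return  # text outside of any list item is ignored
        entry = self.entries[-1]
        if self.depth == 1:
            entry["main"].append(data)
        elif entry["subs"]:
            entry["subs"][-1].append(data)


def clean(parts):
    """Join text fragments and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


def quotes_from_html(html, title, subtopic):
    """Return one dict per top-level list item, shaped like the question's example."""
    parser = QuoteListParser()
    parser.feed(html)
    parser.close()
    records = []
    for entry in parser.entries:
        subs = [clean(parts) for parts in entry["subs"]]
        records.append({
            "title": title,
            "subtopic": subtopic,
            "main_text": clean(entry["main"]),
            # Assumption: the question's "\ln" separator means a newline.
            "sub_text": "\n".join(s for s in subs if s),
        })
    return records


if __name__ == "__main__":
    # Hand-written HTML mirroring the question's example, not real API output.
    html = ("<ul><li>text_1<ul><li>text dependent on text_1</li>"
            "<li>text_2 dependent on text_1</li></ul></li>"
            "<li>text_2<ul><li>text dependent on text_2</li></ul></li></ul>")
    for record in quotes_from_html(html, "title", "subtopic"):
        print(record)
```

Real pages contain more than lists (templates, images, definition lists), so this would need adjusting against the actual output for the page you care about.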