
I have a Wikipedia XML dump file that has been stripped of all tags and content other than the actual article text. I am trying to automate parsing through the entire dump to extract well-formatted sentences in Python. A sample from the text is:

{{Nihongo|'''''Barefoot Gen'''''|はだしのゲン|Hadashi no Gen}} is a [[Japan]]ese [[manga]] series by [[Keiji Nakazawa]]. Loosely based on Nakazawa's own experiences as a Hiroshima survivor, the series begins in 1945 in and around [[Hiroshima]], [[Japan]], where the six-year-old boy [[Gen Nakaoka]] lives with his family.

This is what I have now:

nonalphanum = "~`!@#$%^&*()_+=-\][|}{;:\"/.,?><"

class sentence:

    # Instantiation function
    def __init__(self, wiki_str):
        self.words = wiki_str.translate(None, nonalphanum).split()
        self.size = len(self.words)
        print(self.words, self.size)

And my output is:

(["Nihongo'''''Barefoot", "Gen'''''\xe3\x81\xaf\xe3\x81\xa0\xe3\x81\x97\xe3\x81\xae\xe3\x82\xb2\xe3\x83\xb3Hadashi", 'no', 'Gen', 'is', 'a', 'Japanese', 'manga', 'series', 'by', 'Keiji', 'Nakazawa', 'Loosely', 'based', 'on', "Nakazawa's", 'own', 'experiences', 'as', 'a', 'Hiroshima', 'survivor', 'the', 'series', 'begins', 'in', '1945', 'in', 'and', 'around', 'Hiroshima', 'Japan', 'where', 'the', 'sixyearold', 'boy', 'Gen', 'Nakaoka', 'lives', 'with', 'his', 'family'], 42)

What I want is:

Nihongo Barefoot Gen Hadashi no Gen is a Japanese manga series by Keiji Nakazawa. Loosely based on Nakazawa's own experiences as a Hiroshima survivor, the series begins in 1945 in and around Hiroshima, Japan, where the six-year-old boy Gen Nakaoka lives with his family.
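To make the transformation concrete, one approach I can think of is a couple of regex passes. This is only a sketch: it handles just the two markup patterns visible in the sample, {{template|...}} calls and [[link]] brackets, and real templates can nest, so a proper parser would probably be safer:

import re

def strip_wiki_markup(text):
    # [[target|display]] -> display, and [[target]] -> target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # {{Name|a|b}} -> "Name a b" (rough: nested templates will break this)
    text = re.sub(r"\{\{([^}]*)\}\}",
                  lambda m: " ".join(m.group(1).split("|")), text)
    # Drop bold/italic quote runs like ''''' while keeping apostrophes
    return re.sub(r"'{2,}", "", text)

On the sample above this produces the desired sentence except that the Japanese はだしのゲン survives, which would need a separate filter.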

Thank you for any help!

  • @Blender, I believe the article you pointed towards is for MediaWiki dumps, which are metadata, whereas this is article XML. – MrWolvwxyz Nov 06 '13 at 05:11
  • The MediaWiki format is still the same for both, so you can parse it with the same parser. – Blender Nov 06 '13 at 05:14
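Following up on Blender's comment, a dedicated MediaWiki parser avoids the nesting problems of hand-rolled regexes. A minimal sketch using the third-party mwparserfromhell library (a suggestion, not something from the original question):

# pip install mwparserfromhell
import mwparserfromhell

raw = "{{Nihongo|'''''Barefoot Gen'''''|はだしのゲン|Hadashi no Gen}} is a [[Japan]]ese [[manga]] series by [[Keiji Nakazawa]]."
print(mwparserfromhell.parse(raw).strip_code())

Note that strip_code() drops templates wholesale by default, so the {{Nihongo|...}} text disappears rather than being flattened; its parameters would have to be extracted separately if they should be kept.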
