I am looking for an algorithm or method that would help identify general phrases in a corpus of text written in a particular dialect (it comes from a specific domain, but for my purposes it is a dialect of English). For example, the following fragment could be from a larger corpus related to World of Warcraft, or perhaps to MMORPGs in general.

players control a character avatar within a game world in third person or first person view, exploring the landscape, fighting various monsters, completing quests, and interacting with non-player characters (NPCs) or other players. Also similar to other MMORPGs, World of Warcraft requires the player to pay for a subscription, either by buying prepaid game cards for a selected amount of playing time, or by using a credit or debit card to pay on a regular basis

As output from the above I would like to identify the following general phrases:

  1. first person
  2. World of Warcraft
  3. prepaid game cards
  4. debit card

Notes:

  1. There are previous questions similar to mine here and here, but for clarification mine has the following differences:

    a. I am trying to use an existing toolkit such as NLTK, OpenNLP, etc.

    b. I am not interested in identifying other Parts of Speech in the sentence

    c. I can use human intervention: the algorithm presents the identified noun phrases to a human expert, who can then confirm or reject the findings. However, we do not have the resources to train a language model on hand-annotated data.

user1172468
  • Out of curiosity, how did you compile the WoW chat corpus? – alvas Sep 09 '13 at 05:47
  • Oh that was just an example - the real target application is for a vertical domain that would be a poor example. – user1172468 Sep 09 '13 at 06:18
  • You might be interested in this thread: http://listserv.linguistlist.org/cgi-bin/wa?A2=ind1309&L=CORPORA&F=&S=&P=23253 – alvas Sep 13 '13 at 09:14

2 Answers

NLTK has built-in part-of-speech tagging that has proven pretty good at handling unknown words. That said, you seem to misunderstand what a noun is, and you should probably solidify your understanding of both parts of speech and your question.

For instance, in "first person", "first" is an adjective. You could automatically assume that adjectives attached to a noun are part of that phrase.
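As a rough illustration of that assumption, the sketch below groups each run of adjectives and nouns that ends in a noun into one candidate phrase. It works on tokens that are already tagged with Penn Treebank tags (the kind `nltk.pos_tag` produces); the function name `group_noun_phrases` and the hand-written tags in the usage note are assumptions made for the example, not output from a real tagger.

```python
def group_noun_phrases(tagged_tokens):
    """Group maximal adjective/noun runs that end in a noun.

    tagged_tokens: list of (word, tag) pairs using Penn Treebank tags,
    e.g. ("first", "JJ"), ("person", "NN").
    Returns multi-word candidate phrases in order of appearance.
    """
    phrases = []
    run = []  # current run of adjective/noun tokens

    def flush():
        # Drop trailing adjectives so that every phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        # Keep only multi-word candidates; single nouns are not phrases here.
        if len(run) >= 2:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
        else:
            flush()  # any other tag ends the current run
    flush()
    return phrases
```

With hand-tagged input such as `[("first", "JJ"), ("person", "NN"), ("view", "NN")]` this yields `first person view` rather than `first person`, so the candidates still need filtering, for instance by the human expert you mention.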

Alternatively, if you're looking to identify general phrases, my suggestion would be to implement a simple Markov chain model and then look for especially high transition probabilities.

If you're looking for a Markov Chain implementation in Python I would point you towards this gist that I wrote up back in the day: https://gist.github.com/Slater-Victoroff/6227656
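For a sense of what "high transition probabilities" means in practice, here is a minimal standard-library sketch (it is not the code from the gist). It estimates first-order transition probabilities P(next word | current word) from bigram counts and keeps the pairs that are both frequent and highly predictable. The function names, the regex tokenizer, and the `min_prob` and `min_count` thresholds are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Very simple lowercase tokenizer; a stand-in for a real one."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def transition_probabilities(tokens):
    """Estimate P(next | current) for every adjacent word pair."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it has a successor.
    source_counts = Counter(tokens[:-1])
    probs = {
        (a, b): count / source_counts[a]
        for (a, b), count in pair_counts.items()
    }
    return probs, pair_counts


def candidate_phrases(text, min_prob=0.5, min_count=2):
    """Return (phrase, probability, count) for strong two-word transitions."""
    tokens = tokenize(text)
    probs, pair_counts = transition_probabilities(tokens)
    candidates = [
        (f"{a} {b}", prob, pair_counts[(a, b)])
        for (a, b), prob in probs.items()
        if prob >= min_prob and pair_counts[(a, b)] >= min_count
    ]
    # Most predictable first, then most frequent, then alphabetical.
    return sorted(candidates, key=lambda c: (-c[1], -c[2], c[0]))
```

Note that this sketch ignores sentence boundaries and only scores two-word pairs; longer phrases such as "World of Warcraft" would need chained pairs or higher-order counts, and rare words will score a probability of 1.0 unless `min_count` filters them out.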

If you want to get much more advanced than that, you're going to quickly descend into dissertation territory. I hope that helps.

P.S. NLTK includes a huge number of pre-annotated corpora that might work for your purposes.

Slater Victoroff

It appears you are trying to do noun phrase extraction. The TextBlob Python library includes two noun phrase extraction implementations out of the box.

The simplest way to get started is to use the default FastNPExtractor, which is based on Shlomi Babluki's algorithm described here.

# Current TextBlob releases use the package name `textblob` (older ones used `text`).
# The extractors need corpora: python -m textblob.download_corpora
from textblob import TextBlob

text = '''
players control a character avatar within a game world in third person or first
person view, exploring the landscape, fighting various monsters, completing quests,
and interacting with non-player characters (NPCs) or other players. Also similar
to other MMORPGs, World of Warcraft requires the player to pay for a
subscription, either by buying prepaid game cards for a selected amount of
playing time, or by using a credit or debit card to pay on a regular basis
'''

# The default extractor is FastNPExtractor
blob = TextBlob(text)
print(blob.noun_phrases)  # ['players control', 'character avatar' ...]

Swapping out for the other implementation (an NLTK-based chunker) is quite easy.

from textblob.np_extractors import ConllExtractor

# Pass the alternative extractor explicitly
blob = TextBlob(text, np_extractor=ConllExtractor())

print(blob.noun_phrases)  # ['character avatar', 'game world' ...]

If neither of these suffices, you can create your own noun phrase extractor class. I recommend looking at the TextBlob np_extractors module source for examples. To gain a better understanding of noun phrase chunking, check out Chapter 7 of the NLTK book.

Steve L