4

I want to extract Cardinal (CD) values associated with units of measurement and store them in a dictionary. For example, if the text contains tokens like "20 kgs", it should extract them and keep them in a dictionary.

Example:

  1. for input text "10-inch fry pan offers superb heat conductivity and distribution", the output dictionary should look like {"dimension": "10-inch"}

  2. for input text "This bucket holds 5 litres of water.", the output should look like {"volume": "5 litres"}

    line = 'This bucket holds 5 litres of water.'
    tokenized = nltk.word_tokenize(line)
    tagged = nltk.pos_tag(tokenized)
    

The above code gives the output:

[('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'), ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]

Is there a way to extract the CD and UOM values from the text?

Alex K.
Vaulstein
    Did you try using `stanford nlp`? – Mazdak Dec 15 '14 at 16:25
    Any POS tagger should label the CDs pretty accurately, and a fixed mapping from units to your label set would probably capture most instances. I tend to shy away from hand-building fixed lexicons, but this seems like an application where you can get pretty good coverage with a simple list. E.g., given your examples 'lit(er|re)(s?)' -> 'volume', 'inch(es)?' -> 'dimension' (or, 'length' perhaps). You'll have to handle some more complex cases, like 'square meters' or 'in^3', and there will be a few ambiguous references (e.g. 'knots' is both a length and a speed). But those should be rare. – AaronD Dec 15 '14 at 18:12
  • @Kasra: Anything specific I should refer to in **stanford nlp** for the solution? – Vaulstein Dec 16 '14 at 08:20
  • @Dork As far as I know, `stanford nlp` is used for specialized tagging, such as extracting named entities. I'm not sure it would help here, but I suggest searching for work related to your problem! – Mazdak Dec 16 '14 at 13:20
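The fixed unit-to-label mapping suggested in the comment above can be sketched with plain regular expressions, without any POS tagging. The patterns and labels below are illustrative assumptions; extend the list for your domain:

```python
import re

# A minimal, hand-built unit lexicon (illustrative; extend as needed).
UNIT_PATTERNS = [
    (r'lit(?:er|re)s?', 'volume'),
    (r'inch(?:es)?', 'dimension'),
    (r'kgs?', 'weight'),
]

def extract_units(text):
    """Return a dict mapping a label to the matched number + unit string."""
    result = {}
    for unit_pattern, label in UNIT_PATTERNS:
        # a number, an optional hyphen or space, then the unit word
        match = re.search(r'(\d+(?:\.\d+)?[- ]?(?:%s))\b' % unit_pattern, text)
        if match:
            result[label] = match.group(1)
    return result

print(extract_units('10-inch fry pan offers superb heat conductivity'))
# {'dimension': '10-inch'}
print(extract_units('This bucket holds 5 litres of water.'))
# {'volume': '5 litres'}
```

As the comment notes, this won't handle compound units ('square meters', 'in^3') or ambiguous ones ('knots'), but it covers the common cases with very little code.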

2 Answers


Not sure how flexible you need the process to be. You can play around with `nltk.RegexpParser` and come up with some good patterns:

import nltk

sentence = 'This bucket holds 5 litres of water.'

parser = nltk.RegexpParser(
    """
    INDICATOR: {<CD><NNS>}
    """)

print(parser.parse(nltk.pos_tag(nltk.word_tokenize(sentence))))

Output:

(S
  This/DT
  bucket/NN
  holds/VBZ
  (INDICATOR 5/CD litres/NNS)
  of/IN
  water/NN
  ./.)

You can also create a corpus and train a chunker.
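To turn the `INDICATOR` chunks into the dictionary the question asks for, you can walk the parse tree's subtrees. The unit-to-label map below is a hypothetical example; the input reuses the tagged output shown in the question, so no tokenizer or tagger models are needed:

```python
import nltk

# Hypothetical unit -> label map for illustration; extend for your domain.
UNIT_LABELS = {'litres': 'volume', 'liters': 'volume', 'inch': 'dimension'}

def chunks_to_dict(tree):
    """Collect CD + unit chunks labelled INDICATOR into a dictionary."""
    found = {}
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'INDICATOR'):
        words = [word for word, tag in subtree.leaves()]
        unit = words[-1].lower()
        found[UNIT_LABELS.get(unit, unit)] = ' '.join(words)
    return found

# Reuse the tagged sentence from the question.
tagged = [('This', 'DT'), ('bucket', 'NN'), ('holds', 'VBZ'), ('5', 'CD'),
          ('litres', 'NNS'), ('of', 'IN'), ('water', 'NN'), ('.', '.')]
parser = nltk.RegexpParser('INDICATOR: {<CD><NNS>}')
print(chunks_to_dict(parser.parse(tagged)))
# {'volume': '5 litres'}
```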

bogs

Hm, not sure if it helps, but I wrote it in JavaScript. Here: http://github.com/redaktor/nlp_compromise

It might be a bit under-documented yet, but the guys are porting it to a 2.0 branch now.

It should be easy to port to Python; see What's different between Python and Javascript regular expressions?

And: did you check Python's NLTK? http://www.nltk.org

sebilasse
  • The JavaScript library is good; I will try experimenting more with it. About the NLTK book: the sample code I tried was part of the book. – Vaulstein Sep 03 '15 at 06:58