Get a whole unicode sentence

Question

I'm trying to parse a sentence like `Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras.` I want to first split the text on periods, then use whatever comes before the colon as a label for the sentence after the colon. Right now I have the following definition:

from pyparsing import *

unicode_printables = u''.join(unichr(c) for c in xrange(65536) 
                                    if not unichr(c).isspace())

def parse_test(text):
    label = Word(alphas)+Suppress(':')
    value = OneOrMore(Word(unicode_printables)|Literal(','))
    group = Group(label.setResultsName('label')+value.setResultsName('value'))
    exp = delimitedList(
        group,
        delim='.'
    )

    return exp.parseString(text)

It kind of works, but it drops the Unicode characters (and anything else that is not in alphanums). I would also like to get the value back as a whole sentence rather than this: 'value': [(([u'Lote', u'Numero', u'1', ',', u'Marcelo', u'T', u'de', u'Alvear', u'500'], {}), 1).

Is there a simple way to tackle this?

I think this [answer][1] by Paul McGuire has what you want. [1]: http://stackoverflow.com/questions/2339386/python-pyparsing-unicode-characters/2340659#2340659 — Ehtesh Choudhury, Oct 06 '11 at 03:48
Great, that'll help me deal with unicodes in my definition, but will not give me the sentence as a whole, right? — tutuca, Oct 06 '11 at 04:45
Yup. That's a step in the right direction, right? So instead of `Word(alphanums)`, you'd call `Word(unicodePrintables)` — Ehtesh Choudhury, Oct 06 '11 at 05:08

score 2 · Accepted Answer · answered Oct 06 '11 at 10:34

To directly answer your question, wrap your value definition with originalTextFor, and this will give you back the string slice that the matching tokens came from, as a single string. You could also add a parse action, like:

value.setParseAction(lambda t : ' '.join(t))

But this would explicitly put a single space between each item, when there might have been no spaces (in the case of a ',' after a word), or more than one space. originalTextFor will give you the exact input substring. But even simpler, if you are just reading everything after the ':', would be to use restOfLine. (Of course, the simplest would be just to use split(':'), but I assume you are specifically asking how to do this with pyparsing.)
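As a rough illustration of the originalTextFor suggestion, here is a minimal sketch. It assumes Python 3 (so chr and range replace the question's unichr and xrange) and a pyparsing release that still accepts the camelCase names used in this thread. Attaching the results name to the Word itself and excluding '.' from the word characters are choices made for this sketch, not part of the question's code.

```python
from pyparsing import (Group, OneOrMore, Suppress, Word, alphas,
                       delimitedList, originalTextFor)

# Python 3 port of the question's character set: every non-whitespace
# code point in the Basic Multilingual Plane.
unicode_printables = ''.join(chr(c) for c in range(65536)
                             if not chr(c).isspace())

# Name the label on the Word itself so it comes back as a plain string.
label = Word(alphas)('label') + Suppress(':')

# '.' is excluded so a word cannot swallow the sentence delimiter.
words = OneOrMore(Word(unicode_printables, excludeChars='.'))

# originalTextFor returns the exact input slice the matched words came from.
value = originalTextFor(words)('value')

group = Group(label + value)
exp = delimitedList(group, delim='.')

text = 'Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras.'
result = exp.parseString(text)

# Each group carries one label and one whole-sentence value.
pairs = [(g['label'], g['value']) for g in result]
print(pairs)
```

Because '.' always ends a value here, input containing abbreviations or decimal numbers would be cut short; that trade-off comes from using '.' as the delimiter.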

A couple of other notes:

xxx.setResultsName('yyy') can be shortened to just xxx('yyy'), improving the readability of your parser definition.
Your definition of value as OneOrMore(Word(unicode_printables) | Literal(',')) has a couple of problems. For one thing, ',' is itself one of the characters in unicode_printables, so a ',' will be absorbed into whatever word it follows and the Literal(',') alternative never gets a chance to match it. The best way to solve this is to use the excludeChars parameter to Word, so that your sentence words do not include commas: OneOrMore(Word(unicode_printables, excludeChars=',') | ','). You can also exclude other punctuation, like ';', '-', etc., just by adding them to the excludeChars string. (I just noticed that you are using '.' as a delimiter for a delimitedList. For this to work, you will have to include '.' as an excluded character too.) Pyparsing is not like a regular expression in this regard: it does not do any lookahead to try to match the next token in the parser if the next character continues to match the current token. That is why you have to do some extra work of your own to avoid reading too much. In general, something as open-ended as OneOrMore(Word(unicode_printables)) is very likely to eat up the entire rest of your input string.
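The difference excludeChars makes can be seen by comparing the two definitions on a short input. This is only a sketch under the same assumptions as the thread's code ported to Python 3; the sample text and the names greedy and careful are made up for illustration.

```python
from pyparsing import OneOrMore, Word

# Python 3 port of the question's character set.
unicode_printables = ''.join(chr(c) for c in range(65536)
                             if not chr(c).isspace())

text = 'Lote 1, Alvear 500. Demanda: otras'

# Open-ended: each Word keeps consuming punctuation, so the comma, the
# period and even the next label are absorbed into the words.
greedy = OneOrMore(Word(unicode_printables))
greedy_tokens = greedy.parseString(text).asList()

# Excluding ',' and '.' ends each word at the punctuation; the comma is
# then matched as its own token and the period stops the repetition.
careful = OneOrMore(Word(unicode_printables, excludeChars=',.') | ',')
careful_tokens = careful.parseString(text).asList()

print(greedy_tokens)
print(careful_tokens)
```

The second parser stops at the first '.', which is what lets a surrounding delimitedList with delim='.' take over from there.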

score 1 · Answer 2 · answered Oct 06 '11 at 02:17

You should look into PyICU which provides access to the rich Unicode text library provided by ICU, including the BreakIterator class that provides a sentence finder.

Mike Sokolov

Interesting, will try to make some test with PyICU, and see if it fits. – tutuca Oct 06 '11 at 02:22
