
OS: Windows 7. Jython 2.7.0 "final release".

for token in sorted_cased.keys():
    freq = sorted_cased[ token ]
    if freq > 1:
        print( 'token |%s| unicode? %s' % ( token, isinstance( token, unicode ), ) )
        if re.search( ur'\p{L}+', token ):
            print( '  # cased token |%s| freq %d' % ( token, freq, ))

sorted_cased is a dict mapping each token to its frequency of occurrence. Here I'm trying to pick out the words (Unicode letters only) which occur with frequency > 1. (NB I was using re.match rather than re.search, but search should detect even one such \p{L} in the token.)

sample output:

token |Management| unicode? True
token |n| unicode? True
token |identifiés| unicode? True
token |décrites| unicode? True
token |agissant| unicode? True
token |tout| unicode? True
token |sociétés| unicode? True

None of the tokens is recognised as containing even a single \p{L} character. I've tried all sorts of permutations: double quotes, adding flags=re.UNICODE, etc.

Later: I have been asked to explain why this cannot be classed as a duplicate of "How to implement \p{L} in python regex". It can, but the answers to that other question do not draw attention to the need to use the regex module (old version? very new version? NB they are different) as opposed to the re module. For the sake of saving the hair follicles and sanity of future people who come up against this one, I request that the present paragraph be allowed to remain, even if the question is closed as a duplicate.

Also, my attempt to install the PyPI regex module (using pip) failed under Jython. It is probably better to use java.util.regex.

mike rodent
  • 2
    Python re module does not support `\p{L}` shorthand Unicode category class. – Wiktor Stribiżew Dec 06 '15 at 12:06
  • 2
    use `regex` module.. – Avinash Raj Dec 06 '15 at 12:07
  • 2
    Thanks to you both! I was flummoxed because there is indeed a python question here http://stackoverflow.com/questions/17595979/how-to-implement-pl-in-python-regex using \p{L} ... and, yes, *regex* (a module I had never heard of!) – mike rodent Dec 06 '15 at 12:10
  • 1
    Another alternative is to restrict `\w` class like `(?![\d_])\w` and use a re.UNICODE flag. [*If UNICODE is set, this `\w` will match the characters \[0-9_\] plus whatever is classified as alphanumeric in the Unicode character properties database.*](http://www.jython.org/docs/library/re.html#regular-expression-syntax) – Wiktor Stribiżew Dec 06 '15 at 12:11
  • @stribizhev thanks for that. I have now installed the regex module. But I'm not sure whether it's the old or the new ... the new appears to be very very new: https://pypi.python.org/pypi/regex. That other question I referenced is presumably using the old one... – mike rodent Dec 06 '15 at 12:20
  • When using regex module, to make `\p{L}` match all Unicode letters, you need to make sure you pass a Unicode pattern string (by default, if no `L` or `U` flags are passed, the pattern encoding is used to detect the mode). *If neither the ASCII, LOCALE nor UNICODE flag is specified, it will default to UNICODE if the regex pattern is a Unicode string and ASCII if it’s a bytestring.* – Wiktor Stribiżew Dec 06 '15 at 12:26
  • Reopened since the suggested PyPI `regex` module in the linked dupe does not install in Jython and there are other Jython-specific solutions to the current problem. – Wiktor Stribiżew Jul 12 '17 at 10:56

1 Answer


If you have access to Java's java.util.regex, the best option is to use its built-in \p{L} class.

Python's re module (including in the Jython dialect) supports neither \p{L} and the other Unicode category classes nor the POSIX character classes.

Another alternative is to restrict the \w class, as in (?![\d_])\w, and use the UNICODE flag. The documentation says: "If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database." This alternative has one flaw: it cannot be used inside a character class.
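
A minimal sketch of the lookahead variant, assuming only the standard `re` module. The names `LETTER` and `has_letter` are illustrative, and the sample tokens are written with escapes to avoid source-encoding issues. It is written to run on Python 2.7 as well as Python 3; it has not been tried under Jython here.

```python
from __future__ import print_function, unicode_literals
import re

# \w restricted by a negative lookahead: a word character that is
# neither a digit nor an underscore, which leaves the letters.
LETTER = re.compile(r'(?![\d_])\w', re.UNICODE)

def has_letter(token):
    """Return True if token contains at least one such character."""
    return LETTER.search(token) is not None

# 'identifi\xe9s' is the token "identifiés" from the question's output.
tokens = ['identifi\xe9s', 'n', '2015', '_', '\xe9']
print([has_letter(t) for t in tokens])
```

Treat this as an approximation of \p{L} rather than an exact equivalent, since it depends on what the `re` implementation counts as alphanumeric.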

Another idea is to use [^\W\d_] (with the re.U flag), which matches any character that is not a non-word character (\W), a digit (\d) or an underscore. It effectively matches any Unicode letter.
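
As a rough illustration, the question's filter could be rewritten with this character class along the following lines. The helper `repeated_words` and the `sample` dict are hypothetical stand-ins for the asker's loop and `sorted_cased`, and the code is a sketch for Python 2.7 and Python 3 that has not been tried under Jython here.

```python
from __future__ import print_function, unicode_literals
import re

# Any run of characters that are not non-word characters, digits or underscores.
LETTERS = re.compile(r'[^\W\d_]+', re.UNICODE)

def repeated_words(freqs):
    """Return, sorted, the tokens with frequency > 1 that contain a letter."""
    return sorted(token for token, freq in freqs.items()
                  if freq > 1 and LETTERS.search(token))

# Hypothetical frequency table standing in for sorted_cased.
sample = {'soci\xe9t\xe9s': 3, 'tout': 2, '42': 5, 'n': 1, '--': 4}
print(len(repeated_words(sample)))
```

Because this is an ordinary character class, further characters can be added to or excluded from it directly, which the lookahead form does not allow inside brackets.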

Wiktor Stribiżew
  • 1
    Thanks... but... ! I have just realised that my attempt to install the Pypi regex module failed. This may be because I am using Jython rather than CPython. Jythonistas probably need to admit "pythonic defeat" at this point and use Java classes instead: java.util.regex – mike rodent Dec 06 '15 at 12:40
  • 1
    Java's regex module has support for all POSIX classes via `\p` syntax, and you can use `\pP` there instead of `\p{P}`. – Wiktor Stribiżew Dec 06 '15 at 12:57
  • Actually, there is no flaw in `(?![\d_])\w` construction, since you can actually include more in the `\w` part and exclude more in the look-ahead part. Nothing prevents you from doing so. – nhahtdh Dec 07 '15 at 03:31