OS: Windows 7. Jython 2.7.0 "final release".
    for token in sorted_cased.keys():
        freq = sorted_cased[ token ]
        if freq > 1:
            print( 'token |%s| unicode? %s' % ( token, isinstance( token, unicode ), ) )
            if re.search( ur'\p{L}+', token ):
                print( ' # cased token |%s| freq %d' % ( token, freq, ))
sorted_cased is a dict showing the frequency of occurrence of tokens. Here I'm trying to pick out the words (Unicode letters only) which occur with frequency > 1. (NB I was using re.match rather than search, but search should detect even 1 such \p{L} in token.)
sample output:
token |Management| unicode? True
token |n| unicode? True
token |identifiés| unicode? True
token |décrites| unicode? True
token |agissant| unicode? True
token |tout| unicode? True
token |sociétés| unicode? True
None of these is recognised as containing even a single \p{L}. I've tried all sorts of permutations: double quotes, adding flags=re.UNICODE, etc.
Later: I have been asked to explain why this cannot be classed as a duplicate of "How to implement \p{L} in python regex". It can, but the answers to that other question do not draw attention to the need to use the regex module (old version? very new version? NB they are different) as opposed to the re module. To save the hair follicles and sanity of future people who come up against this one, I request that the present paragraph be allowed to remain, even if the question is closed as a duplicate.
Also, my attempt to install the PyPI regex module with pip failed under Jython. It is probably better to use java.util.regex.
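For anyone landing here: the Python 2 re module has no \p{...} support (it reads \p as a plain "p"), and the PyPI regex module is built around a C extension, which is most likely why pip could not install it under Jython. Below is a minimal sketch of a workaround that avoids regular expressions altogether by checking Unicode categories. It assumes the unicodedata module is available in your Jython build; contains_letter is a made-up helper name, and the sample dict only stands in for the real sorted_cased.

```python
import unicodedata


def contains_letter(token):
    """Return True if token has at least one character in a Unicode L* category.

    This is roughly what re.search(ur'\\p{L}', token) was meant to test.
    """
    return any(unicodedata.category(ch).startswith('L') for ch in token)


# Hypothetical sample data standing in for the sorted_cased dict in the question.
sorted_cased = {u'Management': 3, u'identifi\xe9s': 2, u'2015': 4, u'tout': 1}

# Keep tokens that occur more than once and contain at least one letter.
cased = [(token, freq)
         for token, freq in sorted(sorted_cased.items())
         if freq > 1 and contains_letter(token)]
```

Under Jython the other route is the one suggested above: java.util.regex.Pattern does understand \p{L}, so compiling the pattern there and calling find() on the matcher should do the same job, though I have not shown that here.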