I was trying to match all the strings that contain one word in any language. My search led me to \p{...}, which is absent from Python's re module. But I found https://pypi.python.org/pypi/regex, which should support \p{...}. For me, though, it doesn't.

I tried parsing these lines:

7652167371  apéritif
78687   attaché
78687   époque
78678   kunngjøre
78678   ærbødig
7687    vår
12312   dfsdf
23123   322432
1321    23123
2312    привер
32211   оипвыола

With:

import sys

import regex


def Pattern_compile(pattern_array):
    regexes = [regex.compile(p) for p in pattern_array]
    return regexes

def main():
    for line in sys.stdin:
        for regexp in Pattern_compile(p_a):
            if regexp.search (line):
                print line.strip('\n')

if __name__ == '__main__':
    p_a = ['^\d+\t(\p{L}|\p{M})+$', ]
    main()

The result is only the Latin-character word:

12312   dfsdf
antonavy
  • At a quick glance, you don't pass your regex parameter to the main function. Try def main(p_a): and, in the last line, main(p_a) – lucasg Jul 11 '13 at 14:27
  • But if p_a is empty, everything should match - as everything matches the empty regex. – FrankieTheKneeMan Jul 11 '13 at 14:34
  • I usually use re2 from Google. It's more powerful, although I don't know if it covers what you need. [Re2](https://pypi.python.org/pypi/re2/) – PepperoniPizza Jul 11 '13 at 14:35
  • Thanks Pizza, maybe I'll give it a try. And about main(p_a): I'm new to Python, but isn't p_a a global variable that should work in both cases? – antonavy Jul 12 '13 at 16:25

1 Answer

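You should pass unicode for both the regular expression and the string. In Python 2, lines read from sys.stdin are byte strings, so they need to be decoded before matching.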

import sys

import regex


def main(patterns):
    patterns = [regex.compile(p) for p in patterns]
    for line in sys.stdin:
        line = line.decode('utf8')
        for regexp in patterns:
            if regexp.search (line):
                print line.strip('\n')

if __name__ == '__main__':
    main([ur'^\d+\t(\p{L}|\p{M})+$', ])
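To see why the decode step matters, here is a small stdlib-only sketch. It is written for Python 3 rather than the Python 2.7 of the question, and the sample word and variable names are only for illustration. It shows that the accented letter is two non-ASCII bytes before decoding and a single letter code point afterwards, which is what a property like \p{L} is defined on.

```python
import unicodedata

# Hypothetical sample: one word from the question's input.
raw = 'attaché'.encode('utf8')   # bytes, like a Python 2 `str` read from stdin
text = raw.decode('utf8')        # text, like the result of line.decode('utf8')

# As bytes, the final letter is two separate bytes, neither an ASCII letter.
last_bytes = raw[-2:]

# As text, it is a single code point whose general category is a letter.
last_char_category = unicodedata.category(text[-1])   # 'Ll' = lowercase letter

print(len(raw), len(text), last_bytes, last_char_category)
```

The byte string is one element longer than the decoded text, because the last letter takes two bytes in UTF-8.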
falsetru
  • Thanks. That really helped. I had always written just r"" in regexes before. And I guess I should always check with decode('utf-8'), because it's often a problem. I did have to use line1 = line.decode('utf8'), though; otherwise a UnicodeEncodeError was thrown. – antonavy Jul 12 '13 at 16:20
  • @Derlin, the question is tagged `python-2.7`. BTW, you'd better comment on the OP's question so that the OP notices. – falsetru Apr 05 '18 at 04:14
  • How should you make this work on Python 3? I get "bad escape \p" if I use just the `r` prefix, and a SyntaxError if I use `ur` – srcolinas Feb 12 '21 at 22:02
  • @srcolinas, `r'...'` and `ur'...'` ([Python 3.3+ supports the u'unicode' syntax](https://docs.python.org/3/whatsnew/3.3.html)) should both work. BTW, make sure you're using `regex` (not `re`). – falsetru Feb 13 '21 at 00:18
  • Thanks! I was using `re`. In any case, `ur' '` is not valid syntax in Python 3.7 – srcolinas Feb 13 '21 at 18:35
  • Oh, you're right. https://www.python.org/dev/peps/pep-0414/ – falsetru Feb 13 '21 at 22:11
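
For the Python 3 case raised in the comments, here is a minimal sketch, assuming the third-party `regex` package is installed. The pattern is a plain raw string, since a Python 3 `str` is already Unicode and no decode step is needed for text input. The names `is_word_line` and `SAMPLE` are hypothetical. The stdlib fallback is my own approximation, not part of the answer, and is only there so the sketch still runs without `regex`.

```python
import unicodedata

try:
    import regex  # third-party package from PyPI; supports \p{...}
except ImportError:
    regex = None  # fall back to the stdlib approximation below

# Python 3: a plain raw string is enough; the `ur` prefix is a SyntaxError.
PATTERN = r'^\d+\t(\p{L}|\p{M})+$'


def is_word_line(line):
    """True if line is digits, a tab, then only letters and combining marks."""
    if regex is not None:
        return regex.search(PATTERN, line) is not None
    # Approximation without `regex`: check Unicode general categories by hand
    # (Nd = decimal digit, L* = letters, M* = marks).
    number, tab, word = line.rstrip('\n').partition('\t')
    return (
        bool(number) and bool(tab) and bool(word)
        and all(unicodedata.category(c) == 'Nd' for c in number)
        and all(unicodedata.category(c)[0] in 'LM' for c in word)
    )


# A few lines from the question, with the tab written out explicitly.
SAMPLE = [
    '7652167371\tapéritif\n',
    '78678\tkunngjøre\n',
    '12312\tdfsdf\n',
    '23123\t322432\n',
    '2312\tпривер\n',
]

for line in SAMPLE:
    if is_word_line(line):
        print(line.strip('\n'))
```

With these sample lines, every line whose second field is a word is printed, and the all-digit line is skipped.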