Python Regex for Greek words

Question

I want to write a Python script that uses regular expressions to pick out the lines of a source text that contain certain Greek words, and then write those lines to 3 different files depending on which words were found.

Here is my code so far:

import regex

source=open('source.txt', 'r')
oti=open('results_oti.txt', 'w')
tis=open('results_tis.txt', 'w')
ton=open('results_ton.txt', 'w')

regex_oti='^.*\b(ότι|ό,τι)\b.*$'
regex_tis='^.*\b(της|τις)\b.*$'
regex_ton='^.*\b(τον|των)\b.*$'

for line in source.readlines():
    if regex.match(regex_oti, line):
        oti.write(line)
    if regex.match(regex_tis, line):
        tis.write(line)
    if regex.match(regex_ton, line):
        ton.write(line)
source.close()
oti.close()
tis.close()
ton.close()
quit()

The words that I check for are ότι | ό,τι | της | τις | τον | των.

The problem is that those 3 regular expressions (regex_oti, regex_tis, regex_ton) do not match anything so the 3 text files I created do not contain anything.

Maybe it's an encoding problem (Unicode)?

You have to use raw unicode strings in regexp along with re.U flag in order to make it work. — alko, Nov 13 '13 at 21:27
@alko: the `re.U` flag is implied when using a `unicode` string value for the regular expression. — Martijn Pieters, Nov 13 '13 at 21:38
Tongue-in-cheek answer: use Python 3.x to avoid Unicode problems. — rlms, Nov 14 '13 at 19:00

score 1 · Accepted Answer · answered Nov 13 '13 at 21:38

You are trying to match encoded values (bytes) with a regular expression. That most likely won't match unless your Python source encoding exactly matches that of the input files, and even then only if you are not using a multi-byte encoding such as UTF-8.
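
To see why byte matching fails, here is a minimal sketch. It assumes Python 3, where a bytes pattern behaves like a Python 2 byte string in this respect: `\b` and `\w` only recognise ASCII word characters, so the bytes that encode a Greek letter never form a word boundary. The sample line is made up for illustration.

```python
import re

line = 'και της είπε'            # made-up sample line
pattern = r'\b(της|τις)\b'

# Text pattern on decoded text: Greek letters count as word characters.
text_match = re.search(pattern, line)

# Same pattern and line as UTF-8 bytes: every Greek letter is a run of
# non-ASCII bytes, which \b does not treat as word characters.
bytes_match = re.search(pattern.encode('utf-8'), line.encode('utf-8'))

print(text_match is not None)   # True
print(bytes_match is not None)  # False
```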

You need to decode the input files to Unicode values, and use a Unicode regular expression. This means you need to know the codecs used for the input files. It's easiest to use io.open() to handle decoding and encoding:

import io
import re

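# ur'...' is a raw unicode literal (Python 2 syntax): the pattern is Unicode
# text, and the backslash in \b reaches the regex engine unchanged.
regex_oti = re.compile(ur'^.*\b(ότι|ό,τι)\b.*$')
regex_tis = re.compile(ur'^.*\b(της|τις)\b.*$')
regex_ton = re.compile(ur'^.*\b(τον|των)\b.*$')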

with io.open('source.txt', 'r', encoding='utf8') as source, \
     io.open('results_oti.txt', 'w', encoding='utf8') as oti, \
     io.open('results_tis.txt', 'w', encoding='utf8') as tis, \
     io.open('results_ton.txt', 'w', encoding='utf8') as ton:

    for line in source:
        if regex_oti.match(line):
            oti.write(line)
        if regex_tis.match(line):
            tis.write(line)
        if regex_ton.match(line):
            ton.write(line)

Note the ur'...' raw unicode strings to define the regular expression patterns; now these are Unicode patterns and match codepoints, not bytes.
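
The raw prefix matters independently of Unicode: in an ordinary string literal Python turns `\b` into a backspace character before the regular expression engine sees it, which is what happens to the non-raw patterns in the question. A small sketch, assuming Python 3 (where every `str` is Unicode, so no `u` prefix is needed) and a made-up sample line:

```python
import re

line = 'και της είπε'          # made-up sample line

plain = '\b(της|τις)\b'        # '\b' is the backspace character here
raw = r'\b(της|τις)\b'         # r'...' keeps the backslash, so \b is a word boundary

print(plain[0] == '\x08')               # True
print(re.search(plain, line) is None)   # True: the line contains no backspaces
print(re.search(raw, line).group(1))    # της
```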

The io.open() call makes sure you read unicode, and when you write unicode values to the output files the data is automatically encoded to UTF-8. I picked UTF-8 for the input file as well, but you need to check what the correct codec is for that file and stick to that.

I've used a with statement here to have the files close automatically, used source as an iterable (no need to read all lines into memory in one go), and pre-compiled the regular expressions.
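
For readers on Python 3: the `ur'...'` prefix is a syntax error there, a plain `r'...'` literal is already a Unicode pattern, and the built-in `open()` accepts `encoding`. A rough port under those assumptions follows. The helper names `matching_keys` and `split_file` are invented for this sketch, it uses `search()` instead of anchoring with `^.*` and `.*$`, and the `'utf8'` default is only a placeholder for whatever codec the source file really uses.

```python
import re

# In Python 3 a str pattern is a Unicode pattern.
PATTERNS = {
    'oti': re.compile(r'\b(ότι|ό,τι)\b'),
    'tis': re.compile(r'\b(της|τις)\b'),
    'ton': re.compile(r'\b(τον|των)\b'),
}


def matching_keys(line):
    """Return the keys of all patterns found anywhere in the line."""
    return {key for key, rx in PATTERNS.items() if rx.search(line)}


def split_file(source_path, encoding='utf8'):
    """Copy each matching line to results_<key>.txt, one file per pattern."""
    outputs = {key: open('results_%s.txt' % key, 'w', encoding='utf8')
               for key in PATTERNS}
    try:
        with open(source_path, 'r', encoding=encoding) as source:
            for line in source:
                for key in matching_keys(line):
                    outputs[key].write(line)
    finally:
        # Close every output file even if reading or decoding fails.
        for out in outputs.values():
            out.close()
```

With a Greek Windows file this would be called as `split_file('source.txt', encoding='cp1253')`.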

I don't think this will work... it throws an error that says: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 4: invalid continuation byte`. My source file generally has many symbols and stuff... here is the first line of it: `CON[Και] VP[έχουν περάσει] NP[*πολλοί] CON[και] NP[*« χαρισματικοί *» *.]` (don't bother checking it out, it is meant for a specific machine learning use) — Ioannis Petridis, Nov 14 '13 at 00:30
Then your input is not UTF8. Pick the right codec for the file. — Martijn Pieters, Nov 14 '13 at 00:32
How do I know which codec is right? Sorry for the trouble... I am new to all this :) And something else: do the encodings of the source file and the 3 output files have to be the same? — Ioannis Petridis, Nov 14 '13 at 00:43
No, the output encoding can be different. Sometimes, working out the *input* codec requires trial and error. What codec are you using when viewing the file? — Martijn Pieters, Nov 14 '13 at 00:58
"ANSI" is [Windows codepage 1252](http://stackoverflow.com/questions/701882/what-is-ansi-format); use `'cp1252'` in Python. Or, at least, that is what it *usually* means. In the console, run `chcp`, it'll print the current codepage. If it says `1253` (Greek codepage) then use `cp1253` obviously. — Martijn Pieters, Nov 14 '13 at 08:18
HA ! Finally figured it out... Turns out i had to use encoding "Windows-1253" .... u never know where u get stuck with some things :D :D kudos to you mate @Martijn Pieters — Ioannis Petridis, Nov 14 '13 at 14:14
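
The trial and error described in these comments can be partly automated. The following is only a heuristic sketch in Python 3: the function name `guess_codec` and the candidate list are assumptions, and it simply returns the first candidate that decodes the bytes without error and yields at least one Greek letter. A wrong single-byte codec can still decode without raising, so the result should be checked by eye. For reference, byte 0xca (the one in the error message, at position 4) is Κ in Windows-1253.

```python
def guess_codec(data, candidates=('utf-8', 'cp1253', 'iso8859_7', 'cp1252')):
    """Return the first candidate codec that decodes data to text containing Greek."""
    for codec in candidates:
        try:
            text = data.decode(codec)
        except UnicodeDecodeError:
            continue
        # Greek and Coptic block: U+0370 to U+03FF
        if any('\u0370' <= ch <= '\u03ff' for ch in text):
            return codec
    return None


# Start of the asker's first line, encoded the way the file turned out to be.
sample = 'CON[Και] VP[έχουν περάσει]'.encode('cp1253')
print(guess_codec(sample))  # cp1253
```

For a real file, the bytes would come from something like `open('source.txt', 'rb').read()`.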
