Mr. Wiktor, this question is by no means a duplicate of the question you cited when marking it as one. That question asks what the counterpart of preg_match is in Python. Even in the TITLE ITSELF I mention re.search, which was the answer in the thread you linked. I'm aware of re.search. My question is SPECIFICALLY how I can use the third argument of re.search the way its counterpart in PHP is used in the example I provided. Mr. Wiktor, I respectfully request that you unmark my question as a duplicate. Thank you in advance, sir.

What I'm trying to do is stemming (NLP) for the Greek language in Python. The PHP code is this:

protected static $step1list = array(
    "φαγια"=>"φα",
    "φαγιου"=>"φα",
    "φαγιων"=>"φα",
    "σκαγια"=>"σκα",
    "σκαγιου"=>"σκα",
    "σκαγιων"=>"σκα",
    "ολογιου"=>"ολο",
    "ολογια"=>"ολο",
    "ολογιων"=>"ολο",
    "σογιου"=>"σο",
    "σογια"=>"σο",
    "σογιων"=>"σο",
    "τατογια"=>"τατο",
    "τατογιου"=>"τατο",
    "τατογιων"=>"τατο",
    "κρεασ"=>"κρε",
    "κρεατοσ"=>"κρε",
    "κρεατα"=>"κρε",
    "κρεατων"=>"κρε",
    "περασ"=>"περ",
    "περατοσ"=>"περ",
    "περατα"=>"περ",
    "περατων"=>"περ",
    "τερασ"=>"τερ",
    "τερατοσ"=>"τερ",
    "τερατα"=>"τερ",
    "τερατων"=>"τερ",
    "φωσ"=>"φω",
    "φωτοσ"=>"φω",
    "φωτα"=>"φω",
    "φωτων"=>"φω",
    "καθεστωσ"=>"καθεστ",
    "καθεστωτοσ"=>"καθεστ",
    "καθεστωτα"=>"καθεστ",
    "καθεστωτων"=>"καθεστ",
    "γεγονοσ"=>"γεγον",
    "γεγονοτοσ"=>"γεγον",
    "γεγονοτα"=>"γεγον",
    "γεγονοτων"=>"γεγον"
);
protected static $step1regexp="/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$/u";

$w;
$stem="";
$suffix="";
$firstch="";

if (preg_match($step1regexp, $w, $fp)) {
    $stem = $fp[1];
    $suffix = $fp[2];
    $w = $stem.$step1list[$suffix];
}

The latest thing I've tried is this (the lists don't really contain "blah"; they're the same as in the PHP version):

import re

step1list = {
    u"φαγια": u"φα",
    blah blah blah blah
    }

stem = ""
suffix=""
firstch=""

s = u"σογια"
reg = re.compile(r'/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$');
m = reg.search(s)
if m:
    stem = m.group(1);
    suffix = m.group(2);
    s = "{0}{1}".format(stem, step1list[suffix])
print(s)
print(stem)
print(suffix)

What I get as a result is:

σογια

(with two blank lines after it), which means that the two groups are not successfully identified :(

How do I mend this?

N1h1l1sT
  • Just try to get a match with `re.search`, then check if the match object is not None, then access the values in the match object. – Wiktor Stribiżew May 15 '16 at 17:59
  • How do you mean that exactly? "re.search(re, w, fp)" returns a Boolean as far as I know, so I can't do "Object = re.search(re, w, fp)" because all I'll get is true or false. Could you please explain further what you mean? – N1h1l1sT May 15 '16 at 18:04
  • Python re.search does not have an overload that would accept a list argument passed by reference. [`re.search`](https://docs.python.org/2/library/re.html#re.search) returns the match data object or None. – Wiktor Stribiżew May 15 '16 at 18:08
  • See [this Python demo](http://ideone.com/A6rfup). I hope that helps you understand how this is done in Python. You use `fp` as some variable that you think will be a list - no, it is a *flag*, like the `/s` (dotall) or `/i` (case insensitive) modifiers in PHP. – Wiktor Stribiżew May 15 '16 at 18:12
  • Prefix your regular expression string with `u`, as you did for `step1list` (i.e., `re.compile(u'...')`). While your search string is Unicode, your regex is a byte string (the Greek characters are likely UTF-8 encoded). – Uyghur Lives Matter May 16 '16 at 00:09
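
A minimal sketch of the point made in the comments above, assuming Python 3: the third positional argument of `re.search` is `flags`, not an output array, and the captured groups are read from the returned match object. The pattern is shortened for illustration, and `fp` is only a hypothetical name that mirrors the PHP variable. Note that the PHP `/` delimiters and the `/u` modifier are not part of the Python pattern; a leading `/` would be treated as a literal character to match.

```python
import re

# Shortened illustrative pattern: no PHP-style "/" delimiters, Unicode literal.
pattern = u"(.*)(σογιου|σογια|σογιων)$"
w = u"σογια"

# PHP: preg_match($step1regexp, $w, $fp) fills $fp and returns 1 or 0.
# Python: re.search returns a match object (or None); the third argument is flags.
m = re.search(pattern, w, re.UNICODE)

if m is not None:
    # m.group(0) is the whole match, comparable to $fp[0] in PHP.
    fp = [m.group(0), m.group(1), m.group(2)]
else:
    fp = []
```

With this input, `fp[1]` is the (empty) stem and `fp[2]` is the matched suffix, which is what the PHP code reads from `$fp[1]` and `$fp[2]`.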

1 Answer

From the docs (also see match vs. search):

import re

regex = r'goes'  # placeholder pattern; substitute your own
p = re.compile(regex)
m = p.search('string goes here')  # use p.match() to match at the start of the string only
if m:
    print('Match found: ' + m.group())  # m.group(1), ..., m.group(n) for capture groups
else:
    print('No match')
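
To illustrate the match vs. search distinction mentioned above, here is a small sketch, assuming Python 3. The word `ολοσογια` is a made-up input chosen only so that the suffix does not sit at the start of the string.

```python
import re

p = re.compile(u"(σογια)$")
word = u"ολοσογια"  # hypothetical input, used only for illustration

found_by_search = p.search(word)  # scans the string for the first position that matches
found_by_match = p.match(word)    # only tries to match at position 0

# search locates the suffix; match returns None because the word
# does not start with the pattern.
suffix = found_by_search.group(1) if found_by_search else None
```

For a suffix-stripping pattern that begins with `(.*)`, either method would work, since `.*` can absorb everything before the suffix; the difference only matters for patterns that are not anchored that way.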
Scott Weaver
  • I've been trying to make it work with this for the past hour but I can't. I understand what it does, but maybe the regex I use has a problem itself? Let me give you more info and maybe you could help me. – N1h1l1sT May 15 '16 at 19:27
  • I've updated the original post to reflect the changes; if you and Mr. @Wiktor-Stribiżew could take another look I'd appreciate it a lot. – N1h1l1sT May 15 '16 at 19:58
  • Can you provide a regex101.com link that illustrates the problem? – Scott Weaver May 15 '16 at 20:07
  • Unfortunately I don't know much about regular expressions, but the PHP ones I'm using are certainly correct (for PHP at least?), because this procedure works in PHP. I just realised that in Python `if m` evaluates to false, because execution DOESN'T go inside the if. In PHP, with the exact same word "σογια", it does, and when it executes, s becomes "σογια", stem becomes "", and suffix becomes "σογια". – N1h1l1sT May 15 '16 at 20:48
  • regex101.com will allow you to pick either PCRE (PHP) or Python style regex. – Scott Weaver May 15 '16 at 20:51
  • Okay, here's what works correctly (PHP): https://regex101.com/r/uW7fB6/1 And here it is in Python: https://regex101.com/r/fG0pX7/1 It doesn't work :S – N1h1l1sT May 15 '16 at 20:57
  • Maybe `p = re.compile(r'regex', re.U)` (since your regex is Unicode)? – Scott Weaver May 15 '16 at 21:02
  • Bloody hell! I figured it out, it seems; I got it to work. Thank you SO MUCH for your help, I wouldn't have done it without your suggestion or regex101. What a brilliant site! Thanks for letting me know!! Edit: what I did was: reg = re.compile(u"(.*)(φαγια|blahblah)$"); – N1h1l1sT May 15 '16 at 21:05
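
For reference, the fix described in the last comment can be sketched as a complete port of the PHP step, assuming Python 3. This is not the asker's exact code: the table is abbreviated, `step1` is a hypothetical helper name, and building the alternation from the dictionary keys is a convenience swapped in here so that the regex and the table cannot drift apart (the PHP original hard-codes both).

```python
import re

# Abbreviated copy of the PHP $step1list; the full table is in the question.
step1list = {
    u"σογιου": u"σο",
    u"σογια": u"σο",
    u"σογιων": u"σο",
    u"φωσ": u"φω",
    u"φωτοσ": u"φω",
    u"φωτα": u"φω",
    u"φωτων": u"φω",
}

# Same shape as the PHP pattern, minus the "/" delimiters and the "/u" modifier.
# The keys are plain letters, so no escaping is applied here.
step1regexp = re.compile(u"(.*)(" + u"|".join(step1list) + u")$")


def step1(w):
    """Replace a known suffix with its stem form (hypothetical helper)."""
    m = step1regexp.search(w)
    if m:
        stem, suffix = m.group(1), m.group(2)
        return stem + step1list[suffix]
    return w  # no known suffix: leave the word unchanged


result = step1(u"σογια")  # stem is "", suffix is "σογια", so result is "σο"
```

Because the pattern is anchored with `$`, only an alternative that reaches the end of the word can match, so the order of the keys in the alternation does not affect the result.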