Mr. Wiktor, this question is by no means a duplicate of the question you cited when marking it as one. That question asks what the counterpart of preg_match is in Python. Even in the TITLE ITSELF I mention re.search, which was the answer in the thread you pointed to. I'm aware of re.search. My question is SPECIFICALLY about how I can use the third argument of re.search the way the third argument of its PHP counterpart is used in the example I provided. I respectfully request that you unmark my thread as a duplicate. Thank you in advance, sir.
What I'm trying to do is stemming (NLP) for the Greek language in Python. The PHP code is this:
protected static $step1list = array(
"φαγια"=>"φα",
"φαγιου"=>"φα",
"φαγιων"=>"φα",
"σκαγια"=>"σκα",
"σκαγιου"=>"σκα",
"σκαγιων"=>"σκα",
"ολογιου"=>"ολο",
"ολογια"=>"ολο",
"ολογιων"=>"ολο",
"σογιου"=>"σο",
"σογια"=>"σο",
"σογιων"=>"σο",
"τατογια"=>"τατο",
"τατογιου"=>"τατο",
"τατογιων"=>"τατο",
"κρεασ"=>"κρε",
"κρεατοσ"=>"κρε",
"κρεατα"=>"κρε",
"κρεατων"=>"κρε",
"περασ"=>"περ",
"περατοσ"=>"περ",
"περατα"=>"περ",
"περατων"=>"περ",
"τερασ"=>"τερ",
"τερατοσ"=>"τερ",
"τερατα"=>"τερ",
"τερατων"=>"τερ",
"φωσ"=>"φω",
"φωτοσ"=>"φω",
"φωτα"=>"φω",
"φωτων"=>"φω",
"καθεστωσ"=>"καθεστ",
"καθεστωτοσ"=>"καθεστ",
"καθεστωτα"=>"καθεστ",
"καθεστωτων"=>"καθεστ",
"γεγονοσ"=>"γεγον",
"γεγονοτοσ"=>"γεγον",
"γεγονοτα"=>"γεγον",
"γεγονοτων"=>"γεγον"
);
protected static $step1regexp="/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$/u";
$w;
$stem="";
$suffix="";
$firstch="";
if (preg_match($step1regexp, $w, $fp)) {
    $stem = $fp[1];
    $suffix = $fp[2];
    $w = $stem.$step1list[$suffix];
}
The latest thing I've tried is this (I don't really have "blah" in the dictionary; its entries are the same as in the PHP array):
import re
step1list = {
u"φαγια": u"φα",
blah blah blah blah
}
stem = ""
suffix=""
firstch=""
s = u"σογια"
reg = re.compile(r'/(.*)(φαγια|φαγιου|φαγιων|σκαγια|σκαγιου|σκαγιων|ολογιου|ολογια|ολογιων|σογιου|σογια|σογιων|τατογια|τατογιου|τατογιων|κρεασ|κρεατοσ|κρεατα|κρεατων|περασ|περατοσ|περατα|περατων|τερασ|τερατοσ|τερατα|τερατων|φωσ|φωτοσ|φωτα|φωτων|καθεστωσ|καθεστωτοσ|καθεστωτα|καθεστωτων|γεγονοσ|γεγονοτοσ|γεγονοτα|γεγονοτων)$');
m = reg.search(s)
if m:
    stem = m.group(1)
    suffix = m.group(2)
    s = "{0}{1}".format(stem, step1list[suffix])
print(s)
print(stem)
print(suffix)
What I get as a result is:
σογια
(followed by two blank lines), which means that the pattern never matched and the two groups were never captured :(
How do I mend this?
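One likely culprit, offered as a guess rather than a confirmed diagnosis: the Python pattern still starts with the `/` that PHP uses only as a regex delimiter. In Python that slash is a literal character, so `re.search` would need an actual "/" in the word and returns `None` for "σογια". The sketch below assumes Python 3, uses a shortened, hypothetical subset of the suffix table, and wraps the logic in a hypothetical helper named `step1`. The match object's `group(1)` and `group(2)` play the role of `$fp[1]` and `$fp[2]`.

```python
import re

# Hypothetical, shortened subset of the suffix table, for illustration only.
step1list = {
    "φαγια": "φα",
    "σογια": "σο",
    "σογιου": "σο",
    "κρεατων": "κρε",
}

# Build the alternation from the dictionary keys.
# Note: no leading "/" and no trailing "/u"; those belong to PHP's syntax.
step1regexp = re.compile("(.*)(" + "|".join(step1list) + ")$")


def step1(word):
    """Replace a known suffix with its stem form; return the word unchanged otherwise."""
    m = step1regexp.search(word)
    if m is None:
        return word
    stem, suffix = m.group(1), m.group(2)  # counterparts of $fp[1] and $fp[2]
    return stem + step1list[suffix]


print(step1("σογια"))
```

If this runs under Python 2 instead, the pattern would presumably also need to be a unicode string so that it is compared against the unicode input character by character rather than byte by byte.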