First off, this whole thing can be written far more nicely, by setting up variables for components of that overly long pattern and then using them in the pattern itself.
Practically all of the basic regex syntax that is supported in both of these languages is the same, or close enough, so that I won't list here what [..]
or \s
mean. What needs translation is the overall operation (operators, functions, etc.) and the few flags that are used.
The group of regexes uses the substitution operator, $x =~ s/pattern/repl/
, whereby the substitutions are done on the variable $x
(and in-place). In Python that is re.sub.
The trailing modifiers /gis
in Perl regex mean: find and replace all occurrences of pattern (/g
), ignore the case (/i
), and make .
match anything (/s
), including the newline.
In Python, to replace all occurrences of the pattern just leave out the count
(or set it to zero), which would've been the fourth argument in re.sub
below (between string
and flags), while for the other two there are flags:
IGNORECASE (or I) and DOTALL (or S)
Altogether we have
import re
result = re.sub(pattern, replacement, string, flags=re.I|re.S)
which returns the new string, unlike Perl's default in-place substitution, so assign the result of re.sub back to the string if you wish to emulate the given regex.
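A minimal sketch of that call on made-up input, to show what the missing count and the two flags do. The sample text and the pattern are assumptions for illustration only, not taken from the question.

```python
import re

text = "Foo\nBAR foo\nbar"  # hypothetical sample input

# No count given: every match is replaced, like Perl's /g.
# re.I ignores case (/i); re.S lets . match a newline as well (/s).
replaced_all = re.sub(r"foo.bar", "X", text, flags=re.I | re.S)

# count=1 replaces only the first match, like s/// without /g.
replaced_first = re.sub(r"foo.bar", "X", text, count=1, flags=re.I | re.S)

# re.sub returns a new string; the original is left untouched.
print(repr(replaced_all))
print(repr(replaced_first))
print(repr(text))
```

Passing count and flags by keyword avoids mixing up the two, since both are plain integers.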
In addition to the references perlre and perlop, linked above, some other useful resources for Perl regex are the tutorial perlretut and the quick reference perlreref.
Here is the first regex for a fuller example. I'd like to first rewrite it on the Perl side:
# Opening "item", a phrase, and phrases with alternation
# Each qr carries its own /i, since a compiled pattern keeps its own flags when interpolated
my $item = qr/item\s+7[^0-9a-z\"]*management(?:[^0-9a-z]{0,3}s)?\s+/i;
my $phrase = qr/discussions?\s+and\s+analysis\s+of\s+/i;
my $pa1 = qr/(?:financial\s+conditions?\s+|results\s+of\s+operations?)/i;
my $pa2 = qr/(?:\s+and\s+results\s+of\s+operations?|\s+and\s+financial\s+conditions?)?/i;
# The second capture group is opened and closed here, so that every qr has balanced parentheses
$x =~ s/([^\"])($item$phrase$pa1$pa2)/$1#######ITEM7:$2#######/gis;
I've used qr to construct a proper regex pattern (similar in spirit to a re.compile
object in Python), though in this case an ordinary string would be fine as well.
I've replaced the long-outdated \1
and \2
on the replacement side by $1
and $2
. (The \1
is a back-reference, meant for use on the matching side of the regex.)
In Python, with the gigantic pattern as given in the question
patt = re.compile(r"...", flags=re.I|re.S)
string = patt.sub(r"\g<1>#######ITEM7:\g<2>#######", string)
or, better, first form subpatterns as above (the ellipses indicate that they need to be completed)
item = r"(item\s+..."
phrase = r"discussions?..."
pa1 = r"(?:financial\s..."
pa2 = r"(?:\s..."
patt = re.compile(r'([^\"])'+item+phrase+pa1+pa2, flags=re.I|re.S)
string = patt.sub(r"\g<1>#######ITEM7:\g<2>#######", string)
The use of re.compile
is by no means compulsory; re.sub
(used directly in the beginning, for example) is most often exactly the same. But I consider re.compile
as a nice device for code organization (leaving the efficiency question aside).
If you're on a Python older than 2.7 (or 3.1), where re.sub has no flags argument, you'll need re.compile
in order to use flags.
All of the pattern itself is the same in Python as far as I can see, so you can simply copy it.
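As a rough sketch of the whole substitution in Python, the subpatterns from the Perl rewrite can be copied into raw strings and concatenated. The helper name mark_item7 and the sample sentence are assumptions made up for this illustration, and the second capture group is written out explicitly in the final pattern so that each piece stays balanced on its own.

```python
import re

# Subpatterns copied from the Perl side; raw strings keep \s intact.
item = r'item\s+7[^0-9a-z"]*management(?:[^0-9a-z]{0,3}s)?\s+'
phrase = r'discussions?\s+and\s+analysis\s+of\s+'
pa1 = r'(?:financial\s+conditions?\s+|results\s+of\s+operations?)'
pa2 = r'(?:\s+and\s+results\s+of\s+operations?|\s+and\s+financial\s+conditions?)?'

# Group 1: one character that is not a double quote. Group 2: the heading.
patt = re.compile(r'([^"])(' + item + phrase + pa1 + pa2 + r')',
                  flags=re.I | re.S)


def mark_item7(text):
    """Wrap every matched heading in the ####### markers (hypothetical helper)."""
    return patt.sub(r"\g<1>#######ITEM7:\g<2>#######", text)


sample = ("See Item 7. Management's Discussion and Analysis of "
          "Results of Operations and Financial Condition.")
print(mark_item7(sample))
```

Text without a matching heading comes back unchanged, since re.sub only touches the matches.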
An example: (?:[^0-9a-z]{0,3}s)?
works as follows
non-capturing (?: ... )
groups things (but doesn't capture anything), so one can make it ...
optional with (?: ... )?
with that last ?
(match 0 or 1 time, on the whole thing)
negated character class [^0-9a-z]
matches anything other than a digit or lowercase letter ...
zero-to-three times with [^0-9a-z]{0,3}
(the 0 cannot simply be dropped: {3} on its own means exactly three times)
the s
in the end is just the literal character s
Note that with the flag /i
(re.I
) the negated character class above excludes all letters.
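The pieces above can be checked in isolation. This is a small sketch that tests the optional group against a few made-up strings; the helper name whole is an assumption for this example, and re.fullmatch is used so that the group has to account for the entire string.

```python
import re

OPTIONAL_S = r"(?:[^0-9a-z]{0,3}s)?"


def whole(text, flags=0):
    """True if the optional group matches all of text (hypothetical helper)."""
    return re.fullmatch(OPTIONAL_S, text, flags) is not None


print(whole(""))           # the whole group is optional
print(whole("s"))          # zero characters from the class, then s
print(whole("'s"))         # one character from the class, then s
print(whole("'\"-s"))      # three characters from the class, then s
print(whole("''''s"))      # four is one too many for {0,3}
print(whole("As"))         # uppercase A is not excluded by [^0-9a-z] ...
print(whole("As", re.I))   # ... unless case is ignored
```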
The last statement with a regex
my @M = $y =~ m/(...)+/g;
matches all occurrences (/g
) of the given pattern in the string $y
(the match operator m//
is bound to $y
by =~
operator) and returns the list of matches, assigned to array @M.
In Perl the match operator can return 1
or an empty string (true/false), or a list with the actual matches, depending on the context it is in. Here list context is imposed on it by the fact that the expression $y =~ m/.../
assigns to an array.
I've removed the unneeded parentheses above and added the declaration of the variable, my @M. I don't see anything interesting in that long pattern so I'm leaving it out.
You get this in Python with the basic use of re.findall.
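A short sketch of re.findall, since the long pattern is left out here. The input string and both patterns are made up for illustration. One point to keep in mind when translating: with a capturing group in the pattern, re.findall returns what the group captured rather than the full match.

```python
import re

y = "ITEM7: alpha ITEM8: beta item7a: gamma"  # hypothetical input

# No capturing group: the full text of every match is returned.
labels = re.findall(r"item\w+", y, flags=re.I)

# One capturing group: only the captured part of every match is returned.
numbers = re.findall(r"item(\w+)", y, flags=re.I)

print(labels)
print(numbers)
```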
Question's edit. The code
for($i = 0; $i < scalar(@X); ++$i)
iterates through indices of the array @X
, but a much nicer (and better) way is
for my $i (0..$#X)
using the syntax $#X
for the last index of @X
and the range operator n .. m
. The syntax $X[$i]
is for the element of array @X
which is at index $i
. Arrays in Perl are 0-based.
Then inside the loop there is a simple condition based on a regex match
if ( $X[$i] =~ m/^(ITEM(?:7|7A|8)):(.*)$/s )
where the match operator m//
here returns 1
/''
(true/false), being in the scalar context (condition of the if
statement ultimately needs a boolean value). So if there are matches the if
gets a non-zero number and evaluates as true, otherwise the code drops to else.
The modifier /s
, also seen in substitution regex, makes .
match newlines as well so that the whole pattern can match across lines in a multiline string.
In both if
-else
branches elements of yet other arrays are set (@Z
and @Y
), and if there was a match then the patterns captured by the regex are used ($1
and $2
).
Finally, the .
is the concatenation operator and the expression $i . ':' . $1
joins the value of $i
, literal :
, and (the first capture) $1
. The length_in_words()
is a subroutine.
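A possible Python rendering of that loop, as a sketch only. The contents of X are invented, and what exactly goes into Y and Z in each branch is an assumption here (the captures are simply stored), so adjust the assignments to what the original code does.

```python
import re

X = ["ITEM7:first line\nsecond line", "no marker here", "ITEM7A:tail"]  # hypothetical
Y = [None] * len(X)
Z = [None] * len(X)

# re.S plays the role of /s, so .* runs across newlines to the end.
item_re = re.compile(r"^(ITEM(?:7|7A|8)):(.*)$", flags=re.S)

for i, chunk in enumerate(X):      # like: for my $i (0..$#X)
    m = item_re.search(chunk)      # like: $X[$i] =~ m/.../s
    if m:
        Z[i] = str(i) + ":" + m.group(1)   # like: $i . ':' . $1
        Y[i] = m.group(2)                  # like: $2 (assumed use)
    else:
        # The real else branch sets other values; left as None in this sketch.
        Z[i] = Y[i] = None

print(Z)
print(Y)
```

A match object is truthy and a failed search gives None, which lines up with the 1/'' that Perl's match operator returns in scalar context.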
Edit: The subroutine length_in_words()
has now been added to the question.
In short: the subroutine takes a string and returns the number of words in it.
The shift removes the first element from an array. By default it does this to @_
(when in a sub), the array with function arguments. So $x
is the input string, the one that the function was called with.
The regex matches all words (\S+
under /g
modifier) in $x
and returns that list, which is assigned to array @k
. Then scalar
gives the number of elements in that array, which is what the sub returns.
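In Python the same subroutine could look roughly like this. It is a sketch based on the description above, with the function name carried over from the Perl code.

```python
import re


def length_in_words(text):
    """Return the number of whitespace-separated words in text."""
    words = re.findall(r"\S+", text)  # all runs of non-whitespace, as with /g
    return len(words)                 # what scalar does for the Perl array


print(length_in_words("one two  three\nfour"))
```

For plain whitespace-separated words, len(text.split()) is a common alternative without a regex.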