1

I need to convert some regex from perl to python, but I am not that familiar with perl regex.

I have the following:

$x =~ s/([^\"])(item\s+7[^0-9a-z\"]*management(?:[^0-9a-z]{0,3}s)?\s+discussions?\s+and\s+analysis\s+of\s+(?:financial\s+conditions?\s+|results\s+of\s+operations?)(?:\s+and\s+results\s+of\s+operations?|\s+and\s+financial\s+conditions?)?)/\1#######ITEM7:\2#######/gis;

$x =~ s/([^\"])(item\s+7[^0-9a-z\"]*a[^0-9a-z\"]*(?:quantitative\s+and\s+(?:qualitative|qualification)\s+disclosures?\s+about\s+)?market\s+risk)/\1#######ITEM7A:\2#######/gis;

$x =~ s/([^\"])(item\s+8[^0-9a-z\"]*.{0,40}financial\s+statements[^\.])/\1#######ITEM8:\2#######/gis;

@X = (split /\#\#\#\#\#\#\#/, $x)

I believe that s/ is equivalent to python re.split but I'm not sure what /gis does though.

Also, I am not sure what this means either:

(@M) = ($y =~ m/((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+/g)

I would greatly appreciate the help !

EDIT:

Just another quick question, what exactly does:

for($i = 0; $i < scalar(@X); ++$i) {
  if($X[$i] =~ m/^(ITEM(?:7|7A|8)):(.*)$/s) {
    $Z[$i] = $2; 
    $Y[$i] = $i . ':' . $1; 
  } else {   
    $Z[$i] = $X[$i]; 
    $Y[$i] = $i . ':' . length_in_words($X[$i]);  
  }
}

sub length_in_words {
  my $x = shift;
  my @k;
  return scalar(@k = $x =~ m/(\S+)/sg);
}
PutsandCalls
  • `s/` means `substitute` (replace). `s/old/new/` - replace `old` with `new`, `m/` means `match` (check if you can find), flag `g` = `global` (replace all `old`, not only first), flag `i` = `insensitive` / `ignore case` (treats uppercase chars and lowercase chars as the same = `old`, `OLD`, `oLd`, `Old`, etc.), `s` = I don't know. Most regex should be very similar. – furas Apr 25 '19 at 03:21
  • ok, what does this mean by the way: @X = (split /\#\#\#\#\#\#\#/, $x) in the first part. For example: `$x =~ s/([^\"])(item\s+8[^0-9a-z\"]*.{0,40}financial\s+statements[^\.])/\1#######ITEM8:\2#######/gis; @X = (split /\#\#\#\#\#\#\#/, $x)` How do i use re.split with the `/\#\#\#\#\#\#\#/` – PutsandCalls Apr 25 '19 at 04:05
  • `X = re.split('\#\#\#\#\#\#\#', x)` - almost the same - only without `/` at both sides – furas Apr 25 '19 at 06:24

2 Answers

5

First off, this whole thing can be written far more nicely, by setting up variables for components of that overly long pattern and then using them in the pattern itself.

Practically all of the basic regex syntax that is supported in both of these languages is the same, or close enough, so I won't list here what [..] or \s mean. What needs translation is the overall operation (operators, functions, etc.), and the few flags that are used.

  • The group of regexes in the question uses the substitution operator, $x =~ s/pattern/repl/, whereby the substitutions are done on the variable $x (and in-place). In Python that is re.sub.

  • The trailing modifiers /gis in Perl regex mean: find and replace all occurrences of pattern (/g), ignore the case (/i), and make . match anything (/s), including the newline.

    In Python, to replace all occurrences of the pattern just leave out the count (or set it to zero), which would've been the fourth argument in re.sub below (between string and flags), while for the other two there are flags: IGNORECASE (or I) and DOTALL (or S).

Altogether we have

import re

result = re.sub(pattern, replacement, string, flags=re.I|re.S)

which returns the new string, unlike Perl's default in-place substitution, so assign re.sub back to the string if you wish to emulate the given regex.
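As a small illustration of the flag and count mapping described above, here is a sketch on a made-up three-line string (the sample text and variable names are mine, not from the question):

```python
import re

text = "Item one\nITEM two\nitem three"

# Perl: s/item/X/gi  -> every occurrence, ignoring case (count left out)
all_replaced = re.sub(r"item", "X", text, flags=re.I)

# Perl: s/item/X/i   -> no /g, so only the first occurrence (count=1)
first_only = re.sub(r"item", "X", text, count=1, flags=re.I)

# Perl's /s corresponds to re.S: "." then also matches a newline
across_lines = re.sub(r"one.*two", "-", text, flags=re.S)
without_dotall = re.sub(r"one.*two", "-", text)

print(repr(all_replaced))
print(repr(first_only))
print(repr(across_lines))
print(repr(without_dotall))
```

Note that text itself is never changed; each call hands back a new string.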

In addition to the references perlre and perlop, some other useful resources for Perl regex are the tutorial perlretut and the quick reference perlreref.


Here is the first regex for a fuller example. I'd like to first rewrite it on the Perl side

# Opening "item", a phrase, and phrases with alternation
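# Each qr// piece must have balanced parentheses and carries its own flags, hence /i on each
my $item   = qr/item\s+7[^0-9a-z\"]*management(?:[^0-9a-z]{0,3}s)?\s+/i;
my $phrase = qr/discussions?\s+and\s+analysis\s+of\s+/i;
my $pa1 = qr/(?:financial\s+conditions?\s+|results\s+of\s+operations?)/i;
my $pa2 = qr/(?:\s+and\s+results\s+of\s+operations?|\s+and\s+financial\s+conditions?)?/i;

# The second capture group is opened and closed here, around the assembled pieces
$x =~ s/([^\"])($item$phrase$pa1$pa2)/$1#######ITEM7:$2#######/gis;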

I've used qr to construct a proper regex pattern (similar in spirit to re.compile object in Python), while in this case an ordinary string would be fine as well.

I've replaced the long-outdated \1 and \2 on the replacement side by $1 and $2. (The \1 is used as a back-reference for work in the matching side of the regex.)

In Python, with the gigantic pattern as given in the question

patt = re.compile(r"...", flags=re.I|re.S)

string = patt.sub(r"\g<1>#######ITEM7:\g<2>#######", string)

or, better, first form subpatterns as above (ellipsis indicating that they need be completed)

item   = r"item\s+..."
phrase = r"discussions?..."
pa1    = r"(?:financial\s..."
pa2    = r"(?:\s..."

# group 1: the preceding non-quote character; group 2: the whole heading
patt = re.compile(r'([^"])(' + item + phrase + pa1 + pa2 + r')', flags=re.I|re.S)

string = patt.sub(r"\g<1>#######ITEM7:\g<2>#######", string)

The use of re.compile is by no means compulsory; re.sub (used directly in the beginning, for example) is most often exactly the same. But I consider re.compile as a nice device for code organization (leaving the efficiency question aside).

If you're on a Python older than 2.7 you'll need re.compile in order to use flags, since re.sub only accepts a flags argument from 2.7 (and 3.1) on.

All of the pattern itself is the same in Python as far as I can see, so you can simply copy it.
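To make that concrete, here is a sketch of the first substitution and the final split with the subpatterns filled in from the question's regex. The function name mark_item7 and the sample text are my own, and the other two substitutions (ITEM7A, ITEM8) would follow the same shape.

```python
import re

# Subpatterns copied from the Perl regex in the question; only the names are mine.
item   = r'item\s+7[^0-9a-z"]*management(?:[^0-9a-z]{0,3}s)?\s+'
phrase = r'discussions?\s+and\s+analysis\s+of\s+'
pa1    = r'(?:financial\s+conditions?\s+|results\s+of\s+operations?)'
pa2    = r'(?:\s+and\s+results\s+of\s+operations?|\s+and\s+financial\s+conditions?)?'

# Group 1 is the preceding non-quote character, group 2 the whole heading.
patt = re.compile(r'([^"])(' + item + phrase + pa1 + pa2 + r')', flags=re.I | re.S)

def mark_item7(text):
    """Wrap each Item 7 heading in ####### markers, like the Perl s///gis."""
    return patt.sub(r'\g<1>#######ITEM7:\g<2>#######', text)

# Made-up sample input
sample = ("PART II\nItem 7. Management's Discussion and Analysis of "
          "Results of Operations and Financial Condition\nOverview")

marked = mark_item7(sample)
parts = re.split(r'#######', marked)   # Perl: @X = split /\#\#\#\#\#\#\#/, $x

for p in parts:
    print(repr(p))
```

The # character has no special meaning in a Python pattern unless re.VERBOSE is used, so it needs no escaping in re.split.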

An example: (?:[^0-9a-z]{0,3}s)? works as follows

  • non-capturing (?: ... ) groups things (but doesn't store any), so one can make it ...

  • optional with (?: ... )? with that last ? (match 0 or 1 time, on the whole thing)

  • negated character class [^0-9a-z] matches anything other than a digit or lowercase letter ...

  • zero-to-three times with [^0-9a-z]{0,3} (the 0 is needed here: {3} alone would mean exactly three times; Python also accepts {,3} for zero-to-three)

  • the s in the end is just the literal character s

Note that with the flag /i (re.I) the negated character class above excludes all letters.
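A quick way to check this reading is to try the fragment on a few strings. The sketch below anchors it with the literal word and one trailing whitespace character; the sample strings are made up.

```python
import re

# The fragment under discussion, between a literal word and a whitespace character.
piece = re.compile(r"management(?:[^0-9a-z]{0,3}s)?\s", flags=re.I)

samples = ["management ", "managements ", "management's ",
           "management---s ", "management----s "]
results = {s: bool(piece.fullmatch(s)) for s in samples}

# For comparison: {3} demands exactly three non-alphanumerics before the s.
exactly_three = re.compile(r"management(?:[^0-9a-z]{3}s)?\s", flags=re.I)
apostrophe_with_3 = bool(exactly_three.fullmatch("management's "))

# Under re.I the negated class rejects uppercase letters as well.
upper_rejected = re.fullmatch(r"[^0-9a-z]", "A", flags=re.I) is None

for s, ok in results.items():
    print(repr(s), ok)
```

Only the last sample, with four characters between the word and the s, fails to match.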


The last statement with a regex

my @M =  $y =~ m/(...)+/g;

matches all occurrences (/g) of the given pattern in the string $y (the match operator m// is bound to $y by =~ operator) and returns the list of matches, assigned to array @M.

In Perl the match operator can return 1 or empty string (true/false) or a list with the actual matches, depending on what context it is in. Here the list context is imposed on it by the fact that the expression $y =~ m/.../ assigns to an array.

I've removed the unneeded parentheses above and added the declaration of the variable, my @M. I don't see anything interesting in that long pattern so I'm leaving it out.

You get this in Python with the basic use of re.findall
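A minimal sketch of that, using the pattern from the question. The sample string y is made up; I am assuming it is the index:label pairs joined by single spaces. With exactly one capturing group in the pattern, re.findall returns a list of that group's text for each match, which mirrors what the Perl list-context match with /g returns.

```python
import re

# Pattern copied from the question's m/.../g
patt = re.compile(
    r"((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+"
)

# Hypothetical input
y = "0:12 1:ITEM7 2:3500 3:ITEM7A 4:800 5:ITEM8 6:9000 7:ITEM7 8:40 9:ITEM8 10:25"

M = patt.findall(y)   # Perl: my @M = $y =~ m/.../g;

for m in M:
    print(repr(m))
```

To the comment question about re.search: that finds only the first match, so it corresponds to a match without /g.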


Question's edit.   The code

for($i = 0; $i < scalar(@X); ++$i)

iterates through indices of the array @X, but a much nicer (and better) way is

for my $i (0..$#X)

using the syntax $#X for the last index of @X and the range operator n .. m. The syntax $X[$i] is for the element of array @X which is at index $i. Arrays in Perl are 0-based.

Then inside the loop there is a simple condition based on a regex match

if ( $X[$i] =~ m/^(ITEM(?:7|7A|8)):(.*)$/s )

where the match operator m// here returns 1/'' (true/false), being in the scalar context (condition of the if statement ultimately needs a boolean value). So if there are matches the if gets a non-zero number and evaluates as true, otherwise code drops to else.

The modifier /s, also seen in substitution regex, makes . match newlines as well so that the whole pattern can match across lines in a multiline string.

In both if-else branches elements of yet other arrays are set (@Z and @Y), and if there was a match then the patterns captured by the regex are used ($1 and $2).

Finally, the . is concatenation operator and the expression $i . ':' . $1 joins the value of $i, literal :, and (the first capture) $1. The length_in_words() is a subroutine.
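In Python the loop might look roughly like the sketch below. The function name label_parts is mine, and length_in_words is a stand-in that I assume counts whitespace-separated words.

```python
import re

def length_in_words(text):
    # Stand-in helper (assumption): number of whitespace-separated words
    return len(text.split())

def label_parts(X):
    """Build the Y (labels) and Z (contents) lists as the Perl loop does."""
    Y, Z = [], []
    for i, part in enumerate(X):
        # re.match anchors at the start of the string, like ^ here; re.S is Perl's /s
        m = re.match(r"(ITEM(?:7|7A|8)):(.*)$", part, flags=re.S)
        if m:
            Z.append(m.group(2))            # $2: the text after the label
            Y.append(f"{i}:{m.group(1)}")   # $i . ':' . $1
        else:
            Z.append(part)
            Y.append(f"{i}:{length_in_words(part)}")
    return Y, Z

# Made-up input
X = ["intro text here", "ITEM7:Item 7. Management's Discussion",
     " body words\nmore", "ITEM7A:Item 7A market risk", "x"]
Y, Z = label_parts(X)
print(Y)
```

Here re.match returns a match object or None, so it serves both as the condition and as the source of the captured groups.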


Edit:   The subroutine length_in_words() has now been added to the question.

In short: the subroutine takes a string and returns the number of words in it.

The shift removes the first element from an array and returns it. By default it does this to @_ (when in a sub), the array with function arguments. So $x is the input string, the one that the function has been called with.

The regex matches all words (\S+ under /g modifier) in $x and returns that list, which is assigned to array @k. Then scalar gives the number of elements in that list, which is what is returned.
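A direct Python counterpart, as a sketch, would count the \S+ matches with re.findall:

```python
import re

def length_in_words(text):
    # Perl: scalar(@k = $x =~ m/(\S+)/sg) -> the number of \S+ matches
    return len(re.findall(r"\S+", text))

print(length_in_words("  two  words\n"))
```

For ordinary text len(text.split()) should give the same count without a regex.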

zdim
  • Thanks for the answer, just for the last part should I be using re.findall or re.search ? I believe I read somewhere that /m is the equivalent of re.search – PutsandCalls Apr 25 '19 at 18:03
  • @PutsandCalls Oh, right -- Python code needs `import re` (is that what you ask?). I thought that you were fine with Python regex (and left it out). It should be there though and I will add it when I edit (shortly). Thank you for calling it. – zdim Apr 25 '19 at 18:07
  • I think I understand now, thank you for the help. Would you mind just helping clarify one more question in my original post ? I added an EDIT section, and I'm a bit confused by the for loop. I would greatly appreciate the help ! In particular my confusion is about `if($X[$i] =~ m/^(ITEM(?:7|7A|8)):(.*)$/s)` what exactly does this return ?. Also, what does this syntax mean `$Y[$i] = $i . ':' . $1;` – PutsandCalls Apr 25 '19 at 18:52
  • @PutsandCalls Edited and added a bit. Also added a section at the end for your edit. (But let's not add yet more so that this doesn't go off the rails :). Let me know if things are unclear etc. – zdim Apr 25 '19 at 19:27
  • @PutsandCalls I've noticed only now (and accidentally!) that you added that subroutine to your "Edit" (it wasn't there initially?). So I added a comment on it to the end of the answer -- but by now you have full answers to that from your follow-up question. I hope that they are good :) – zdim May 02 '19 at 04:37
0

Imho the regex syntax is essentially the same as Python's, since both are Perl-style (as is PCRE). The differences are in how the results are used and how replacement is done.

s///g substitutes all occurrences, which is what re.sub() does by default. s/// without /g substitutes only the first occurrence, which is re.sub() with count=1. But re.sub() returns a new string; it does not modify its argument the way $a =~ s/// does in Perl.

The i option is case-insensitive matching, which would be re.IGNORECASE or re.I.
The s option is single-line mode, i.e. the whole string is treated as one line, so . also matches a newline.
In Python that is re.DOTALL or re.S.

(@M) = ($y =~ m/((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+/g)

Imho this is simply

@arr = $y =~ /((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+/g

which instructs Perl:
match the pattern against whatever the variable $y contains and put all the captured results in the array @arr,
with $arr[0] holding the captured group of the first match, $arr[1] that of the second match, and so on, without touching anything in the original $y. In this case there is only one capture group, as (?:) captures nothing. With /g this is repeated over all occurrences until the end of the string.

But for substitution, assuming a slightly varied case: $b = $y =~ s/((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+/TEST_\1/g

match the pattern against whatever $y contains and substitute each match with TEST_\1 (where \1 stands for the 1st capture group), overwriting the original $y and continuing through the rest of the string. $b is assigned a true value (the number of substitutions made). If nothing matches, $y is left as is and
$b is assigned a false value (the empty string).
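In Python the closest counterpart I know of for getting both the result and the count is re.subn, which returns the new string together with the number of substitutions. A sketch with a made-up input string:

```python
import re

patt = re.compile(
    r"((?:\d+:ITEM7 \d+:\d+ )+(?:\d+:ITEM7A \d+:\d+ )*)(?:\d+:ITEM8 \d+:\d+\s*)+"
)

# Hypothetical input
y = "1:ITEM7 2:3500 3:ITEM8 4:9000"

# new_y plays the role of the modified $y, n the role of $b
new_y, n = patt.subn(r"TEST_\g<1>", y)

# No match: the string comes back unchanged and the count is 0
same, zero = patt.subn(r"TEST_\g<1>", "nothing to see here")

print(repr(new_y), n)
```

As with re.sub, y itself is left untouched, so assign the result back if you want the Perl behaviour.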