I have an issue with regex in Python. I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:

re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")

Output is []

Any ideas?

Tomalak
Guus Huizen
  • Can you give expected input and output examples? There are some behavior differences with parentheses and brackets. – thethiny Feb 25 '21 at 12:11
  • Also please remove the Jupyter artifacts from your code sample. The way you run your code is beside the point, and this format makes it harder for everybody to copy your code, so there is no reason to leave that in. – Tomalak Feb 25 '21 at 12:13
  • @thethiny please check the gist. – Guus Huizen Feb 25 '21 at 12:14
  • @thethiny I want to at least receive Wilhelminakade and 173 in separate groups. When, for example, "Wilhelminakade 173c" is given as input, I want to retrieve Wilhelminakade, 173 and c as separate groups. – Guus Huizen Feb 25 '21 at 12:20
  • What's the purpose of '#' before '^' and after '$'? It doesn't seem to be Python regex dialect, as far as I can tell – jetpack_guy Feb 25 '21 at 12:22
  • See [Does REGEX differ from PHP to Python](https://stackoverflow.com/questions/3070655/does-regex-differ-from-php-to-python). – Wiktor Stribiżew Feb 25 '21 at 12:33
  • Grouping in Python Regex is different. I don't remember the exact syntax but you had to name every group you want and then index it by name, while in PHP you can just \0 \1 \2. – thethiny Feb 25 '21 at 12:35

1 Answer

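PHP supports alternative characters as regex delimiters, and your sample Gist uses # for that purpose. The delimiters are not part of the regex in PHP, and Python does not use delimiters at all. In Python they are treated as literal # characters, and since nothing can come before the start of the string (^) or after its end ($), they prevent any match. Remove them.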

re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")

This still gives no result, because Python's re module has no support for POSIX character classes such as [:punct:]. Inside a set, Python reads the characters [, :, p, u, n, c and t literally, and the first ] closes the set, so the pattern means something quite different from what PHP sees. Replace [:punct:] with something else (e.g. the punctuation you expect, probably something like "dots, apostrophes, dashes"). This results in

re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")

which gives [('Wilhelminakade', '173', '')].
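To get the street, number and suffix as separate values (as asked in the comments), plain numbered groups are enough. The following is a minimal sketch that reuses the pattern above; the names ADDRESS_RE and split_address are made up for illustration.

```python
import re

# Pattern from the answer above: street name, house number, optional suffix.
ADDRESS_RE = re.compile(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$")


def split_address(address):
    """Return (street, number, suffix), or None if the address does not match.

    split_address is a hypothetical helper name, not part of the re module.
    """
    match = ADDRESS_RE.match(address)
    return match.groups() if match else None


print(split_address("Wilhelminakade 173"))   # ('Wilhelminakade', '173', '')
print(split_address("Wilhelminakade 173c"))  # ('Wilhelminakade', '173', 'c')
```

Using re.match on a single string returns one match object, so match.group(1), match.group(2) and match.group(3) are also available without naming the groups.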

Long story short, different programming languages use different regex engines. You cannot just copy a regex from PHP to Python without looking at it closely and expect it to work.
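As one example of adapting the pattern rather than copying it: if you do want something close to [:punct:], a possible approach is to build the set from string.punctuation, which holds the ASCII punctuation characters. This is only a sketch, and it assumes that ASCII punctuation is all you need; PUNCT and ADDRESS_RE are illustrative names.

```python
import re
import string

# ASCII punctuation, escaped so that characters such as ], \ and - stay
# literal inside a [...] set. Assumption: ASCII punctuation is sufficient.
PUNCT = re.escape(string.punctuation)

# Same structure as the original pattern, with [:punct:] replaced by PUNCT.
ADDRESS_RE = re.compile(
    r"^([\w" + PUNCT + r" ]+) ([0-9]{1,5})([\w" + PUNCT + r"]*)$"
)

match = ADDRESS_RE.match("Wilhelminakade 173c")
if match:
    print(match.groups())  # ('Wilhelminakade', '173', 'c')
```

This set is much more permissive than "dots, apostrophes, dashes", so whether it is the better choice depends on the addresses you expect.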

Tomalak