Is there any way (a library, an algorithm, etc.) to identify and extract regular expressions from an unstructured, random string? For example, I am given the following string:

$betterline = ($line -match "\(\d+\)(?:\w+\(\d+\))+$") -replace "\(\d+\)", "."

and I would like to detect `\(\d+\)(?:\w+\(\d+\))+$` and `\(\d+\)`. Even an approximate solution would be fine. I prefer Python, but I can use other languages as well.

Kiarahmani
  • Python has the `re` library which you can import - [Documentation](https://docs.python.org/3.8/library/re.html). – Philip Ciunkiewicz Jul 09 '20 at 16:21
  • Thanks for your reply. Which function exactly are you referring to? I know I can compile the text and see whether it throws an exception, but that's not exactly what I need. I need to find the substrings that would be accepted as valid regexes. – Kiarahmani Jul 09 '20 at 16:22
  • Use `re.findall` – jdaz Jul 09 '20 at 16:29
  • Are you using Python or powershell? – Paolo Jul 09 '20 at 16:29
  • What would you like to extract from the following string? `'Maggie "Pepper" Jones asked about "(\d+", 'cat \"and\" the hat' and "$2.34"'`. Whenever you give an example please show the desired result. – Cary Swoveland Jul 09 '20 at 16:30
  • Will the input really be unstructured and random or will it be PowerShell code as in your example? And if it won't always be PowerShell code, will it at least be code in some language? Or to get at the real question: Can we assume that the regex will always be surrounded by delimiters, such as `"`, `'` or `/`? Can we assume that a specific syntax dialect will always be used? – sepp2k Jul 09 '20 at 16:47
  • The input is unstructured and random. I understand that no regex can match only the valid regexes in a text (https://stackoverflow.com/questions/172303/is-there-a-regular-expression-to-detect-a-valid-regular-expression?rq=1). What I am looking for is an approximate solution: when the input includes a pattern that is very likely to be a regex surrounded by non-regex text, it should return the former. In the example above, since no such case exists, the solution should return the whole string. – Kiarahmani Jul 09 '20 at 17:12
  • I am trying to go through my examples, understand what patterns exist, and clean my data manually, e.g. by common delimiters. I was wondering whether an automated tool exists. – Kiarahmani Jul 09 '20 at 17:16
  • @Kiarahmani Is "all substrings that are valid regexes" even really what you want though? Because in your example there are plenty more substrings that would make valid regular expressions: `betterline` is a valid regex for example (or even `$betterline` if you ignore the fact that it can never match or if `$` has no special meaning in your dialect) - so is `-match` etc. In fact if you ignore the fact that nothing can match after `$` (or if you assume `$` has no special meaning), the entire line is one valid regex. – sepp2k Jul 09 '20 at 17:17
  • @sepp2k You are right. I am starting to have a better idea of how to do it. Let's assume we have literal regexes and non-literal regexes. The latter must include some quantifiers and/or character sets. My goal is to extract non-literal regexes from large strings of random stuff. I guess I will just find substrings enclosed by certain delimiters, e.g. `"` or `/`. – Kiarahmani Jul 09 '20 at 17:23
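
The heuristic the comments converge on (take delimited substrings, keep only those that look "non-literal", and check that they compile) could be sketched roughly as follows. This is only an approximation under stated assumptions: the delimiter set (`"`, `'`, `/`), the list of constructs that count as "non-literal", and the function name `extract_regex_candidates` are illustrative choices, not something given in the question. It also validates against Python's `re` dialect, which may differ from the dialect of the source text.

```python
import re

# Assumption: candidate regexes are enclosed in double quotes, single quotes,
# or slashes. Backslash-escaped characters inside the body are skipped over.
DELIMITED = re.compile(r'''
    "((?:[^"\\]|\\.)*)"      # double-quoted body
  | '((?:[^'\\]|\\.)*)'      # single-quoted body
  | /((?:[^/\\]|\\.)+)/      # slash-delimited body
''', re.VERBOSE)

# Assumption: a "non-literal" regex contains at least one of: a character-class
# escape, a bracketed set, a (?...) group, a quantifier, or a {m,n} repetition.
NON_LITERAL = re.compile(
    r'\\[dDwWsSbB]|\[[^\]]+\]|\(\?|[^\\][+*?]|\{\d+(?:,\d*)?\}'
)

def extract_regex_candidates(text):
    """Return delimited substrings that look like non-literal regexes."""
    candidates = []
    for match in DELIMITED.finditer(text):
        # Exactly one of the three groups participates in each match.
        body = next(g for g in match.groups() if g is not None)
        if not body or not NON_LITERAL.search(body):
            continue  # empty or purely literal text: skip
        try:
            re.compile(body)  # validity check in Python's dialect only
        except re.error:
            continue
        candidates.append(body)
    return candidates
```

On the PowerShell line from the question this should yield the two patterns asked for and skip the replacement string `"."`. Expect false positives on ordinary quoted text that happens to contain `?`, `*`, or `+`, so the result is a list of candidates to review rather than a definitive answer.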

1 Answer

This page explains your case very well.

The following example, taken from that page, shows how to compile a pattern and match it against a string:

>>> import re
>>> p = re.compile('[a-z]+')
>>> p
re.compile('[a-z]+')

>>> m = p.match('tempo')
>>> m
<re.Match object; span=(0, 5), match='tempo'>
  • I don't see how the linked page or the code you posted relate to extracting regular expressions from a string. – sepp2k Jul 09 '20 at 20:28