3

I am looking for Python code that can create random data matching any regex. For example, if the regex is

\d{1,100}

I want random strings of digits, each with a random length between 1 and 100 (uniformly distributed).

There are some 'regex inverters' available (see here) which compute ALL possible matches. That is not what I want, and it is extremely impractical: the example above has more than 10^100 possible matches, which can never be stored in a list. I just need a function that returns one match at random.

Maybe there is a package already available that can accomplish this? I need a function that creates a matching string for ANY regex, not just the one given above, but maybe 100 different regexes. I cannot code a generator for each of them by hand; I want the function to take the pattern and return a matching string.

Alex
  • Is `\d{1,100}` the real example? Could you paste some of the patterns you actually work with? If the regex is not *too* fuzzy, you can generate a random string that is close to the matching expression and discard it if it does not match. – Jakub M. Jul 31 '13 at 05:46
  • The example given is just an example. I do not know yet what exactly I will need. It can be much more complex, like `[A-C]{2}\d{2,20}@\w{10,1000}`. – Alex Jul 31 '13 at 05:49
  • What sort of distribution do you want over the possible strings? For your example, for instance, would you want most of the random results to be 100 digits long (since 90% of the valid matching strings will have that many)? Or would you want each length to occur with equal chance? – Blckknght Jul 31 '13 at 05:58
  • The latter. Each length should occur with equal chances. – Alex Jul 31 '13 at 06:01
  • Well, you can't possibly solve this for *all* legal regex patterns, since anything that can match a string of unbounded length (like `\d*`) would have an infinite number of possible lengths to choose from. But if you limit the regex syntax that's allowed a bit, it's probably doable. – Blckknght Jul 31 '13 at 06:12
  • Yes that is true, and a limited syntax would be good enough. – Alex Jul 31 '13 at 06:13
  • Good point with `\d*`! also, `[^x]` has infinite solutions, or even `.` – Jakub M. Jul 31 '13 at 06:14
  • Possible duplicate of [Reversing a regular expression in Python](https://stackoverflow.com/questions/492716/reversing-a-regular-expression-in-python) – Anderson Green Feb 10 '18 at 20:24
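
The generate-and-discard idea raised in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `random_match_by_rejection` and its parameters (`alphabet`, `min_len`, `max_len`, `max_tries`) are hypothetical, and the caller has to supply an alphabet and length bounds that are "close" to the pattern. Lengths are drawn uniformly, but the lengths of the accepted strings stay uniform only if every length is equally likely to match.

```python
import random
import re
import string


def random_match_by_rejection(pattern, alphabet, min_len, max_len, max_tries=100_000):
    """Draw random strings over `alphabet` until one fully matches `pattern`.

    Returns None if no candidate matched within `max_tries` attempts.
    """
    compiled = re.compile(pattern)
    for _ in range(max_tries):
        length = random.randint(min_len, max_len)
        candidate = ''.join(random.choice(alphabet) for _ in range(length))
        if compiled.fullmatch(candidate):
            return candidate
    return None


# Every digit string of length 1..100 matches, so the first try succeeds.
print(random_match_by_rejection(r'\d{1,100}', string.digits, 1, 100))
```

For patterns with fixed literals such as `@`, almost every candidate is rejected, which is why this approach only suits fairly loose patterns.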

3 Answers

2

If the expressions you need to match do not use any "advanced" features, like look-ahead or look-behind, then you can parse them yourself and build a proper generator.

Treat each part of the regex as a function returning something (e.g., between 1 and 100 digits) and glue them together at the top:

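import random
from string import digits, ascii_uppercase, ascii_letters

def joiner(*items):
    # Call each part's generator and concatenate the results.
    # (For consistency with roll() and rand(), this could return a lambda instead.)
    return ''.join(item() for item in items)

def roll(item, n1, n2=None):
    # Repeat `item` between n1 and n2 times (exactly n1 times if n2 is omitted);
    # random.randint makes every length in that range equally likely.
    if n2 is None:
        n2 = n1
    return lambda: ''.join(item() for _ in range(random.randint(n1, n2)))

def rand(collection):
    # Pick one random element of `collection` on each call.
    return lambda: random.choice(collection)

# this is a generator for /\d{1,10}:[A-Z]{5}/
print(joiner(roll(rand(digits), 1, 10),
             rand(':'),
             roll(rand(ascii_uppercase), 5)))

# [A-C]{2}\d{2,20}@\w{10,1000}
# (\w also matches digits and '_'; letters alone are a valid subset)
print(joiner(roll(rand('ABC'), 2),
             roll(rand(digits), 2, 20),
             rand('@'),
             roll(rand(ascii_letters), 10, 1000)))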

Parsing the regex is a separate problem. So this solution is not universal, but it may be sufficient.
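
For a restricted syntax, the parsing step could be sketched along these lines. This is a minimal illustration, not a general regex parser. It assumes the pattern uses only literal characters, `\d`, `\w` (ASCII only), bracketed classes with ranges, and `{n}` / `{m,n}` counts. The names `parse_class`, `parse_quantifier`, `compile_generator` and `random_match` are hypothetical, and groups, alternation, `*`, `+`, `?` and negated classes are deliberately not handled.

```python
import random
import re
import string

# Character sets for the escapes this sketch understands (ASCII only, by assumption).
ESCAPES = {
    'd': string.digits,
    'w': string.ascii_letters + string.digits + '_',
}


def parse_class(pattern, i):
    """Parse a class body such as A-C0-9]; pattern[i] is the char after '['.

    Returns (characters, index just past the closing ']').
    """
    chars = []
    while pattern[i] != ']':
        if pattern[i + 1] == '-' and pattern[i + 2] != ']':
            # A range like A-C: expand it to every character in between.
            lo, hi = ord(pattern[i]), ord(pattern[i + 2])
            chars.extend(chr(c) for c in range(lo, hi + 1))
            i += 3
        else:
            chars.append(pattern[i])
            i += 1
    return ''.join(chars), i + 1


def parse_quantifier(pattern, i):
    """Parse an optional {n} or {m,n} at pattern[i]; return (low, high, next index)."""
    if i < len(pattern) and pattern[i] == '{':
        end = pattern.index('}', i)
        low, comma, high = pattern[i + 1:end].partition(',')
        if comma and not high:
            raise ValueError('unbounded repetition is not supported')
        return int(low), int(high or low), end + 1
    return 1, 1, i


def compile_generator(pattern):
    """Turn a pattern in the limited syntax into a list of (chars, low, high)."""
    parts = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == '\\':
            # Known escape (\d, \w) or an escaped literal such as \.
            nxt = pattern[i + 1]
            chars, i = ESCAPES.get(nxt, nxt), i + 2
        elif c == '[':
            chars, i = parse_class(pattern, i + 1)
        else:
            chars, i = c, i + 1
        low, high, i = parse_quantifier(pattern, i)
        parts.append((chars, low, high))
    return parts


def random_match(pattern):
    """Return one random match; each repeat count in {m,n} is equally likely."""
    out = []
    for chars, low, high in compile_generator(pattern):
        count = random.randint(low, high)
        out.extend(random.choice(chars) for _ in range(count))
    return ''.join(out)


sample = random_match(r'\d{1,10}:[A-Z]{5}')
print(sample)
# Sanity check against the real regex engine.
assert re.fullmatch(r'\d{1,10}:[A-Z]{5}', sample)
```

Each parsed part plays the role of one `roll(rand(...), n1, n2)` call, so the pattern string replaces the hand-written composition.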

Jakub M.
  • This is pretty close to what I was going to say in my own answer. Yes, it can be done, but just parsing the regex pattern is enough of a challenge that I don't think any random SO answerer is going to produce working code. – Blckknght Jul 31 '13 at 06:08
  • @Blckknght: parsing a regex would be difficult but definitely doable, you can use regex for that :) – Jakub M. Jul 31 '13 at 12:12
2

Two Python libraries can do this: sre-yield and Hypothesis.

  1. sre-yield

sre-yield will generate all values matching a given regular expression. It uses SRE, Python's default regular expression engine.

For example,

import sre_yield
list(sre_yield.AllStrings('[a-z]oo$'))
['aoo', 'boo', 'coo', 'doo', 'eoo', 'foo', 'goo', 'hoo', 'ioo', 'joo', 'koo', 'loo', 'moo', 'noo', 'ooo', 'poo', 'qoo', 'roo', 'soo', 'too', 'uoo', 'voo', 'woo', 'xoo', 'yoo', 'zoo']

For decimal numbers,

list(sre_yield.AllStrings(r'\d{1,2}'))
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99']
  2. Hypothesis

The unit test library Hypothesis will generate random matching examples. It is also built using SRE.

import hypothesis
g=hypothesis.strategies.from_regex(r'^[A-Z][a-z]{4}$')
g.example()

with output such as:

'Gssov', 'Lmsud', 'Ixnoy'

For decimal numbers

d=hypothesis.strategies.from_regex(r'^[0-9]{1,2}$')

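will output one- or two-digit decimal numbers such as 65, 7, 67, although they are not evenly distributed. Using \d instead of [0-9] yielded strings that did not print properly, because in a Unicode pattern \d matches any Unicode decimal digit, not just 0-9.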

Note: use the begin and end anchors (^ and $) to keep Hypothesis from adding extraneous characters around the match.

Brad Schoening
  • `k=list(sre_yield.AllStrings('[a-zA-Z]\d{7}'))` Is there a way to limit the number of results? Generating 9-digit values would take a lot of time. – MAC Aug 23 '22 at 08:22
0

From this answer

You could try using Python to call this Perl module:

https://metacpan.org/module/String::Random

John Jiang
  • I would prefer a Python-only solution, as the Perl solution requires me (i) to fix a problem with a String/Random import, (ii) to pass parameters to the Perl function, (iii) to return the output, and (iv) probably to deal with other issues when running this code on different machines, Linux and Windows. – Alex Jul 31 '13 at 06:00