
I'd like to parse a list of regular expressions to calculate the likelihood of each to find a match to it in a certain text/string...

Eg. finding '[AB]' in a string of length 1 should be something around 1/13 (considering only capital letters).

Is there a generic regex parser which returns the individual positions/alternatives? I'm thinking of getting a list of positions as the return value ('[AB].A{2}' would yield [['A','B'], '.', ['AA']]).

The problem is parsing the regular expressions with pyparsing. Simple regexes are no problem, but when it comes to alternatives and repetitions, I'm lost: I find it hard to parse nested expressions like '((A[AB])|(AB))'.
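For what it's worth, a sketch of the kind of output I mean, using Python's undocumented `sre_parse` module (the parser the `re` module itself uses; it moved to `re._parser` in 3.11 but is still importable under the old name). This is just an illustration, not necessarily the approach I'll take:

```python
import sre_parse  # undocumented internal parser used by the re module

# Each node of the parse tree is an (opcode, argument) pair, e.g.
# IN for a character class, ANY for '.', MAX_REPEAT for 'A{2}'.
for op, arg in sre_parse.parse('[AB].A{2}'):
    print(op, arg)
```

The three printed nodes correspond one-to-one to the three positions `[AB]`, `.`, and `A{2}` of the pattern, so walking this tree would recover the position list without writing a parser from scratch.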

Any thoughts?

Hans
  • I wrote a Code Golf on this some time ago (http://stackoverflow.com/questions/3523323/code-golf-regex-parser). Being code golf, most answers will be a little hard to decipher. But the same problem came up, and people much more intelligent than I'll ever be found a way. :-) – Platinum Azure Nov 23 '10 at 19:53
  • I offered an answer to your likelihood question, and now I see you have a second question about the existence of a regex parser. There must be at least one such parser, the one Python uses. You'll probably get your answer from a Python expert if you clearly label your question as being about Python internals. – Narveson Nov 28 '10 at 13:34
  • Have you looked at this example on the pyparsing wiki? http://pyparsing.wikispaces.com/file/view/invRegex.py – PaulMcG Nov 29 '10 at 11:32

2 Answers


Simulation rather than calculation may be the way to go.

Set up a population of representative text strings. (Linguists would call such a set a corpus.) For any given regex, find the number of strings it matches, and divide by the total number of strings in your corpus.

Your own example giving the likelihood of '[AB]' as 1/13 is based on this way of thinking, using the corpus of single-capital-letter strings. You got 1/13 by seeing that there are two matches out of the 26 strings in the corpus.

Create a larger corpus: maybe the set of all alphanumeric strings up to a certain length, or all ASCII strings up to a certain length, or the dictionary of your choice. Thinking about what corpus best suits your purpose is a good way to clarify what you mean by "likelihood".
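A minimal sketch of the idea in Python (the corpus, the full-match anchoring, and the function name are my own choices, not something the question prescribes):

```python
import re
import string

def likelihood(pattern, corpus):
    """Fraction of corpus strings fully matched by the regex."""
    rx = re.compile(pattern)
    return sum(1 for s in corpus if rx.fullmatch(s)) / len(corpus)

# Corpus of all single-capital-letter strings:
corpus = list(string.ascii_uppercase)
print(likelihood('[AB]', corpus))  # 2/26, i.e. the 1/13 from the question
```

Swapping in a larger corpus (all two-letter strings, a dictionary file, ...) changes nothing in the code, only the population you are estimating over.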

Narveson
  • thanks for this interesting approach! However, I was interested in ways to parse complex regexes rather than the scoring (which should be more or less trivial, as my set of letters is quite limited and their individual probabilities are known)... – Hans Nov 24 '10 at 10:15

You can use ['A', 'B'] to mean "A or B". Then you can write something like this:

'[{'A', ['A', 'B']}, {'A', 'B'}]'

Here [] means "one of these" and {} means "all of these".

1/2 for '{'A', ['A', 'B']}':
   'A' => 1/1
   ['A', 'B'] => 1/2
   (1/1) * (1/2) = 1/2
   this (1/2) times the outer (1/2) = 1/4
1/2 for '{'A', 'B'}' -> 1/26 for each letter.
Multiply the two: 1/(26^2), and multiply by the 1/2: (1/(26^2))/2.

Now multiply both: (1/4) * ((1/(26^2))/2)

That was a bad explanation... let me retry:

[] => calculate the probability as {the probability of each term} / {the number of terms}
{} => calculate the probability of each term and multiply them all

Does that make sense?
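One possible reading of those two rules as code (my own interpretation: every bare letter scores 1/26, a list is "one of these" and averages its members, and the ordered "all of these" group is modeled as a tuple, since Python sets are unordered — note this uses 1/26 even where the worked example above used 1/1):

```python
def score(node, alphabet=26):
    # list => "one of these": average of the members' probabilities
    if isinstance(node, list):
        return sum(score(n, alphabet) for n in node) / len(node)
    # tuple => "all of these": multiply the members' probabilities
    if isinstance(node, tuple):
        p = 1.0
        for n in node:
            p *= score(n, alphabet)
        return p
    # a bare letter matches 1 of the 26 capitals
    return 1.0 / alphabet

print(score(('A', 'B')))                       # 1/(26^2), as above
print(score([('A', ['A', 'B']), ('A', 'B')]))  # the full expression
```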

Guilherme Bernal
  • Thanks for sharing this, however my problem is rather how to parse the complex regex and get at the individual positions of the regex, rather than how to score each individual position... – Hans Nov 24 '10 at 10:17