
I have a list of regexes in string form (created after parsing natural language text which were search queries). I want to use them for searching text now. Here is how I am doing it right now-

# given that regex_list=["r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'", "r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'"....]
sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I)
    if mo:
        print(mo.group())

What I need is to be able to use these string regexes, but also have Python's raw literal notation on them, as we all should for regex queries. Now about these expressions - I have natural text search commands like -

LINE_CONTAINS foo(+)

Which I use pyparsing to convert to regex like r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))' based on a grammar. I send a list of these human rules to the pyparsing code and it gives me back a list of ~100 of these regexes. These regexes are constructed in string format.

This is the MCVE version of the code that generates these strings that are supposed to act as regexes -

from pyparsing import *
import re


def parse_hrr(received_sentences):
    UPTO, AND, OR, WORDS, CHARACTERS = map(Literal, "UPTO AND OR WORDS CHARACTERS".split())
    LBRACE,RBRACE = map(Suppress, "{}")
    integer = pyparsing_common.integer()

    LINE_CONTAINS, PARA_STARTSWITH, LINE_ENDSWITH = map(Literal,
        """LINE_CONTAINS PARA_STARTSWITH LINE_ENDSWITH""".split()) # put option for LINE_ENDSWITH. Users may use, I don't presently
    keyword = UPTO | WORDS | AND | OR | LINE_CONTAINS | PARA_STARTSWITH  # only names defined above


    class Node(object):
        def __init__(self, tokens):
            self.tokens = tokens

        def generate(self):
            pass

    class LiteralNode(Node):
        def generate(self):
            return "(%s)" %(re.escape(''.join(self.tokens[0]))) # here, merged the elements, so that re.escape does not have to do an escape for the entire list
        def __repr__(self):
            return repr(self.tokens[0])

    class ConsecutivePhrases(Node):
        def generate(self):
            join_these=[]
            tokens = self.tokens[0]
            for t in tokens:
                tg = t.generate()
                join_these.append(tg)
            seq = []
            for word in join_these[:-1]:
                if (r"(([\w]+\s*)" in word) or (r"((\w){0," in word): #or if the first part of the regex in word:
                    seq.append(word + "")
                else:
                    seq.append(word + "\s+")
            seq.append(join_these[-1])
            result = "".join(seq)
            return result

    class AndNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            join_these=[]
            for t in tokens[::2]:
                tg = t.generate()
                tg_mod = tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)' # to place the regex commands at the right place
                join_these.append(tg_mod)
            joined = ''.join(ele for ele in join_these)
            full = '('+ joined+')'
            return full

    class OrNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            joined = '|'.join(t.generate() for t in tokens[::2])
            full = '('+ joined+')'
            return full

    class LineTermNode(Node):
        def generate(self):
            tokens = self.tokens[0]
            ret = ''
            dir_phr_map = {
                'LINE_CONTAINS': lambda a:  r"((?:(?<=[\W_])" + a + r"(?=[\W_]|$))456", #%gs1, sent, re.I)",
                'PARA_STARTSWITH':
                    lambda a: ("r'(^" + a + "(?=[\W_]|$))' 457") if 'gene' in repr(a) #%gs1, s, re.I)"
                    else ("r'(^" + a + "(?=[\W_]|$))' 458")} #,s, re.I
            for line_dir, phr_term in zip(tokens[0::2], tokens[1::2]):
                ret = dir_phr_map[line_dir](phr_term.generate())
            return ret

## THE GRAMMAR
    word = ~keyword + Word(alphas, alphanums+'-_+/()')
    some_words = OneOrMore(word).setParseAction(' '.join, LiteralNode)
    phrase_item = some_words

    phrase_expr = infixNotation(phrase_item,
                                [
                                (None, 2, opAssoc.LEFT, ConsecutivePhrases),
                                (AND, 2, opAssoc.LEFT, AndNode),
                                (OR, 2, opAssoc.LEFT, OrNode),
                                ],
                                lpar=Suppress('{'), rpar=Suppress('}')
                                ) # structure of a single phrase with its operators

    line_term = Group((LINE_CONTAINS|PARA_STARTSWITH)("line_directive") +
                      (phrase_expr)("phrases")) # basically giving structure to a single sub-rule having line-term and phrase

    line_contents_expr = line_term.setParseAction(LineTermNode)
###########################################################################################
    mrrlist=[]
    for t in received_sentences:
        t = t.strip()
        try:
            parsed = line_contents_expr.parseString(t)
        except ParseException:
            continue  # skip rules that do not match the grammar
        temp_regex = parsed[0].generate()
        mrrlist.append(temp_regex)
    return mrrlist

So basically, the code is stringing together the regex. Then I add the necessary parameters like re.search, %gs1 etc. to get the complete regex search query. I want to be able to use these string regexes for searching, so I had earlier thought eval() would convert each string to its corresponding Python expression, which is why I used it - I was wrong.

TL;DR - I basically have a list of strings that have been created in the source code, and I want to be able to use them as regexes, using Python's raw literal notation.

user1993
  • What's preventing you from changing `eval(string_regex)` to `string_regex[2:-1]`? And why are you generating your regex with a `r'` prefix in the first place? Currently it's entirely unclear why you need `eval`. Perhaps go into more detail about how you're creating your regex patterns. – Aran-Fey Jun 17 '17 at 07:44
  • @Rawing, that worked. I had been trying with `string_regex[1:-1]` all this while. But doing `[2:-1]` removes the `r` from the beginning, does it mean the regex won't be treated as a raw literal? – user1993 Jun 17 '17 at 07:50
  • Also, I am pretty certain this question is not the same as the question mentioned above (whose duplicate this question supposedly is). That question asks about using a variable within a regex, but I don't have that problem at all – user1993 Jun 17 '17 at 07:52
  • That's right, it won't be treated as a raw literal. In fact it won't be treated as a literal at all. It'll be treated as a string. What you're doing is calling `eval` on a string to turn it into a different string. There are easier and safer ways to do that, namely string operations. – Aran-Fey Jun 17 '17 at 07:56
  • I'm sure we can re-open your question if you explain in detail why you believe you need `eval`. How do you generate `string_regex`? It's not a constant like in the code snippet you posted, is it? – Aran-Fey Jun 17 '17 at 07:57
  • @Rawing, I have done that. Please have a look – user1993 Jun 17 '17 at 08:10
  • You must be doing something different than what you're showing, or using `eval` doesn't make any sense. You could just remove the outer quotation marks and skip the `eval`. Try `string_regex=r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'` You'd only need to `eval` if you were putting the `%gs1` in the string. – Blckknght Jun 17 '17 at 08:23
  • @Blckknght, thank you for your response. As I mention in the question edit, removing just the outer quotes using `re.search(string_regex[1:-1] %gs1,sent,re.I)` does not work. However `string_regex[2:-1]` works, but it's basically getting rid of the `r` at the beginning which is crucial since I want the regex to be treated as a raw literal – user1993 Jun 17 '17 at 08:39
  • What you're doing is like doing `string_regex = eval("r'\t'")` instead of just `string_regex = r'\t'`. It doesn't make any sense. – Aran-Fey Jun 17 '17 at 08:41
  • @Rawing, yes I think eval may not be helping here - I have changed the title and question accordingly. But what *can* help? – user1993 Jun 17 '17 at 08:48
  • Actually @Rawing, your example is not a good one, since `eval("r'\t'")` gets a real tab character in it (while `r'\t'` has just a backslash and a lowercase "T"). That's not the case with the string in the question though, as the regex pattern doesn't contain any escaped characters that Python cares about. – Blckknght Jun 17 '17 at 08:50
  • @user1993: I think the issue is that you seem to think that a raw string is something other than a normal string. It's not, it's just a way to write a string literal without Python converting escape sequences like `\t` or `\n` into special characters. Since regex patterns often use lots of backslashes for their own purposes (escaping things that have meaning in regex), it's considered a best practice to always use raw string literals when defining patterns. But you still just get a normal string in the end. – Blckknght Jun 17 '17 at 08:53
  • @Blckknght, ok so 1) are you suggesting that it will be fine even if the regex I search for is not read as raw literal (i.e. with having the `r` in front of it) but as normal. And 2) is there a way to have the `r` placed before the regex in my case above, given that the regex queries are variables(i.e. elements of a list)? – user1993 Jun 17 '17 at 09:03
  • Some string literals with backslashes in them are fine even if they are not written as raw strings. The patterns you've shown in your code are examples (though many other regex patterns may not work right without extra backslashes if they're not raw strings... The `\b` and `\1` escape codes are notorious for causing issues). But I still think you're not understanding raw string literals properly. By the time the strings are in the list, they should have already been parsed. By putting extra quotation marks around your literals, you're messing everything up. Don't do that! – Blckknght Jun 17 '17 at 09:18
  • I suggest you forget everything about `eval` and solve this problem at its root - your list shouldn't be in the form `["r'a'", "r'b'", ...]`, it should be `['a', 'b', ...]`. I ask once again, _where do these regexes come from_? Are they hardcoded into your program? Are you reading them from a file? And it would be a good idea to discuss this [in chat](http://chat.stackoverflow.com/rooms/6/python). – Aran-Fey Jun 17 '17 at 09:36
  • @Blckknght, you wrote- `By the time the strings are in the list, they should have already been parsed`. Let me tell you how these strings are formed. I take a regex query in English, identify the various units it has, and use them to knit together a string. [Code](https://pastebin.com/NEetnkPN) - I don't know if it helps. So, now these fabricated strings are sent back to search on the text. Since these queries have all kinds of things like `{},(),[]`([list here](https://pastebin.com/zLjJQcgH)) I was thinking it important to have the raw literalization – user1993 Jun 17 '17 at 09:40
  • Change all lines in your file from `r'something'` to just `something` and remove the `eval` from your code, and everything will work. – Aran-Fey Jun 17 '17 at 09:42

1 Answer


Your issue seems to stem from a misunderstanding of what raw string literals do and what they're for. There's no magic raw string type. A raw string literal is just another way of creating a normal string. A raw literal just gets parsed a little bit differently.

For instance, the raw string r"\(foo\)" can also be written "\\(foo\\)". The doubled backslashes tell Python's regular string parsing algorithm that you want an actual backslash character in the string, rather than the backslash in the literal being part of an escape sequence that gets replaced by a special character. The raw string algorithm doesn't need the extra backslashes since it never replaces escape sequences.

However, in this particular case the special treatment is not actually necessary, since \( and \) are not meaningful escape sequences in a Python string. When Python sees an unrecognized escape sequence, it just includes it literally (backslash and all). So you could also use "\(foo\)" (without the r prefix) and it would work just fine too, although Python 3.6 and later emit a warning for such sequences.

It's generally not a good idea to rely on backslashes being ignored, however, since if you edit the string later you might inadvertently add an escape sequence that Python does understand (when you really wanted the raw, un-transformed version). Since regex syntax has a number of its own escape sequences that are also escape sequences in Python (but with different meanings, such as \b and \1), it's a best practice to always write regex patterns with raw strings to avoid introducing issues when editing them.
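To make the point about raw literals concrete, here is a minimal sketch. The variable names and the sample text are illustrative only; the patterns are the ones discussed above.

```python
import re

# A raw literal and an escaped regular literal produce the same str value.
raw = r"\(foo\)"
escaped = "\\(foo\\)"
assert raw == escaped
assert type(raw) is str          # there is no separate "raw string" type

# Where the two notations differ: "\b" is a backspace character,
# while r"\b" is a backslash followed by the letter b.
assert "\b" == "\x08" and len("\b") == 1
assert r"\b" == "\\b" and len(r"\b") == 2

# Only the raw version reaches the regex engine as a word boundary.
text = "a foo in the town"
raw_match = re.search(r"\bfoo\b", text)
plain_match = re.search("\bfoo\b", text)   # searches for literal backspaces
print(raw_match is not None, plain_match is None)
```

The `r` prefix only changes how the source code is read; once the literal has been parsed, nothing about the resulting string records how it was written.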

Now to bring this around to the example code you've shown. I have no idea why you're using eval at all. As far as I can tell, you've mistakenly wrapped extra quotes around your regex patterns for no good reason, and you're using eval to undo that wrapping. But because only the inner strings use raw string syntax, by the time you eval them it's too late to stop Python's string parsing from mangling your literals if they contain any of the troublesome escape sequences (the outer string will already have parsed \b, for instance, and turned it into the ASCII backspace character \x08).
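A small sketch of why the wrapping is too late to help, using a made-up `\bfoo\b` pattern rather than one from the question (the question's patterns happen to contain no escapes that Python itself interprets):

```python
import re

# The outer literal is NOT raw, so Python turns each \b into a backspace
# (\x08) while parsing it; the inner r'...' is just text at that point.
wrapped = "r'\bfoo\b'"
assert "\x08" in wrapped

# Undoing the wrapper (by slicing here; eval would see the same characters)
# cannot bring the lost backslashes back.
unwrapped = wrapped[2:-1]
assert unwrapped == "\x08foo\x08"

# A raw literal written directly keeps the backslashes for the regex engine.
direct = r"\bfoo\b"
sent = "in this file we have the case of a foo in the town"
broken_match = re.search(unwrapped, sent)   # looks for literal backspaces
good_match = re.search(direct, sent)
print(broken_match, good_match.group())
```

Under these assumptions the unwrapped pattern finds nothing, while the directly written raw literal matches `foo`.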

You should tear the eval code out and fix your literals to avoid the extra quotes. This should work:

import re

regex_list=[r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))',   # use raw literals, with no extra quotes!
            r'((?<=[\W_])(activation\ of\ %s)(?=[\W_]|$))'] # unnecessary backslashes?

sent='in this file we have the case of a foo(+) in the town'
gs1='foo'
for string_regex in regex_list:
    mo=re.search(string_regex %gs1,sent,re.I)    # no eval here!
    if mo:
        print(mo.group())

This example works for me (it prints foo(+)). Note that you've got some unnecessary backslashes in your second pattern (before the spaces). Those are harmless, but might add even more confusion to a complicated subject (regexes are notoriously hard to understand).

Blckknght
  • thanks for the detailed answer, I understood most of it. However, here is the catch. I am getting the `regex_list` from another file. In that file, I string together the regexes of the form`r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'` and since they are strings, they are represented in the `regex_list` as `"r'((?<=[\W_])(%s\(\+\))(?=[\W_]|$))'"` i.e. with the extra quotes. Now I can easily remove these 2 extra quotes using string_regex[1:-1], but then the `r` remains as a literal `r`, and doesn't do its intended purpose. So how do I add the `r` at the beginning of the list elements as you have done? – user1993 Jun 17 '17 at 10:11
  • Well, your other code should not be using those extra quotation marks when it generates the strings. If it's dealing with text with lots of backslashes, it might want to use raw strings itself internally, but there's no reason for it to put raw string literals inside other strings. – Blckknght Jun 17 '17 at 19:43
  • ok then, let us suppose that I design regexes without the quotation marks, so like - `((?<=[\W_])(%s\(\+\))(?=[\W_]|$))`. So let's say all regexes in the `regex_list` are of this kind. Now I want to search text using these, and I want to use Python's raw string notation for it (i.e. prefacing the regex with a `r`). How can I do that? – user1993 Jun 21 '17 at 08:19
  • If you've already got a string, there's no need for a raw literal. Raw strings only exist in source code. After it's parsed, it's just a string. If you're reading a string from a file, it's just a string from the moment you get it. If you're getting it from another module, *maybe* that other module should use raw strings in building it up, but after they've done so, it's just a string and you don't need to do anything special with it. – Blckknght Jun 21 '17 at 11:44
  • *If you're getting it from another module*, yes that is what is happening. I send strings to a file which parses the string according to a grammar and creates a (different) string out of it which is later supposed to act as a regex. I have added an MCVE version of that regex-making code in the question. How do you suggest doing what you said-*maybe that other module should use raw strings in building it up*? – user1993 Jun 21 '17 at 12:49
  • You're already mostly doing what you should in that respect in the MCVE. Code like `tg[0]+r'?=.*\b'+tg[1:][:-1]+r'\b)'` is using raw string literals exactly how they are supposed to be used (to tell Python not to interpret `\b` itself). The only place you seem to be doing it differently is in the logic for `PARA_STARTSWITH` where you're creating a nested string with a number at the end. If you don't need the number, I'd make it `lambda a: (r'(^' + a + r'(?=[\W_]|$))')`. The raw literals there are not strictly necessary, but good practice in case you add a `\b` or similar sequence later. – Blckknght Jun 21 '17 at 18:25
  • thank you for your comment. Actually the number at the end, in the case of both PARA_STARTSWITH and LINE_CONTAINS(had forgotten to add for this one earlier), has a special purpose - it is a marker for me, to know in post-processing of the regex, as to what are the parameters of the regex query. If the regex ends with 456, it means search in the line (as opposed to searching in the paragraph) etc.. and then I remove this ending part from the regexes. What difference is the number causing? – user1993 Jun 21 '17 at 18:49
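The advice in the comments above (build each pattern from raw literals and never wrap it in an extra pair of quotes) might look like the sketch below. The function names `line_contains` and `para_startswith` are hypothetical stand-ins for the two lambdas in the MCVE, and returning the search scope as a separate tuple element instead of a trailing 456/457 marker is an assumption made for illustration, not something the thread prescribes.

```python
import re

def line_contains(phrase_regex):
    # Raw literals keep the backslash in \W for the regex engine.
    # The scope label is returned alongside the pattern (assumed layout).
    return "line", r"((?<=[\W_])" + phrase_regex + r"(?=[\W_]|$))"

def para_startswith(phrase_regex):
    return "para", r"(^" + phrase_regex + r"(?=[\W_]|$))"

# Mirrors what LiteralNode.generate() builds for the phrase "foo(+)".
phrase = "(" + re.escape("foo(+)") + ")"
scope, pattern = line_contains(phrase)

sent = "in this file we have the case of a foo(+) in the town"
mo = re.search(pattern, sent, re.I)   # plain str, no eval and no slicing
print(scope, mo.group())

# The same applies to the paragraph-start variant.
para_scope, para_pattern = para_startswith("(" + re.escape("in this") + ")")
para_mo = re.search(para_pattern, sent, re.I)
```

Because the functions return ordinary strings, the calling code can pass them straight to `re.search` without any post-processing of quotes or prefixes.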