Python regex: having trouble understanding results

Question

I have a dataframe that I need to write to disk but pyspark doesn't allow any of these characters ,;{}()\\n\\t= to be present in the headers while writing as a parquet file.

So I wrote a simple script to detect if this is happening

import re
for each_header in all_headers:
  print(re.match(",;{}()\\n\\t= ", each_header))

But for each header, None was printed. This is wrong because I know my file has spaces in its headers. So, I decided to check it out by executing the following couple of lines

a = re.match(",;{}()\\n\\t= ", 'a s')
print(a)
a = re.search(",;{}()\\n\\t= ", 'a s')
print(a)

This too resulted in None getting printed.

I am not sure what I am doing wrong here.

PS: I am using python3.7

Your regex match those characters all together. Have you tried the token `[` and `]` `([,;{}\(\)\\n\\t=])` ? — Dorian Turba, May 22 '19 at 10:37
To the above two comments, we _don't_ need to double escape things like `\n` in a character class, just one backslash is sufficient. And `(){}` do not require any escaping at all. — Tim Biegeleisen, May 22 '19 at 10:38
`print((a.groups()))` You can see the words that are matching your pattern — shaik moeed, May 22 '19 at 10:38

score 2 · Accepted Answer · answered May 22 '19 at 10:35

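The problem is that, without square brackets, the pattern is matched as one literal sequence of characters rather than as a set of alternatives. In addition, {} and () are regex metacharacters and have a special meaning outside a character class. Perhaps the easiest way to write your logic would be to use a character class: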

[,;{}()\n\t=]

This matches any single one of the literal characters that PySpark does not allow in the headers.

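# re.search scans the whole string; re.match only tests the start, so it returns None for 'a s'.
# A space is added to the class so that the space in 'a s' is detected.
a = re.search(r"[ ,;{}()\n\t=]", 'a s')
print(a)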

If you wanted to remove these characters, you could try using re.sub:

header = '...'
header = re.sub(r'[,;{}()\n\t=]+', '', header)
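
To make the difference concrete, here is a minimal sketch. It assumes the forbidden set also includes a space (as in the question's pattern), and the header strings are made up for illustration:

```python
import re

# Without brackets the pattern is one literal sequence, not a set of alternatives.
sequence = re.compile(r",;{}()\n\t= ")
# With brackets, any single listed character matches (space included by assumption).
char_class = re.compile(r"[ ,;{}()\n\t=]")

header = "a s"  # hypothetical header containing a space

print(sequence.search(header))            # None: the full sequence never occurs
print(char_class.match(header))           # None: match() only tests position 0
print(char_class.search(header))          # match object for the space at index 1
print(char_class.sub("_", "my col(1)"))   # my_col_1_
```

Replacing with an underscore instead of an empty string is only one possible choice; it keeps word boundaries visible in the cleaned header.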

score 1 · Answer 2 · Valdi_Bo · 2019-05-22T10:52:28.763

If you want to check whether a text contains any of the "forbidden" characters, you have to put them between [ and ].

Another flaw in your regex is that in "normal" strings (not r-strings) any backslash should be doubled.

So change your regex to:

"[,;{}()\\n\\t= ]"

Or use r-string:

r"[,;{}()\n\t= ]"

Note that I also included a space, which you missed.

One more remark: {} and () have a special meaning, but only outside [...]. Between [ and ] they represent themselves, so they do not need to be escaped with a backslash.
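
As a rough check of the points above, the sketch below compares the doubled-backslash and r-string spellings and then scans a list of headers. The header names are hypothetical, and `all_headers` simply reuses the variable name from the question:

```python
import re

doubled = "[,;{}()\\n\\t= ]"   # normal string with doubled backslashes
raw = r"[,;{}()\n\t= ]"        # raw string spelling
assert doubled == raw           # both produce the same regex text

pattern = re.compile(raw)

# A normal string with single backslashes also works here: Python turns \n and \t
# into real newline and tab characters, which the class then matches literally.
single = re.compile("[,;{}()\n\t= ]")
assert single.search("a\nb") is not None

# Hypothetical header names, used only for illustration.
all_headers = ["id", "first name", "total(usd)", "rate=pct", "tab\there"]
bad_headers = [h for h in all_headers if pattern.search(h)]
print(bad_headers)  # ['first name', 'total(usd)', 'rate=pct', 'tab\there']
```

The r-string form is usually the easier one to read, since the pattern appears exactly as the regex engine sees it.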

edited May 22 '19 at 10:52

answered May 22 '19 at 10:40

Valdi_Bo


Right. Thanks. I have edited the question to include the space – Clock Slave May 22 '19 at 15:32

score 1 · Answer 3 · answered May 22 '19 at 10:56

As already explained, you can use a regex to look for forbidden characters. I want to add that you can also do it without a regex, in the following way:

forbidden = ",;{}()\n\t="
def has_forbidden(txt):
    for i in forbidden:
        if i in txt:
            return True
    return False
print(has_forbidden("ok name")) # False
print(has_forbidden("wrong=name")) # True
print(has_forbidden("with\nnewline")) # True

Note that with this approach you do not have to care about escaping characters that are special in regexes, such as *.
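
A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement: it uses `any()` for detection and `str.translate` for removal. Unlike the code above, it assumes a space is also forbidden, and the names `FORBIDDEN` and `strip_forbidden` are made up for this example:

```python
# Assumption: a space is treated as forbidden too, unlike in the answer's version.
FORBIDDEN = " ,;{}()\n\t="

def has_forbidden(txt):
    """Return True if txt contains at least one forbidden character."""
    return any(ch in FORBIDDEN for ch in txt)

# Translation table that deletes every forbidden character.
_REMOVE = str.maketrans("", "", FORBIDDEN)

def strip_forbidden(txt):
    """Return txt with all forbidden characters removed."""
    return txt.translate(_REMOVE)

print(has_forbidden("ok_name"))        # False
print(has_forbidden("wrong=name"))     # True
print(strip_forbidden("total (usd)"))  # totalusd
```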
