I have a dataframe that I need to write to disk, but PySpark doesn't allow any of the characters ,;{}()\\n\\t= in the headers when writing a Parquet file.

So I wrote a simple script to detect whether this is happening:

import re
for each_header in all_headers:
  print(re.match(",;{}()\\n\\t= ", each_header))

But None was printed for every header. This is wrong, because I know my file has spaces in its headers. So I checked by running the following lines:

a = re.match(",;{}()\\n\\t= ", 'a s')
print(a)
a = re.search(",;{}()\\n\\t= ", 'a s')
print(a)

This too resulted in None getting printed.

I am not sure what I am doing wrong here.

PS: I am using Python 3.7.

Clock Slave

3 Answers

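The problem is that {} and also () are regex metacharacters with a special meaning, and without square brackets the whole pattern is read as one fixed sequence of characters rather than a set of alternatives. Perhaps the easiest way to write your logic would be to use a character class:

[ ,;{}()\n\t=]

This matches any single one of the literal characters which PySpark does not allow in the headers (a space is included because your headers contain spaces). Use re.search rather than re.match, since re.match only looks at the start of the string:

a = re.search(r"[ ,;{}()\n\t=]", 'a s')
print(a)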

If you wanted to remove these characters, you could try using re.sub:

header = '...'
header = re.sub(r'[,;{}()\n\t=]+', '', header)
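Putting the two steps together, here is a minimal sketch for checking and cleaning a whole list of headers. The names FORBIDDEN, find_bad_headers and clean_header are made up for illustration, the sample headers are invented, and the space in the character class is an assumption based on the asker's remark about spaces in headers.

```python
import re

# Character class of forbidden characters; the leading space is an assumption.
FORBIDDEN = re.compile(r"[ ,;{}()\n\t=]")

def find_bad_headers(headers):
    """Return the headers that contain at least one forbidden character."""
    return [h for h in headers if FORBIDDEN.search(h)]

def clean_header(header, replacement="_"):
    """Replace each run of forbidden characters with `replacement`."""
    return re.sub(r"[ ,;{}()\n\t=]+", replacement, header)

all_headers = ["id", "first name", "price(usd)", "a=b"]  # invented sample data
print(find_bad_headers(all_headers))           # ['first name', 'price(usd)', 'a=b']
print([clean_header(h) for h in all_headers])  # ['id', 'first_name', 'price_usd_', 'a_b']
```

Replacing with an underscore instead of an empty string is only one possible choice; it keeps words separated but can leave a trailing underscore, as in the third header.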
Tim Biegeleisen

If you want to check whether a text contains any of the "forbidden" characters, you have to put them between [ and ].

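Another point concerns backslashes: in "normal" strings (not r-strings), a backslash meant for the regex engine should be doubled. For \n and \t the single-backslash form also works, because Python turns them into the actual newline and tab characters, which the regex then matches literally.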

So change your regex to:

"[,;{}()\\n\\t= ]"

Or use r-string:

r"[,;{}()\n\t= ]"

Note that I also included a space, which your list of characters leaves out.

One more remark: {} and () have a special meaning only outside [...]. Between [ and ] they represent themselves, so they need no escaping with a backslash.
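To make the difference concrete, here is a small sketch comparing the pattern with and without brackets on the string from the question. The variable names sequence and char_class are only illustrative.

```python
import re

header = "a s"

# Without brackets the pattern is a sequence: it matches only when the
# characters ",;{}" + newline + tab + "= " appear together, in that order.
sequence = re.compile(r",;{}()\n\t= ")

# With brackets it is a character class: any single listed character matches.
char_class = re.compile(r"[,;{}()\n\t= ]")

print(sequence.search(header))    # None
print(char_class.match(header))   # None, because match() only tries position 0
print(char_class.search(header))  # a match object for the space at index 1
```

This also shows why the choice between re.match and re.search matters here: a forbidden character in the middle of a header is found only by re.search.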

Valdi_Bo

As already explained, you can use a regex to look for forbidden characters. I want to add that you can also do it without a regex, in the following way:

forbidden = ",;{}()\n\t="
def has_forbidden(txt):
    for i in forbidden:
        if i in txt:
            return True
    return False
print(has_forbidden("ok name")) # False
print(has_forbidden("wrong=name")) # True
print(has_forbidden("with\nnewline")) # True

Note that with this approach you do not have to care about escaping regex special characters such as *.
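If it is also useful to know which characters were found, a set-based variant is one option. This is only a sketch: forbidden_in is a hypothetical helper name, and the set uses the same characters as above (add a space to it if spaces should be rejected too).

```python
FORBIDDEN = set(",;{}()\n\t=")  # same characters as above; no space included

def forbidden_in(txt):
    """Return the forbidden characters present in txt, in sorted order."""
    return sorted(FORBIDDEN.intersection(txt))

print(forbidden_in("ok name"))     # []
print(forbidden_in("wrong=name"))  # ['=']
print(forbidden_in("a(b)"))        # ['(', ')']
```

An empty list is falsy, so the result can still be used directly in an if statement.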

Daweo