
I have some invalid characters in my file that I'm trying to remove. But I ran into a strange problem with one of them.

When I try to use the replace function, I get the error `SyntaxError: EOL while scanning string literal`.

I found that I was dealing with `\x1d`, the group separator control character. I have this code to remove it:

import pandas as pd 

df = pd.read_csv('C:/Users/tkp/Desktop/Holdings_Download/dws/example.csv',index_col=False, sep=';', encoding='utf-8')

print(df['col'][0])

df = df['col'][0].encode("utf-8").replace(b"\x1d", b"").decode()
df = pd.DataFrame([x.split(';') for x in df.split('\n')])

print(df[0][0])

Output: [screenshot of the printed value, not reproduced]

Is there another way to do this? It seems to me that I couldn't have done it any worse than this.

TylerH
Tomasz Przemski
  • Does this answer your question? [How to remove special characers from a column of dataframe using module re?](https://stackoverflow.com/questions/33257344/how-to-remove-special-characers-from-a-column-of-dataframe-using-module-re) – Tomerikoo Feb 03 '21 at 10:28
  • Can you include a reproducible example? – Axe319 Feb 03 '21 at 11:32
  • It looks like you have some sort of character encoding issue. If I were a betting person, I'd bet that that strange character is supposed to be an "ö", so that the whole thing becomes "Coöperatiev" (seems to be common in Dutch). Could you check what byte values are actually in the corresponding line of the CSV file? – Ture Pålsson Feb 03 '21 at 15:53
  • What is so bad about the way you did it? Please explain what you consider a "better" solution to look like, in *objective* terms. – TylerH Feb 03 '21 at 17:20
  • @Ture Pålsson 10 bytes, including the tenth invisible. – Tomasz Przemski Feb 03 '21 at 17:40
  • @TylerH My point is that my solution seems clunky, since in reality I am dealing with a much larger CSV file that contains other nonprintable characters as well. You first need to use `encode()` to see the wrong characters at all, then `decode()`, and finally recreate the DataFrame. – Tomasz Przemski Feb 03 '21 at 17:46
  • Then it appears to be a matter of opinion (e.g. you're looking for 'elegance'). If this is the full script, you may have better luck asking on Code Review instead. – TylerH Feb 04 '21 at 14:41

2 Answers


Notice that you are getting a SyntaxError. This means that Python never gets as far as actually running your program, because it can't figure out what the program is!

To be honest, I'm not quite sure why this happens in this case, but using "exotic" characters in string constants is always a bit iffy: it makes you dependent on the character encoding of the source file and puts you at the mercy of all sorts of buggy editors. Therefore, I would recommend using the `'\uXXXX'` syntax to explicitly write the Unicode code point of the character you wish to replace. (It looks like what you have here is U+2194 LEFT RIGHT ARROW, so `'\u2194'` should do it.)

Having said that, I would first verify that this is actually the problem, by changing the '↔' bit to something more mundane, like 'x' and seeing whether that causes the same error. If it does, then your problem is somewhere else...
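As a minimal sketch of the escape-sequence approach, assuming the offending character really is the group separator `\x1d` reported in the question (the sample string below is hypothetical, not taken from the asker's file):

```python
# "\x1d" and "\u001d" are two spellings of the same character
# (U+001D, the group separator); both keep the source file pure ASCII.
GROUP_SEPARATOR = "\x1d"
assert GROUP_SEPARATOR == "\u001d"

# Hypothetical sample value standing in for the asker's data.
raw = "Co\x1doperatiev"

# Plain str.replace works on the decoded string; no encode()/decode() needed.
cleaned = raw.replace(GROUP_SEPARATOR, "")

print(repr(raw))   # repr() makes the otherwise invisible character visible
print(cleaned)     # Cooperatiev
```

Because the literal contains only the escape, the editor never has to store or display the control character itself.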

Ture Pålsson
  • Nothing shows up in the CSV file itself, but the character does appear in the XML and after pasting into the Python editor. I also noticed that this character is treated as a string spanning multiple lines, because using triple quotation marks moves part of the string to a new line. – Tomasz Przemski Feb 03 '21 at 11:11

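You have to specify the encoding in which this character is defined. Note that `DataFrame.replace` does not accept an `encoding` argument, so the encoding belongs in the `pd.read_csv` call; the replacement itself needs `regex=True` to match the character inside cell values rather than whole cells:

df = df.replace('#', '', regex=True)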
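For the larger file with other nonprintable characters mentioned in the comments, one possible sketch is to strip a whole class of control characters in a single pass. The sample data and the `CONTROL_CHARS` pattern are assumptions for illustration; which characters count as unwanted depends on the actual file.

```python
import pandas as pd

# Hypothetical data; the asker's real file is not available.
df = pd.DataFrame({"col": ["Co\x1doperatiev", "a\x00b"], "n": [1, 2]})

# Assumed definition of "unwanted": ASCII control characters
# except tab, newline and carriage return.
CONTROL_CHARS = r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]"

# regex=True makes replace() match inside string cells;
# non-string columns are left untouched.
cleaned = df.replace(CONTROL_CHARS, "", regex=True)

print(cleaned["col"].tolist())
```

This works on the DataFrame directly, so there is no need to encode, decode, and rebuild it by splitting on the delimiter.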
Decoder