
In my input data, it seems that when the logger that collected the data runs on low power, it introduces special characters like "v" or "@".

I loaded the dataset with pandas read_csv(engine='python'). I had to use engine='python', because otherwise I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 2: invalid start byte
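Reading with an explicit single-byte encoding also avoids this error, since byte 0xb6 is a valid Latin-1 character; a minimal sketch, with a placeholder filename:

import pandas as pd

# Byte 0xb6 decodes in latin-1 (as '¶'), so no UnicodeDecodeError is raised.
# 'logger_data.csv' is a placeholder name.
df = pd.read_csv('logger_data.csv', encoding='latin-1')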

I found how to exclude special characters from a pandas dataframe:

df.var_a = df.var_a.str.replace('[#,@,&,�,{,v,?]','')

This replaces special characters like "@" with "", but the dataframe also contains the special character "\x", and I can't remove it with the code shown above; instead I get this error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 24-25: truncated \xXX escape

How can I remove "\x" from a pandas dataframe?

eetiaro
  • 342
  • 1
  • 3
  • 14
  • Are you trying to remove a Unicode sequence? Use `\X`, which matches any valid Unicode sequence, including line breaks (that is, `\X` instead of `\x`). – SeaBean Jun 16 '21 at 08:11
  • Does this answer your question? [Confused about backslashes in regular expressions](https://stackoverflow.com/questions/33582162/confused-about-backslashes-in-regular-expressions) – smci Jun 16 '21 at 10:59
  • @smci The issue is not about a missing backslash but about the wrong use of regex meta-characters. – SeaBean Jun 16 '21 at 11:01
  • Ok @SeaBean but then the Unicode is due to reading in the dataset with the wrong `pd.read_csv(... encoding=...)`. That's also a known duplicate. Can you please find the dupe target and close as duplicate? – smci Jun 16 '21 at 11:03
  • @smci According to the OP's comment, the issue is also not caused by a wrong encoding in read_csv. It is owing to: "the logger with which the data were collected runs on low power, it introduces some special characters". – SeaBean Jun 16 '21 at 11:05
  • 2
    eetiaro and @SeaBean: that hugely important point is buried in a comment hidden under SeaBean's answer. It needs to be in the question. eetiaro please edit your question to add that missing information (don't put important stuff deep in comments; they don't get seen, searched or indexed, and often get deleted (ephemeral)). If not, any pandas user would take it it was the well-known "read in dataset using wrong encoding". – smci Jun 16 '21 at 11:14

2 Answers


Note that \x matches a hexadecimal character in the format \xYY, e.g. \x20.

A note on \X (capital letter X instead of lower case letter x): in PCRE-style engines it matches any valid Unicode sequence (an extended grapheme cluster). However, Python's built-in re module, which pandas str.replace uses, does not support \X and raises a "bad escape" error (the third-party regex module does support it). And since \X matches every character sequence, replacing all of its matches with '' would empty the column anyway, so for this problem you should target the specific unwanted characters instead.

If you want to remove characters by hexadecimal value, you have to spell out the specific values, e.g. to remove the hex value \x20 (a space), you can use:

df.var_a = df.var_a.str.replace(r'\x20', '', regex=True)

You can also remove a whole range of hex values at once by putting the range inside a regex character class.
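For example, a minimal sketch that strips the entire ASCII control-character range (it is an assumption that the logger's stray bytes fall in this range):

# Hex escapes work inside a character class, including as range endpoints,
# so control characters \x00 through \x1f (plus \x7f) are removed in one pass.
df.var_a = df.var_a.str.replace(r'[\x00-\x1f\x7f]', '', regex=True)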

Side note to your existing regex:

Note that you don't need commas to separate the special characters in the regex character class; as written, the commas themselves become part of the set of characters to remove.

Another point to note: besides the symbols in your list of characters to exclude, there is a character that looks like the letter v. Are you sure you want to remove the letter v? Or is that v a special symbol that is displayed incorrectly here?

Edit

If you want to check which elements are not numeric, you can use .str.isnumeric():

df['var_a'].loc[~df['var_a'].str.isnumeric()]
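Keep in mind that .str.isnumeric() is False for strings containing a decimal point, so every float entry would be flagged as well. A sketch of an alternative check with pd.to_numeric, which turns unparseable entries into NaN:

# errors='coerce' converts entries that cannot be parsed as numbers into NaN,
# so filtering on .isna() lists exactly the offending strings.
bad = df['var_a'].loc[pd.to_numeric(df['var_a'], errors='coerce').isna()]
print(bad)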

Edit2

If you are sure that every character other than the digits 0-9 and the decimal point . should be removed, you can use the following:

df.var_a = df.var_a.str.replace(r'[^0-9.]', '', regex=True)
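After that cleanup, the float conversion should go through, assuming each entry still holds exactly one parseable number:

# With only digits and decimal points left, the cast no longer raises
# "could not convert string to float".
df.var_a = df.var_a.astype(float)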
SeaBean
  • 22,547
  • 3
  • 13
  • 25
  • Thank you for your detailed answer. Basically, I want to convert a column to float. It seems that when the logger with which the data were collected runs on low power, it introduces some special characters, like "v". I only want numbers. Your suggested solution produces no error message, but when I then try to convert the column to float I still get "ValueError: could not convert string to float:", and it does not show me which string could not be converted. How can I find the strings that can't be converted? – eetiaro Jun 16 '21 at 10:35
  • @eetiaro See my edit above for the way to find the non-numeric string entries. – SeaBean Jun 16 '21 at 10:46
  • @eetiaro See my second edit, which helps you clean up non-floating-point characters. To play it safe, you can use my first edit to look at the non-numeric entries before you clean them up with the code in the second edit. – SeaBean Jun 16 '21 at 10:58
  • @eetiaro How is it going with your non-numeric fields check? Were my two edits helpful? – SeaBean Jun 16 '21 at 14:57

Try this one:

df.var_a = df.var_a.str.replace(r'\\x', '')
hmars
  • 147
  • 1
  • 2
  • 13
  • You don't need double backslashes with a raw string. – SeaBean Jun 16 '21 at 08:24
  • It executes without an error message, but when I afterwards try df.var_a.astype(float) I get this error message: ValueError: could not convert string to float: '\x11596558.00' – eetiaro Jun 16 '21 at 10:25
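For the record: the '\x11' in that error is a single control character (U+0011), not a backslash followed by the letter x, which is why the pattern r'\\x' (a literal backslash plus x) never matches anything. A sketch that strips the control character itself and then converts, assuming all stray bytes are ASCII control characters:

# '\x11' is one character; target the control-character range directly,
# then the float cast succeeds (e.g. '\x11596558.00' becomes 596558.0).
df.var_a = df.var_a.str.replace(r'[\x00-\x1f]', '', regex=True)
df.var_a = df.var_a.astype(float)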