I have a 129 MB CSV file with 849,275 rows and 18 columns. I'm trying to read it into a pandas DataFrame with `read_csv`.

When I use `encoding='cp1252'`:

```python
import pandas as pd

read_file = pd.read_csv('myfile.csv', encoding='cp1252')
```
The error is quite long, but it ultimately says this at the bottom:

```
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 41:
character maps to <undefined>
```
When I specify no encoding, `encoding='utf-8'`, or `encoding='utf-8-sig'`, I get:

```
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 65:
invalid start byte
```
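For completeness, these are the exact calls that fail with that error:

```python
# Each of these raises the UnicodeDecodeError shown above:
pd.read_csv('myfile.csv')                          # no encoding specified
pd.read_csv('myfile.csv', encoding='utf-8')
pd.read_csv('myfile.csv', encoding='utf-8-sig')
```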
QUESTION:

I am fine with deleting these problematic characters altogether. Better yet would be to normalize them to plain ASCII (code points below 128). How can I do this using JUST pandas? I'm looking for the most pandas-like way, if one exists.
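To make the goal concrete, this is roughly the shape of solution I'm picturing. It's only a sketch: I don't know whether these are the idiomatic options (`encoding_errors` apparently requires a reasonably recent pandas), and `'problem_col'` is a stand-in name for the offending column.

```python
import pandas as pd

# Rough sketch, not a known-good solution.
# encoding_errors='replace' swaps undecodable bytes for U+FFFD so the
# read itself can't fail (the parameter was added in pandas 1.3).
df = pd.read_csv('myfile.csv', encoding='cp1252', encoding_errors='replace')

# Then squeeze the column down to ASCII with the .str accessor:
df['problem_col'] = (
    df['problem_col']
    .str.normalize('NFKD')                 # decompose accents, e.g. é -> e + combining mark
    .str.encode('ascii', errors='ignore')  # drop everything non-ASCII
    .str.decode('ascii')
)
```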
Not to overkill this question, but here's a list of the characters (with their ordinals) in the one column I'm fairly certain is causing the problem. A sketch of how I pulled this list follows the table.
```
Character  Ord
32
! 33
" 34
# 35
$ 36
% 37
& 38
' 39
( 40
) 41
* 42
+ 43
, 44
- 45
. 46
/ 47
0 48
1 49
2 50
3 51
4 52
5 53
6 54
7 55
8 56
9 57
: 58
; 59
< 60
= 61
> 62
? 63
@ 64
A 65
B 66
C 67
D 68
E 69
F 70
G 71
H 72
I 73
J 74
K 75
L 76
M 77
N 78
O 79
P 80
Q 81
R 82
S 83
T 84
U 85
V 86
W 87
X 88
Y 89
Z 90
[ 91
\ 92
] 93
^ 94
_ 95
` 96
a 97
b 98
c 99
d 100
e 101
f 102
g 103
h 104
i 105
j 106
k 107
l 108
m 109
n 110
o 111
p 112
q 113
r 114
s 115
t 116
u 117
v 118
w 119
x 120
y 121
z 122
{ 123
| 124
} 125
~ 126
129
143
157
160
¡ 161
¢ 162
£ 163
§ 167
¨ 168
© 169
« 171
¬ 172
® 174
° 176
± 177
² 178
³ 179
´ 180
µ 181
· 183
¸ 184
¹ 185
º 186
¼ 188
½ 189
¾ 190
× 215
ß 223
à 224
á 225
â 226
ã 227
ä 228
å 229
æ 230
ç 231
è 232
é 233
ì 236
í 237
î 238
ï 239
ð 240
ñ 241
ó 243
ô 244
ö 246
ú 250
û 251
ü 252
š 353
Ž 381
ƒ 402
– 8211
— 8212
‘ 8216
’ 8217
‚ 8218
“ 8220
” 8221
„ 8222
† 8224
• 8226
… 8230
‹ 8249
› 8250
€ 8364
™ 8482
```
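For reference, I built the list above with something along these lines. This is a rough sketch of the inventory step, not the exact code I ran (the permissive `encoding_errors='replace'` read and the `'problem_col'` name are stand-ins):

```python
import pandas as pd

# Load permissively so decoding can't blow up, then inventory every
# distinct character in the suspect column along with its code point.
df = pd.read_csv('myfile.csv', encoding='cp1252', encoding_errors='replace')

chars = sorted(set(''.join(df['problem_col'].dropna().astype(str))))
for c in chars:
    print(c, ord(c))
```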