I have a php (7.4) function which produces a large (200k lines, 10mb) csv file which internal users can download and manipulate.
If the user parses the file with python or some other script then they see all the data.
If the user opens Excel, selects import from csv, then specifies '65001: Unicode (UTF-8)' then they see all the data.
Unfortunately users click on the file in the download manager or file explorer and Excel tries to parse the file as '1252: Western European (Windows)' - in this case they get about 99% of the data with a few corrupt lines thrown in.
I want to remove the problem data at source and continue generating csv so as not to break downstream processes.
This python script identifies the problem records:
#! /usr/bin/env python3
import re
with open('file.csv') as f:
c = 0
d = 0
ct = {}
for l in f:
x = re.sub('[ -~]', '', l.rstrip())
if x != "":
d += 1
print("c:{} '{}' '{}'".format(c,x,l.rstrip()))
for y in x:
print(ord(y), hex(ord(y)), y.encode('utf-8'))
if ord(y) in ct:
ct[ord(y)] = ct[ord(y)] + 1
else:
ct[ord(y)] = 1
c += 1
print("Found {} problem lines out of {} total".format(d, c))
print(ct)
For example:
c:19286 '⿬⿬' '..."Come back an⿬~⿬ back"...'
226 0xe2 b'\xc3\xa2'
191 0xbf b'\xc2\xbf'
172 0xac b'\xc2\xac'
226 0xe2 b'\xc3\xa2'
191 0xbf b'\xc2\xbf'
172 0xac b'\xc2\xac'
How do I match and replace/remove ⿬
in php?
I'm trying the below, but it isn't doing anything:
$new = str_replace(
array(
"\u{00e2}",
"\u{00c3}\u{00a2}",
"\u{00e2}\u{00bf}\u{00ac}",
"\u{00c3}\u{00a2}\u{00c2}\u{00bf}\u{00c2}\u{00ac}",
),
array(
"#e2#",
"#c3a2#",
"#e2/bf/ac#",
"#c3a2/c2bf/c2ac#",
),
$text
);
EDIT:
Corrected the 226 code point to e2.
This didn't solve the issue.
⿬
is the representation from python, excel renders it as a white square box containing a question mark (not a diamond)