Remove unicode symbols from csv so Excel can import on double click

Question

I have a php (7.4) function which produces a large (200k lines, 10mb) csv file which internal users can download and manipulate.

If the user parses the file with python or some other script then they see all the data.

If the user opens Excel, selects import from csv, then specifies '65001: Unicode (UTF-8)' then they see all the data.

Unfortunately users click on the file in the download manager or file explorer and Excel tries to parse the file as '1252: Western European (Windows)' - in this case they get about 99% of the data with a few corrupt lines thrown in.

I want to remove the problem data at source and continue generating csv so as not to break downstream processes.

This python script identifies the problem records:

#! /usr/bin/env python3

import re

with open('file.csv') as f:
    c = 0
    d = 0
    ct = {}
    for l in f:
        x = re.sub('[ -~]', '', l.rstrip())
        if x != "":
            d += 1
            print("c:{} '{}' '{}'".format(c,x,l.rstrip()))
            for y in x:
                print(ord(y), hex(ord(y)), y.encode('utf-8'))
                if ord(y) in ct:
                    ct[ord(y)] = ct[ord(y)] + 1
                else:
                    ct[ord(y)] = 1
        c += 1
print("Found {} problem lines out of {} total".format(d, c))
print(ct)

For example:

c:19286 'â¿¬â¿¬' '..."Come back anâ¿¬~â¿¬ back"...'
226 0xe2 b'\xc3\xa2'
191 0xbf b'\xc2\xbf'
172 0xac b'\xc2\xac'
226 0xe2 b'\xc3\xa2'
191 0xbf b'\xc2\xbf'
172 0xac b'\xc2\xac'

How do I match and replace/remove â¿¬ in php?

I'm trying the below, but it isn't doing anything:

        $new = str_replace(
            array(
                "\u{00e2}",
                "\u{00c3}\u{00a2}",
                "\u{00e2}\u{00bf}\u{00ac}",
                "\u{00c3}\u{00a2}\u{00c2}\u{00bf}\u{00c2}\u{00ac}",
            ),
            array(
                "#e2#",
                "#c3a2#",
                "#e2/bf/ac#",
                "#c3a2/c2bf/c2ac#",
            ),
            $text
        );

EDIT: Corrected the 226 code point to e2. This didn't solve the issue. â¿¬ is the representation from python, excel renders it as a white square box containing a question mark (not a diamond)

Seems like you've just gotten the wrong codepoints. Is [this](https://3v4l.org/CVf9Q) what you had in mind? — El_Vanja, Mar 22 '21 at 16:49
Sadly not, I think php is seeing a single character, but so far I've failed to reverse engineer it from what python is telling me — ChrisG, Mar 22 '21 at 17:27
*I want to remove the problem data at source*: [create files with BOM](https://stackoverflow.com/a/9047876/3439404) which Excel understands… — JosefZ, Mar 22 '21 at 17:51

Remove unicode symbols from csv so Excel can import on double click

0 Answers0