I am reading a CSV file in Node.js that contains URLs. I want to detect when a string contains the character � (U+FFFD, the Unicode replacement character) or any other character that is not valid UTF-8.

Wrong URL I want to be able to detect:

'https://example.com/v�hicules-de-location/france'

Right URL I want to ignore:

'https://example.com/véhicules-de-location/france'

Is there an easy way with JavaScript to do that?

Álvaro
  • Usually, when you get that character, you're reading the file with the wrong encoding. There is no reliable way to recover what the original character was. Instead, you should use the correct encoding when reading the file, and everything else will work fine as it is. – Gregorio Palamà Nov 06 '21 at 14:53
  • I know, but that's not my question. I know they uploaded a wrongly encoded file; that's why I want to be able to detect it. – Álvaro Nov 06 '21 at 14:59
  • I do not control the file. The symbol is already written like that in the CSV, so no encoding mechanism is going to turn a `�` into a `é`. – Álvaro Nov 06 '21 at 15:08
  • The question is how you read the file. Can you post that code as well? – gaitat Nov 06 '21 at 15:11
  • No, the question is how to detect a wrong UTF-8 character. – Álvaro Nov 06 '21 at 15:12
  • @gaitat it does not really matter how I read it if the file already contains � before I even read it, right? – Álvaro Nov 06 '21 at 15:14
  • Yes, you fixed the issue by introducing what is, in my opinion, ugly code. In addition, you don't know whether a different undesirable character will appear in your input file tomorrow. – gaitat Nov 06 '21 at 15:20
  • Try `Readable.setEncoding('utf8')` – gaitat Nov 06 '21 at 15:21
  • @gaitat but again what good is that going to do if the file already comes with �? – Álvaro Nov 06 '21 at 15:27
  • Prefer the solution of @MaReAL below – gaitat Nov 06 '21 at 20:59

1 Answer

The file already contains the � character; I can see it when I open the file in a text editor. Once the � is there, nothing can be done to decode it back into the original character.

In the end I am doing this, because the 'corrupted' character already arrives like that and is not something I can control.

....

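import csv from 'csv-parser'
import { Readable } from 'stream'

export default async (buffer) => {
  const json = []

  return new Promise((resolve, reject) => {
    Readable.from(buffer)
      .pipe(csv())
      .on('data', (data) => {
        // U+FFFD is the replacement character (�) left behind by a bad decode
        const corrupted = Object.values(data).some((entry) => /\uFFFD/.test(entry))

        // Return early so a corrupted row is never collected
        if (corrupted) return reject('encoding')

        json.push(data)
      })
      // Surface parser/stream failures instead of leaving the promise pending
      .on('error', reject)
      .on('end', () => {
        resolve(writeFile ....)
      })
  })
}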
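
The check above only catches a � that is already stored in the file. If the file instead contains bytes that are not well-formed UTF-8 (for example, a Latin-1 `é`), a strict decode can flag that too. Here is a minimal sketch of both checks, assuming a Node.js version where `TextDecoder` is available as a global; the helper names `hasReplacementChar` and `isInvalidUtf8` are made up for illustration and are not part of any library.

```javascript
// Hypothetical helpers for illustration; TextDecoder and Buffer are the only
// built-in APIs used here.
const REPLACEMENT_CHAR = '\uFFFD'

// True when already-decoded text contains the replacement character (�).
function hasReplacementChar(text) {
  return text.includes(REPLACEMENT_CHAR)
}

// True when the raw bytes are not well-formed UTF-8.
// With { fatal: true } the decoder throws instead of silently inserting �.
function isInvalidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes)
    return false
  } catch {
    return true
  }
}

const good = 'https://example.com/véhicules-de-location/france'
const bad = 'https://example.com/v\uFFFDhicules-de-location/france'

console.log(hasReplacementChar(good)) // false
console.log(hasReplacementChar(bad)) // true

// The same URL written out as Latin-1 bytes is not valid UTF-8.
console.log(isInvalidUtf8(Buffer.from(good, 'latin1'))) // true
```

Note that a � saved in the file as UTF-8 is itself valid UTF-8, so the strict decode alone would not catch the case in the question; that is why both checks are shown.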
Álvaro