Split string with accent and query without accent

Question

I want to split a string that contains accents, using a query that does not contain accents.

This is my code for the moment:

const sanitizer = (text: string): string => {
  return text
    .normalize("NFD")
    .replace(/\p{Diacritic}/gu, "")
    .toLowerCase();
};

const splitter = (text: string, query: string): string[] => {
  const regexWithQuery = new RegExp(`(${query})|(${sanitizer(query)})`, "gi");

  return text.split(regexWithQuery).filter((value) => value);
};

And this is the test file:

import { splitter } from "@/utils/arrayHelpers";

describe("arrayHelpers", () => {
  describe("splitter", () => {
    const cases = [
      {
        text: "pepe dominguez",
        query: "pepe",
        expectedArray: ["pepe", " dominguez"],
      },
      {
        text: "pépé dominguez",
        query: "pepe",
        expectedArray: ["pépé", " dominguez"],
      },
      {
        text: "pepe dominguez",
        query: "pépé",
        expectedArray: ["pepe", " dominguez"],
      },
      {
        text: "pepe dominguez",
        query: "pe",
        expectedArray: ["pe", "pe", " dominguez"],
      },
      {
        text: "pepe DOMINGUEZ",
        query: "DOMINGUEZ",
        expectedArray: ["pepe ", "DOMINGUEZ"],
      },
    ];

    it.each(cases)(
      "should split the text around the query and keep the original characters",
      ({ text, query, expectedArray }) => {
        // When I call the splitter function
        const textSplitted = splitter(text, query);

        // Then I must get the expected array
        expect(textSplitted).toStrictEqual(expectedArray);
      }
    );
  });
});

The problem is with the second case:

{
  text: "pépé dominguez",
  query: "pepe",
  expectedArray: ["pépé", " dominguez"],
}

because the sanitized query is still pepe, which does not occur in pépé dominguez. I don't know how to make the splitter function return ['pépé', ' dominguez'] in this case.

I'm looking for a result built from the original text, not from the sanitized text.

Usually you don't remove diacritics, but replace them with other letters, e.g. `.replace('é', 'e')`. https://stackoverflow.com/questions/286921/efficiently-replace-all-accented-characters-in-a-string - Justinas, Oct 27 '21 at 08:14
The sanitizer function does this job, I think. But I don't want to sanitize the result - Adri HM, Oct 27 '21 at 08:18

MauriceNino · Accepted Answer · 2021-10-27T08:36:28.840

The only option that comes to my mind is to keep a map of possible options for your letters and then build the query dynamically:

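// Get query with each letter being one of its options
const sanitizeQuery = (query) => {
  // Extend this map with every letter/accent pair you need to support
  const sanitizerMap = {
    'e': ['é']
  }

  // Escape characters that are special in a regex (e.g. "." or "(")
  const escapeRegExp = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

  return query
    .split('')
    .map(l =>
      sanitizerMap[l] !== undefined
        ? `(?:${l}|${sanitizerMap[l].join('|')})`
        : escapeRegExp(l)
    )
    .join('');
}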

// Split text by a sanitized query
const splitter = (text, query) => {
  const regexWithQuery = new RegExp(`(${sanitizeQuery(query)})`, "gi");

  return text.split(regexWithQuery).filter((value) => value);
};

// Test
const query = 'pepe';
console.log('Query Regex:', sanitizeQuery(query));
console.log('Output:', splitter('pépé dominguez', query));

You can optimize this by putting the options for each letter in a string instead of an array.
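The string-based variant could look roughly like the sketch below. It is not the answer's code. The contents of `accentOptions` and the helper names `baseLetterOf` and `escapeRegExp` are assumptions made for illustration, and the sketch assumes the text uses precomposed (NFC) accented characters. Each letter becomes a character class such as `[eèéêë]`, and accented letters in the query are first mapped back to their base letter, so an accented query can also match unaccented text.

```typescript
// Hypothetical map: base letter -> string of accented variants (extend as needed).
const accentOptions: Record<string, string> = {
  a: "àáâãä",
  e: "èéêë",
  i: "ìíîï",
  o: "òóôõö",
  u: "ùúûü",
};

// Map an accented letter back to its base letter; other characters are returned unchanged.
const baseLetterOf = (letter: string): string => {
  for (const base of Object.keys(accentOptions)) {
    if (accentOptions[base].indexOf(letter) !== -1) return base;
  }
  return letter;
};

// Escape characters that have a special meaning in a regular expression.
const escapeRegExp = (text: string): string =>
  text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Turn each letter of the query into a character class of its variants.
const sanitizeQuery = (query: string): string =>
  query
    .toLowerCase()
    .split("")
    .map((letter) => {
      const base = baseLetterOf(letter);
      return accentOptions[base] !== undefined
        ? `[${base}${accentOptions[base]}]`
        : escapeRegExp(base);
    })
    .join("");

// Split the original text around every match and drop empty pieces.
const splitter = (text: string, query: string): string[] => {
  const regexWithQuery = new RegExp(`(${sanitizeQuery(query)})`, "gi");
  return text.split(regexWithQuery).filter((value) => value);
};
```

Under these assumptions, `splitter("pépé dominguez", "pepe")` returns `["pépé", " dominguez"]` and `splitter("pepe dominguez", "pépé")` returns `["pepe", " dominguez"]`. Because the regex has a single capturing group, no non-capturing groups are needed.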

Hint: `?:` in the regex makes a group non-capturing. Without it, every single letter matched by an inner group would also end up in the output array. Read more about it here: What is a non-capturing group in regular expressions?
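A small sketch of the difference, using a hand-written pattern for the query pepe (the variable names are illustrative only):

```typescript
const text = "pépé dominguez";

// Inner groups are capturing: split() also emits what each inner group matched.
const capturing = text.split(/(p(e|é)p(e|é))/gi);
// ["", "pépé", "é", "é", " dominguez"]

// Inner groups are non-capturing: only the outer group (the whole match) is emitted.
const nonCapturing = text.split(/(p(?:e|é)p(?:e|é))/gi);
// ["", "pépé", " dominguez"]

console.log(capturing, nonCapturing);
```

The leading empty string appears because the match starts at index 0. The `.filter((value) => value)` call in splitter removes it.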
