I have a basic index.html file that I load with cheerio, modify the content of a tag, then rewrite the index.html.

My issue is that index.html contains a tag whose href includes an '&' character, which cheerio turns into '&amp;' when it loads the file. As a result, index.html is rewritten with an incorrect href in this tag.

original :

<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500&display=swap" rel="stylesheet">

after being loaded by cheerio :

<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500&amp;display=swap" rel="stylesheet">

I know that it is normal for cheerio to output special characters as HTML entities, and that it's not an issue in most cases, but here it modifies a URL and so breaks it.

I read that setting decodeEntities: false when loading should return undecoded text and avoid this issue, but it has no effect:

const $ = cheerio.load(data, { decodeEntities: false });

Any clue on how to force cheerio to not transform special characters into HTML entities?

1 Answer

This may be a version difference. It appears that the decodeEntities option is supported by the htmlparser2 library that Cheerio uses in versions before 1.0, but not by the parse5 library that is its default for HTML in recent versions.

To use htmlparser2 for HTML, you can pass the xml option, which makes Cheerio select htmlparser2 instead of parse5, and then tell htmlparser2 not to actually use its XML mode:

const $ = cheerio.load(data, { xml: { xmlMode: false, decodeEntities: false }});

If this solves the issue, you may want to add a comment with a link to the docs to explain what's going on here.
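
If the option has no effect in your Cheerio version, one possible fallback is to post-process the serialized string before writing it back. The following is only a minimal sketch under stated assumptions: restoreAmpersands is a hypothetical helper (not part of Cheerio), it only handles double-quoted href and src attributes, and it assumes none of those URLs is meant to contain a literal '&amp;'.

```javascript
// Hypothetical helper, not a Cheerio API: undo `&amp;` escaping inside
// double-quoted href/src attribute values of already-serialized HTML.
// Text content and other attributes are left untouched.
function restoreAmpersands(html) {
  return html.replace(
    /(\s(?:href|src)=")([^"]*)(")/g,
    (match, open, value, close) => open + value.replace(/&amp;/g, '&') + close
  );
}

// Example input: the tag as Cheerio serialized it in the question.
const serialized =
  '<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300;400;500&amp;display=swap" rel="stylesheet">';

console.log(restoreAmpersands(serialized));
```

In the original flow, this would be applied to the string returned by $.html() just before it is written to index.html. It is a blunt string rewrite rather than a parser-level fix, so the parser option above is preferable when it works.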

Wander Nauta