
I need to convert some HTML text that comes from a database. The strings I get look like this:

<p style=\"font-size: 10px;\">\n<strong>Search for:<\/strong> <span style=\"color:#888888;\">2 to 15 People, \u00b120$ Per Person, Informal, Available on Date<\/span>\n<\/p>

And I need to put this in proper HTML. Something like this:

<p style="font-size: 10px;">
<strong>Search for:</strong> <span style="color:#888888;">2 to 15 People, &plusmn;20$ Per Person, Informal, Available on Date</span>
</p>

There are several issues here. First, the slashes: I'm calling stripcslashes before stripslashes, so the C-style escapes like "\n" are converted first and the quote escapes are removed afterwards. But this mangles the Unicode escape sequences, such as the ± sign (\u00b1).

I've searched online, and json_decode seems to be the usual trick for this, but I can't use it here because of the kind of string I'm working with. This is just an example; the real strings are full HTML pages.

Does anyone have any hints how I can tackle this?

This is what I'm currently using:

$final = urlencode(stripslashes(stripcslashes(html_entity_decode($html, ENT_COMPAT, 'UTF-8'))));

It gets me an almost perfect HTML page, except for the unicode characters like \u00b1

SOLUTION

I ended up using the solution given by Lawrence Cherone

$new_html = str_replace(array('\"', '\/', '&quot;', '\n'), array('"', '/', '\'', "\n"), $old_html);

function unicode_convert($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}

$new_html = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', "unicode_convert", $new_html);
ItsJustMe
  • Please show your processing code **as an edit of your question** (then delete your comments). Are we talking about the innerHTML, the tag attributes, or both? Please remove any "noise" from your sample string. We prefer a [mcve]. What is the exact data coming from the db, and what is your exact preferred output? – mickmackusa Sep 08 '20 at 22:27
  • Sorry, I've edited the main question with what I'm currently using. – ItsJustMe Sep 08 '20 at 22:30
  • I've edited the question with a more simple example and what the output should end up with – ItsJustMe Sep 08 '20 at 22:35
  • Why are the backslashes there in the first place? That's undoubtedly a bug somewhere. – Brad Sep 08 '20 at 22:35
  • Relevant: https://stackoverflow.com/a/7810880/362536 – Brad Sep 08 '20 at 22:35
  • @Brad, I believe it has to do with the code that is being used to save the HTML in the database. The problem is I can't change that. – ItsJustMe Sep 08 '20 at 22:38
  • Seems relevant: https://stackoverflow.com/a/2934602/156811 – Havenard Sep 08 '20 at 22:40
  • I suppose this is a job for `json_decode("\"$raw\"")`, why is the "type of string you're using" a problem here? – Havenard Sep 08 '20 at 22:45
  • Well when I use that I simply get a NULL string returned. – ItsJustMe Sep 08 '20 at 22:56
  • Is this data not properly stored in the database using a prepared statement? It feels like you need to repair scripting earlier in the flow. – mickmackusa Sep 08 '20 at 23:06
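The comments above leave open why the json_decode approach returned NULL. A minimal sketch for finding out, assuming PHP 5.5 or later (json_last_error_msg() does not exist on older versions, including the 5.2 server mentioned later in this thread). The helper name try_json_unescape is hypothetical. The simplified sample decodes cleanly this way; a full page will typically fail if it contains an unescaped double quote, a literal control character, or invalid UTF-8.

```php
<?php
// Hypothetical helper: wrap the raw text in quotes so it parses as a JSON
// string, and report the decoder's error message when that fails.
function try_json_unescape($raw)
{
    $decoded = json_decode('"' . $raw . '"');
    if ($decoded === null) {
        // A quoted JSON string never decodes to null, so null means failure.
        return array(null, json_last_error_msg());
    }
    return array($decoded, null);
}

// Simplified sample from the question: only valid JSON escapes, so it decodes.
$sample = '<p style=\"font-size: 10px;\">\n<strong>Search for:<\/strong> \u00b120$<\/p>';
list($html, $error) = try_json_unescape($sample);

// An unescaped double quote ends the JSON string early, so decoding fails.
list($badHtml, $badError) = try_json_unescape('<p class="x">text<\/p>');
```

If $error is non-null for the real pages, the message narrows down which part of the input is not valid JSON string content.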

1 Answer

If I understand you correctly, you simply want to replace \" with " and \/ with /, and perhaps a few other sequences.

You could use str_replace() and pass arrays of search and replacement strings, so that several escape sequences are replaced in one call.

Edit: to decode the Unicode escapes, this uses the preg_replace_callback code from the answer to "How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?"

<?php
$str = '<p style=\"font-family: &quot;Open Sans&quot;, sans-serif; font-size: 10px; color: rgb(60, 170, 80); line-height: 150%; text-align: right; padding-top: 0px; padding-bottom: 0px; margin: 0px; overflow: hidden;\"><strong>Search for:<\/strong> <span style=\"color:#888888;\">2 to 15 People, \u00b120$ Per Person, Informal, Available on Date<\/span>\n<\/p>';

echo preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', function ($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE');
}, str_replace(['\"', '\/', '&quot;', '\n'], ['"', '/', '\'', "\n"], $str));

Result:

<p style="font-family: 'Open Sans', sans-serif; font-size: 10px; color: rgb(60, 170, 80); line-height: 150%; text-align: right; padding-top: 0px; padding-bottom: 0px; margin: 0px; overflow: hidden;"><strong>Search for:</strong> <span style="color:#888888;">2 to 15 People, ±20$ Per Person, Informal, Available on Date</span>
</p>

https://3v4l.org/fTtLh
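The callback above decodes each \uXXXX on its own as UCS-2BE, which covers characters such as ±. If the stored pages might also contain characters written as two-escape surrogate pairs (an assumption; the sample has none), one possible variation is to match a whole run of consecutive escapes and decode it as UTF-16BE. The function names below are hypothetical, and a named callback is used so the sketch does not rely on closures.

```php
<?php
// Hypothetical callback: decode one run of consecutive \uXXXX escapes.
function decode_escape_run($match)
{
    // Drop every literal "\u", leaving only the hex digits of the run.
    $hex = str_replace('\\u', '', $match[0]);
    // The digits are big-endian UTF-16 code units; surrogate pairs are
    // combined by the UTF-16BE decoder.
    return mb_convert_encoding(pack('H*', $hex), 'UTF-8', 'UTF-16BE');
}

// Hypothetical wrapper: replace all \uXXXX runs in $text with UTF-8 text.
function decode_unicode_escapes($text)
{
    return preg_replace_callback('/(?:\\\\u[0-9a-fA-F]{4})+/', 'decode_escape_run', $text);
}

$plusMinus = decode_unicode_escapes('\u00b120$ Per Person');
$emoji     = decode_unicode_escapes('\ud83d\ude00');
```

Like the original pattern, this does not distinguish an escaped backslash followed by "u" from a real escape, so it is only a sketch for input shaped like the sample.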

Lawrence Cherone
  • And how would you go about converting the unicode characters to html entities? – ItsJustMe Sep 08 '20 at 22:37
  • No problem. You'll probably need to add many more entries to str_replace, as your original snippet had `&quot;`, and decoding it would turn it into `"`, breaking the DOM because of the quote before it. Side note: you shouldn't really output this as-is, since most things are not going to work if the page has JavaScript or external resources, not to mention XSS or frame-busting code. – Lawrence Cherone Sep 08 '20 at 23:08
  • Due to this server using php 5.2 (i know i know) I'll need to convert that to a static function. I'll reply back in a bit. – ItsJustMe Sep 08 '20 at 23:10
  • It worked! I had to convert it to PHP 5.2 but it worked. `$new_html = str_replace(array('\"', '\/', '&quot;', '\n'), array('"', '/', '\'', "\n"), $old_html); function unicode_convert($match){ return mb_convert_encoding(pack('H*', $match[1]), 'UTF-8', 'UCS-2BE'); } $new_html = preg_replace_callback('/\\\\u([0-9a-fA-F]{4})/', "unicode_convert", $new_html);` Thank you very much. – ItsJustMe Sep 08 '20 at 23:41
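The earlier comment asking how to turn the decoded characters into HTML entities (for example ± into &plusmn;) is not answered in the thread. One possible sketch, assuming the text is already UTF-8 and PHP 5.4 or later (the ENT_HTML401 flag is not available on 5.2): take PHP's own entity table, drop the markup characters so the tags survive, and apply it with strtr(). The function name is hypothetical.

```php
<?php
// Hypothetical helper: replace characters that have a named HTML 4.01 entity
// with that entity, while leaving the markup characters & < > and quotes alone.
function non_ascii_to_named_entities($html)
{
    // Map of UTF-8 character => entity, e.g. "±" => "&plusmn;".
    $table = get_html_translation_table(HTML_ENTITIES, ENT_NOQUOTES | ENT_HTML401, 'UTF-8');
    // Keep the existing tags and entities intact.
    unset($table['&'], $table['<'], $table['>']);
    return strtr($html, $table);
}

$encoded = non_ascii_to_named_entities("<span>\xC2\xB120\$ Per Person</span>");
```

Characters without a named entity are left as UTF-8, which is fine as long as the page is served with a UTF-8 charset.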