2

This is a follow-up to my last question here. The answer posted there does not actually work. So here is the challenge. You are given this code (assume jQuery is included):

<input type=text>
<script>
    $("input").val(**YOUR PHP / JS CODE HERE**);
</script>

Using jQuery - and not by injecting PHP output directly into the input tag - faithfully reproduce ANY text from the database in the input tag. If the database field says </script>, the field should say that too. If it has Chinese in it, double quotes, whatever, reproduce that too. Assume your PHP variable is called $text.

Here are some of my failed attempts.

1)

$("input").val("<?= htmlentities($text); ?>");

FAILURE: The HTML entities show up literally in the text field instead of the characters they stand for.
INPUT: $text = "Déjà vu"
OUTPUT: Field contains literal D&eacute;j&agrave; vu

2)

$("input").val(<?= json_encode($text); ?>);

This was suggested as the answer in my last question, and I naively accepted it. However...
FAILURE: json_encode only works with UTF-8 characters.
INPUT: $text = "Va e de här fö frågor egentlien"
OUTPUT: Field is blank, because json_encode returns null.

3)

var temp = $("<div></div>").html("<?= htmlentities($text); ?>");
$("input").val(temp.html());

This was my most promising solution for the weird characters, except...
FAILURE: Does not encode some characters (not sure exactly which, don't care)
INPUT: $text = "</script> Déjà"
OUTPUT: Field contains &lt;/script&gt; Déjà

4) Suggested in answers

$("input").val(unescape("<?= urlencode($text); ?>"));

FAILURE: Spaces remain encoded as +'s.

$("input").val(unescape("<?= rawurlencode($text); ?>"));

Almost works. All previous input succeeds, but multibyte stuff, like kanji, remain encoded. decodeURIComponent also doesn't like multibyte characters.

Note that for me, things like strip_tags are not an option. Everything must be allowed. People are authoring quizzes with this, and if someone wants to make a quiz that tests your knowledge of HTML, so be it. Also, unfortunately I cannot just inject the htmlentities escaped text into the value field of the input tags. These tags are generated dynamically, and I would have to totally tear down my current javascript code structure to do it that way.

I feel like I'm SOL here. Please show me how wrong I am.

EDIT

Assume the user initially entered </script> Déjà här fö frågor 漢字 into the db. This would be stored (you would see it in phpMyAdmin) as </script> Déjà här fö frågor &#28450;&#23383;

Tesserex
  • I don't understand what's wrong with solution #3? Why do you care that certain characters are not encoded, as long as the browser handles them properly? – Dolph Jul 02 '10 at 01:18
  • You should not have posted another question if the previous one was not solved... – Reigel Gallarde Jul 02 '10 at 01:20
  • because users will type ` – Tesserex Jul 02 '10 at 01:22
  • Why not just use a rich edit control instead of plaintext? – Anon. Jul 02 '10 at 01:25
  • Give an example of a value of `$text` for which `$("input").val(unescape("<?= rawurlencode($text); ?>"));` fails. – Artefacto Jul 02 '10 at 02:09
  • `$text = "</script> Déjà här fö frågor &#28450;&#23383;"`. That's what the database stores. So technically it's correct in reproducing what the db sees. But it needs to re-encode the last two multibyte characters into kanji. – Tesserex Jul 02 '10 at 02:23

6 Answers

1

You need to encode in PHP, and decode in JavaScript...

PHP's rawurlencode():

echo rawurlencode("</script> Déjà");
//result: %3C%2Fscript%3E%20D%C3%A9j%C3%A0

JavaScript's decodeURIComponent():

var encoded = "%3C%2Fscript%3E%20D%C3%A9j%C3%A0";
alert(decodeURIComponent(encoded));
//result: </script> Déjà
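
As a rough sketch of the decoding side, assuming the PHP string was UTF-8 before rawurlencode() was applied: decodeURIComponent handles multibyte sequences such as kanji, but it throws a URIError when the percent-encoded bytes are not valid UTF-8 (Latin-1 bytes, for example). That is one plausible source of a "malformed URI" complaint. The helper name decodeFromPhp is hypothetical.

```javascript
// Hypothetical helper, not part of jQuery or PHP.
// Assumes `encoded` came from PHP's rawurlencode() applied to a UTF-8 string.
function decodeFromPhp(encoded) {
  try {
    return { ok: true, text: decodeURIComponent(encoded) };
  } catch (err) {
    // Percent-encoded bytes that are not valid UTF-8 raise URIError.
    if (err instanceof URIError) {
      return { ok: false, text: null };
    }
    throw err;
  }
}

// UTF-8 bytes, including two kanji at the end
var utf8Result = decodeFromPhp("%3C%2Fscript%3E%20D%C3%A9j%C3%A0%20%E6%BC%A2%E5%AD%97");

// Latin-1 bytes for the same accented word: not valid UTF-8, so decoding fails
var latin1Result = decodeFromPhp("D%E9j%E0");
```

If latin1Result.ok is false for your data, the text probably needs converting to UTF-8 on the PHP side before it is encoded.
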
Dolph
  • You could just use rawurlencode instead of urlencode and then you wouldn't have to replace the plus signs manually. – Artefacto Jul 02 '10 at 01:25
  • Nice, I tested that and it seemed to work. Simplified my answer! – Dolph Jul 02 '10 at 01:27
  • Sorry, fails! With some input (including multibyte chars) javascript complains of malformed uri component. – Tesserex Jul 02 '10 at 01:35
  • @Tess What about converting the text to UTF-8 before encoding? You said "accurately". This gives an accurate representation of the bytestream. – Artefacto Jul 02 '10 at 01:37
  • If you want more/better answers to this question, you're going to have to provide unit tests. – Dolph Jul 06 '10 at 02:05
1

What encoding is your text in, if not UTF-8? If you don't know, you don't have text, you have a byte sequence, which is much harder to faithfully represent. If you do know, you can do something like this using the PHP multibyte string extension:

$("input").val(<?= json_encode(mb_convert_encoding($text, "UTF-8", "ISO-8859-1")); ?>);

Here I've presumed your input is in ISO-8859-1 aka Latin-1 encoding, which is a pretty common case for database strings.

EDIT: This is in response to the comments about a closing script tag. I made this test file and it displays properly for me, at least in Firefox 3.6:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
    <title>Test</title>
    <script src='http://code.jquery.com/jquery-1.4.2.js'></script>
</head>
<form name='foo'>
    <input name='bar' id='bar'/>
</form>
<script language="JavaScript">
    $('input').val("<\/script>");
</script>
</html>
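
The reason the test above works is that "<\/script>" is an ordinary JavaScript string literal for </script>, while the HTML parser no longer sees a closing tag in the source. A minimal sketch of the same idea written in JavaScript, escaping < rather than /; the helper name toInlineScriptLiteral is hypothetical and this is not what json_encode itself emits:

```javascript
// Hypothetical helper: build a JavaScript string literal that is safe to
// print inside an inline script element.
function toInlineScriptLiteral(text) {
  return JSON.stringify(text)
    // Without a raw "<", the HTML parser cannot find a closing script tag.
    .replace(/</g, "\\u003c")
    // Raw line/paragraph separators break string literals in older engines.
    .replace(/\u2028/g, "\\u2028")
    .replace(/\u2029/g, "\\u2029");
}

var source = '</script> Déjà "quoted" 漢字';
var literal = toInlineScriptLiteral(source);

// Evaluating the literal gives back the original text unchanged.
var roundTrip = JSON.parse(literal);
```
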
Walter Mundt
  • this is the right answer, you need to normalise to UTF8 first and it'll work fine – nathan Jul 02 '10 at 02:01
  • I've been trying this and it still fails for closing script tags. They get turned into `<\/script>`, but that still breaks the JS. Also the multibyte characters still don't get converted back. Can you explain "normalize to UTF-8"? – Tesserex Jul 02 '10 at 02:07
  • If I combine this answer with my #3, it doesn't break, but the script tags don't show up at all. – Tesserex Jul 02 '10 at 02:15
  • Does it work if you just put that in the code directly? This works for me in Firefox in a test HTML file with jQuery 1.4.2: `$('input').val("<\/script>");` -- does that code not fill the box for you? – Walter Mundt Jul 02 '10 at 13:43
  • If I use this as is, multibyte characters are not shown correctly. Try it with the full input string I gave at the end of my edit. – Tesserex Jul 02 '10 at 17:40
1

I have found a "good enough" solution that you all might find interesting.

  1. utf8_encode the string on the way into the database. This makes sure that it can be safely handled on the way out by the following steps.

  2. Pass the string through esc() below on the way out:

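// Convert one decimal HTML entity (e.g. &#28450;) into a JavaScript \uXXXX escape.
// Padded to four hex digits; code points above 0xFFFF would need a surrogate pair.
function repl($match)
{
    return sprintf('\u%04x', (int) $match[1]);
}

// JSON-encode the string, then turn any numeric entities into \u escapes.
function esc($string)
{
    $s = json_encode($string);
    $s = preg_replace_callback("/&#([0-9]+);/", "repl", $s);
    return $s;
}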

This isn't absolutely perfect, because there doesn't seem to be any way for the php to know the difference between the user typing 漢 or literally typing &#28450;. So if you type the latter it will become the former. But I doubt anyone will ever want to do that anyway.
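
For comparison, the same entity-to-character step could be done on the client instead. This is only a sketch under the assumption that the stored text uses decimal entities of the form &#NNNNN;, and the helper name decodeNumericEntities is hypothetical. It has the same ambiguity: an entity the user typed literally is converted too.

```javascript
// Hypothetical helper: turn decimal numeric entities such as &#28450;
// into the characters they stand for.
function decodeNumericEntities(text) {
  return text.replace(/&#([0-9]+);/g, function (match, digits) {
    var codePoint = parseInt(digits, 10);
    // Leave anything outside the Unicode range untouched.
    if (codePoint > 0x10ffff) {
      return match;
    }
    // fromCodePoint also handles code points above 0xFFFF.
    return String.fromCodePoint(codePoint);
  });
}

var stored = "</script> Déjà här fö frågor &#28450;&#23383;";
var shown = decodeNumericEntities(stored);
```

The result could then be passed to $("input").val(shown).
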

Tesserex
0

Safe JavaScript escaping for ASCII strings:

<?php
function js_encode($string)
{
    $cleaned = is_null($string) ? null : '';

    // for each letter of the string
    for ($i=0, $len = strlen($string); $i < $len; $i++)
    {
        // get ascii number
        $ord = ord($string[$i]);
        // if [0-9] or [A-Z] or [a-z]
        $cleaned .= (47 < $ord && $ord < 58 OR 64 < $ord && $ord < 91 OR 96 < $ord && $ord < 123)
            // use existing character
            ? $string[$i]
            // otherwise escape it as \xNN (JavaScript requires exactly two hex digits)
            : sprintf('\x%02x', $ord);
    }

    return $cleaned;
}

For Unicode text it is a little more complicated. I am going to start with this and see if I need to do the more complex version.
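
To illustrate what the Unicode-aware version might look like, here is a sketch in JavaScript rather than PHP. It escapes every UTF-16 code unit outside [0-9A-Za-z] as \uXXXX; a PHP port would first have to convert the text to UTF-16 code units. The helper name jsEscape is hypothetical.

```javascript
// Hypothetical helper: escape everything except [0-9A-Za-z] as \uXXXX.
function jsEscape(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var unit = text.charCodeAt(i); // UTF-16 code unit, 0..0xFFFF
    if (/[0-9A-Za-z]/.test(ch)) {
      // plain ASCII letter or digit: keep as is
      out += ch;
    } else {
      // pad to four hex digits, as the \u escape requires
      out += "\\u" + ("0000" + unit.toString(16)).slice(-4);
    }
  }
  return out;
}

var original = '</script> Déjà "quoted" 漢字';
var escaped = jsEscape(original);

// Wrapping the escaped text in quotes gives a literal that evaluates
// back to the original string.
var restored = JSON.parse('"' + escaped + '"');
```
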

Scott Jungwirth
0

You may want to use urlencode() and urldecode().

user268396
0

You can use:

Artefacto