Text changes when copied from word document to web page

Question

I am creating a blog engine and it includes a <textarea> which takes in the input of the whole article.

I then use Ajax to send it to the server and store it in the Text type provided by the GAE datastore.

The Problem: If a user copies the text from a Word document, then I see various random characters on the screen when it is embedded on the web page. I know this is because the Word file uses XML encoding and an HTML page uses UTF-8 encoding (in my case).

The question: How do I change the encoding of the inputted text? Or how can I avoid the XML encoding? Or would changing the encoding of my web page help solve this problem?

Points to be noted: I want this to be automated. I have read on Google that you should first copy the text into a simple text editor, which normalizes the encoding, and then copy it to the web page. That option is not feasible for me.

Also, I have used Weebly before, and at that time I copied text from a Word file into it. Does anyone know how Weebly manages the encoding conflict?

Answers are expected in Java :)

shreyansh jogi · Accepted Answer · 2013-10-12T10:20:36.820

1

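That is because Word replaces plain punctuation such as ' (the apostrophe) with typographic characters (smart quotes, dashes, the ellipsis) that lie outside ASCII. UTF-8 can represent them, but they show up garbled if any step between the browser and the datastore handles the text in a different encoding, so one option is to replace them programmatically.

Below is an example in JavaScript: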

<!-- your text area -->
<textarea rows="4" cols="50" onkeyup="this.value = replaceWordChars(this.value)"></textarea>


// Returns a copy of `text` with Word's typographic characters replaced by ASCII ones.
function replaceWordChars(text) {
    var s = text;
    // smart single quotes and apostrophe
    s = s.replace(/[\u2018\u2019\u201A]/g, "'");
    // smart double quotes
    s = s.replace(/[\u201C\u201D\u201E]/g, "\"");
    // ellipsis
    s = s.replace(/\u2026/g, "...");
    // en dash and em dash
    s = s.replace(/[\u2013\u2014]/g, "-");
    // circumflex
    s = s.replace(/\u02C6/g, "^");
    // open angle bracket
    s = s.replace(/\u2039/g, "<");
    // close angle bracket
    s = s.replace(/\u203A/g, ">");
    // small tilde and non-breaking space
    s = s.replace(/[\u02DC\u00A0]/g, " ");
    // note: no "|" inside the [...] classes, otherwise literal pipes would be replaced too
    return s;
}

You need to call this JavaScript function from the textarea's onkeyup event.
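The same idea can also be written with a single lookup table, which makes the list easier to extend. This is only a minimal sketch: the names WORD_CHAR_MAP, WORD_CHAR_PATTERN and normalizeWordChars are made up for illustration, and the table is an assumption covering the common cases, not a complete list of everything Word can produce.

```javascript
// Hypothetical mapping table (an assumption, not an exhaustive list).
const WORD_CHAR_MAP = {
  "\u2018": "'", "\u2019": "'", "\u201A": "'",  // smart single quotes
  "\u201C": '"', "\u201D": '"', "\u201E": '"',  // smart double quotes
  "\u2026": "...",                              // ellipsis
  "\u2013": "-", "\u2014": "-",                 // en dash, em dash
  "\u00A0": " ",                                // non-breaking space
};

// One character class built from the table keys.
// None of the keys is a regex metacharacter, so no escaping is needed here.
const WORD_CHAR_PATTERN = new RegExp(
  "[" + Object.keys(WORD_CHAR_MAP).join("") + "]",
  "g"
);

// Pure function: takes a string, returns a new string, touches no DOM.
function normalizeWordChars(text) {
  return text.replace(WORD_CHAR_PATTERN, (ch) => WORD_CHAR_MAP[ch]);
}

// prints: "It's done" - a|b...
console.log(normalizeWordChars("\u201CIt\u2019s done\u201D \u2013 a|b\u2026"));
```

Because the function does not touch the page, it could be called from a keyup handler, from a paste handler, or once just before the Ajax request is sent.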

edited Oct 12 '13 at 10:20

answered Oct 12 '13 at 09:58


Does your code above handle all the conflicts, or is it just an example? If it is an example, where can I get a complete list? – leo Oct 12 '13 at 10:06
Most probably it will handle all the characters that differ between the standards. Give it a try and let me know your output. – shreyansh jogi Oct 12 '13 at 10:07
You just need to call this function; it will do the rest. – shreyansh jogi Oct 12 '13 at 10:12
And I'm a little dumb when it comes to JavaScript. So in your code, 'text' is the variable that holds the whole article? And what role does flag play in this? And all in all, the final converted variable is 's', right? – leo Oct 12 '13 at 10:16

score 0 · Answer 2 · edited May 23 '17 at 12:14

Not sure if this will help anyone, but I spent a few days trying to figure out this issue. My use case was very similar except I discovered my problem related to the way the clipboard copied (this changed slightly depending upon OS) and subsequently pasted the text. (I used ClipSpy to investigate what was happening "under the hood".)

Forgive my layman's explanation: The clipboard stores text in multiple formats and when the paste command is given it attempts to match the charset/encoding of the recipient program, or in my case <textarea> box of my webpage. These sites and forum posts helped immensely:

Ultimately, all I had to do was declare <meta charset="UTF-8"> early in <head> and let the browser do the "hard" work for me: the page then expects UTF-8 encoded text, and the clipboard attempts to honour that.
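To illustrate why the declared charset matters, here is a small sketch of how the "random characters" come about when UTF-8 bytes are read back in a single-byte encoding. It assumes Node.js (for Buffer) and uses Latin-1 as the wrong encoding purely for illustration; it is not what the browser or the clipboard literally does.

```javascript
// "It's" with the right single quotation mark (U+2019) that Word inserts.
const original = "It\u2019s";

// Correct step: encode the text as UTF-8. U+2019 becomes three bytes (E2 80 99).
const utf8Bytes = Buffer.from(original, "utf8");

// Wrong step: decode those bytes as if each byte were one Latin-1 character.
const garbled = utf8Bytes.toString("latin1");

// Undoing the wrong step and decoding as UTF-8 recovers the text.
const repaired = Buffer.from(garbled, "latin1").toString("utf8");

console.log(original.length);       // 4
console.log(garbled.length);        // 6
console.log(repaired === original); // true
```

The single quote character turns into three unrelated characters once the bytes are misread. Browsers usually fall back to windows-1252 instead of strict Latin-1, which is why the same mistake typically appears on screen as â€™.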
