I'm talking about content copied from inside a contenteditable div and pasted back into the same div, so no external programs are involved.

The HTML in this div is structured so that each individual word sits inside a span carrying some data we need to track, and the whitespace is left as text nodes between the spans. This works fine for the most part (screw you, newlines), but I've encountered a strange problem when copying and pasting.

Chrome turns this

<span attrs="stuff">word</span> <span attrs="stuff">another</span>

into this:

<span attrs="stuff">word&nbsp;</span><span attrs="stuff">another</span>

or this:

<span attrs="stuff">word</span><span style="line-height: 16.79999">&nbsp;</span><span attrs="stuff">another</span>

This obviously means that if the user copies and pastes more than one line, the formatting is completely screwed up, and the content of the span has changed, which invalidates the data we need to track.

The core problem is that other content in the div may contain non-breaking spaces for legitimate reasons, so if I start swapping them out globally, I might break that.

For my own spans with my attrs, I know what should be in them, so it's easy to strip out the non-breaking spaces and restore them to how they should be. But for these strange spans with the odd line-height, I have no idea how to clean them out without nuking everything.

Right now, I strip all the inserted spans that contain just a non-breaking space. But what I'd really like is either to stop Chrome from doing this in the first place, or an unambiguous way to identify the problematic extra spans so that I can clean them up safely without breaking any similar spans that exist for real reasons. I could key on this strange line-height, I guess, but that feels pretty brittle and unsafe.
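For reference, a minimal sketch of the cleanup described above. It assumes the function is only ever run on a subtree known to come from a paste (not on the whole div), that tracked word spans contain nothing but text, and that `attrs` stands in for the real tracking attribute. The names `splitTrackedText`, `isSpacerText`, and `cleanPastedFragment` are hypothetical.

```javascript
// Placeholder for whatever attribute marks the tracked word spans (assumption).
const TRACKED_ATTR = "attrs";

// Split a tracked span's text into the word and any whitespace folded into it.
// Non-breaking spaces in the leading/trailing part become ordinary spaces.
function splitTrackedText(text) {
  // In JavaScript, \s also matches U+00A0 (non-breaking space).
  const [, leading, word, trailing] = /^(\s*)([\s\S]*?)(\s*)$/.exec(text);
  return {
    leading: leading.replace(/\u00a0/g, " "),
    word,
    trailing: trailing.replace(/\u00a0/g, " "),
  };
}

// True when the text is non-empty and made up only of spaces and non-breaking spaces.
function isSpacerText(text) {
  return text.length > 0 && /^[\u00a0 ]+$/.test(text);
}

// Clean a subtree that is known to have been pasted.
// Spans outside this subtree are never touched.
function cleanPastedFragment(root) {
  for (const span of Array.from(root.querySelectorAll("span"))) {
    const text = span.textContent;
    if (span.hasAttribute(TRACKED_ATTR)) {
      // Tracked span: move stray whitespace back out into sibling text nodes.
      const { leading, word, trailing } = splitTrackedText(text);
      span.textContent = word;
      if (leading) span.before(leading);
      if (trailing) span.after(trailing);
    } else if (span.children.length === 0 && isSpacerText(text)) {
      // Untracked span holding only (non-breaking) spaces: replace it with one plain space.
      span.replaceWith(" ");
    }
  }
}
```

This does not identify the inserted spans unambiguously; it only narrows the damage by limiting the cleanup to pasted content. A legitimate spacer span that the user copies and pastes would still be flattened.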

How can I prevent the spans from appearing or identify them unambiguously?

Puppy
  • Do you have an example that you could post? I don't quite understand what you're trying to do. Are you parsing the contenteditable div and wrapping words with tags? – Carl Reid Mar 17 '15 at 09:32
  • Does [html encode/decode](http://stackoverflow.com/a/1219983/724913) solve your problem? – arkoak Mar 17 '15 at 09:41
  • @Yoink: No, I inserted content through JS into the div. Then the user copies and pastes it. It's a rich text box, so it may contain quite a few strange things that the user could try to copy and paste as well, so I can't make assumptions about what is incoming unless it's specifically tagged as mine. – Puppy Mar 17 '15 at 09:47

1 Answer

This is not only a Chrome problem. Something like this can happen any time you copy HTML from one place to another.

This is why editors like CKEditor exist. They have advanced filtering techniques for removing this kind of unwanted HTML.

I recommend using a clipboard viewer to see what the HTML looks like when you copy from different places: https://softwarerecs.stackexchange.com/questions/17710/see-clipboard-contents-hex-text

But implementing this filtering on your own would be a waste of time, in my opinion.

CKEditor can be configured in detail to keep the unwanted HTML out.

Recent versions of CKEditor have a very sophisticated content filtering approach called "Advanced Content Filter".

Basically, "Advanced Content Filter" means that all incoming HTML is parsed and checked against a set of rules. Any HTML that no rule matches gets filtered out.
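As a rough illustration, an allow-list for the markup in the question might look like the sketch below, using CKEditor 4's `allowedContent` setting. The `attrs` attribute is the placeholder from the question, `editor` is a hypothetical element id, and the list of allowed elements is an assumption that would need to match the real content.

```javascript
// Allow-list rules in CKEditor 4's "allowedContent" string syntax.
// "attrs" is the placeholder attribute from the question; the "!" marks it as
// required, so a span without it is not accepted as a span.
const editorConfig = {
  allowedContent: "p br; span[!attrs]",
};

// "editor" is a hypothetical element id. The guard lets this snippet load
// in environments where CKEditor is not present.
if (typeof CKEDITOR !== "undefined") {
  CKEDITOR.replace("editor", editorConfig);
}
```

With a rule like this, spans that carry only an inline style should be rejected while their text is kept. It would not by itself move a non-breaking space back out of a tracked span.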

Matthias