I'm talking about content from inside a contenteditable div, and the target is the same contenteditable div. So no external programs involved.
The HTML in this div is structured so that each individual word sits inside a span carrying some data we need to track, and the whitespace is left as text nodes between the spans. This works fine for the most part (screw you, newlines), but I've hit a strange problem when copying and pasting.
Chrome turns this
<span attrs="stuff">word</span> <span attrs="stuff">another</span>
into this:
<span attrs="stuff">word&nbsp;</span><span attrs="stuff">another</span>
or this:
<span attrs="stuff">word</span><span style="line-height: 16.79999">&nbsp;</span><span attrs="stuff">another</span>
This obviously means that if the user copies and pastes across more than one line, the formatting is completely screwed up, and the content of the span has changed, which invalidates the data we need to track.
The core problem is that other content in the div may contain non-breaking spaces for legitimate reasons, so if I start replacing them globally, I might break that content.
For my own spans with my attrs, I know what should be in them, so it's easy to strip out the non-breaking spaces and restore them to how they should be. But for these strange spans with the odd line-height, I've no idea how to clean them out without nuking everything.
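To make the "easy" half concrete, here is a minimal sketch of restoring a tracked span. It assumes the tracked spans can be found by the placeholder attribute `attrs` from the examples above, and that the whitespace between words should be a plain space in a text node. The function names are hypothetical.

```javascript
// Split a span's text into the word and any whitespace merged into it.
// Note: \s in JavaScript regexes already matches U+00A0 (non-breaking space).
function splitWordAndPadding(text) {
  const match = /^(\s*)(.*?)(\s*)$/s.exec(text);
  return { leading: match[1], word: match[2], trailing: match[3] };
}

// Move merged whitespace back out of a tracked span into sibling text nodes.
// Assumption: a single regular space is the right replacement.
function restoreTrackedSpan(span) {
  const { leading, word, trailing } = splitWordAndPadding(span.textContent);
  if (!leading && !trailing) return; // nothing was merged in
  span.textContent = word;
  if (leading) span.before(" ");
  if (trailing) span.after(" ");
}

// Apply to every tracked span inside the editable div ("attrs" is a placeholder).
function restoreAllTrackedSpans(editableDiv) {
  editableDiv.querySelectorAll("span[attrs]").forEach(restoreTrackedSpan);
}
```

This only touches spans that carry the tracking attribute, so non-breaking spaces elsewhere in the div are left alone.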
Right now, I strip all the inserted spans that contain just a non-breaking space. But what I'd really like is either a way to stop Chrome from doing this in the first place, or an unambiguous way to identify the problematic extra spans so that I can clean them up safely without breaking any similar spans that exist for real reasons. I could key off this strange line-height, I guess, but that feels pretty brittle and unsafe.
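For reference, the current workaround amounts to something like the sketch below. It assumes an "inserted" span is one that lacks the placeholder tracking attribute `attrs` and whose whole text is a single U+00A0. The names are hypothetical, and the check is exactly the ambiguous one described above: it cannot tell a Chrome-inserted span from a legitimate untracked span that happens to hold only a non-breaking space.

```javascript
// True for a span with no tracking attribute whose text is exactly one NBSP.
// Assumption: tracked spans always carry the attribute named by trackingAttr.
function isStrayNbspSpan(el, trackingAttr = "attrs") {
  return (
    el.tagName === "SPAN" &&
    !el.hasAttribute(trackingAttr) &&
    el.textContent === "\u00A0"
  );
}

// Replace every matching span under root with a plain-space text node.
function replaceStraySpans(root) {
  // Copy to an array first so the DOM can be mutated while iterating.
  for (const span of Array.from(root.querySelectorAll("span"))) {
    if (isStrayNbspSpan(span)) {
      span.replaceWith(" ");
    }
  }
}
```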
How can I prevent the spans from appearing or identify them unambiguously?