I hate to sound negative, but this is ridiculously hard. You have to deal with IE and the other browsers, and their implementations of selection and caret handling are vastly different. Where it gets really hard is that clicking a button to insert the image takes focus away from the editable area, so the caret position is lost. You need to remember the position yourself with some kind of onblur bookmarking (which, again, works differently in IE). Focus is less of an issue if your contentEditable area lives in an iframe, since the iframe maintains its own focus. (Note: I'm not dissing IE here; I actually prefer its implementation to the W3C standard drek.)
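To make the bookmarking idea concrete, here is a minimal sketch for browsers that support the W3C `Selection`/`Range` API; it does not cover old IE's `document.selection`, which needs a separate code path. The `CaretBookmark` class and the `RangeLike`/`SelectionLike` interfaces are hypothetical names invented for this illustration, and the choice of event for calling `save()` is an assumption: use whichever one still sees the selection in your target browsers.

```typescript
// Minimal structural types (hypothetical names) so the sketch does not depend
// on DOM typings; the browser's Range and Selection are expected to fit them.
interface RangeLike {
  cloneRange(): RangeLike;
  deleteContents(): void;
  insertNode(node: unknown): void;
  collapse(toStart?: boolean): void;
}

interface SelectionLike {
  readonly rangeCount: number;
  getRangeAt(index: number): RangeLike;
  removeAllRanges(): void;
  addRange(range: RangeLike): void;
}

// Remembers the caret position before focus is lost and inserts a node there later.
class CaretBookmark {
  private saved: RangeLike | null = null;
  private getSelection: () => SelectionLike | null;

  constructor(getSelection: () => SelectionLike | null) {
    this.getSelection = getSelection;
  }

  // Call while the selection is still inside the editor, e.g. from its blur
  // handler (assumption: the selection is still readable at that point).
  save(): void {
    const sel = this.getSelection();
    if (sel && sel.rangeCount > 0) {
      // Clone so later selection changes cannot mutate the bookmark.
      this.saved = sel.getRangeAt(0).cloneRange();
    }
  }

  // Call from the toolbar button's click handler.
  // Returns false when no caret position has been saved yet.
  insertAtBookmark(node: unknown): boolean {
    const sel = this.getSelection();
    if (!sel || !this.saved) return false;
    const range = this.saved;
    range.deleteContents();  // replace any selected content
    range.insertNode(node);  // insert at the remembered position
    range.collapse(false);   // move the caret to just after the new node
    sel.removeAllRanges();
    sel.addRange(range);     // restore the selection in the editor
    return true;
  }
}

// Possible browser wiring (editor, button and url are placeholders):
//   const bookmark = new CaretBookmark(() => window.getSelection());
//   editor.addEventListener("blur", () => bookmark.save());
//   button.addEventListener("click", () => {
//     const img = document.createElement("img");
//     img.src = url;
//     bookmark.insertAtBookmark(img);
//   });
```

Passing the selection getter in, rather than calling `window.getSelection()` directly, is only a convenience here: it makes it easier to point the bookmark at an iframe's own selection, or at a stand-in object when trying the logic outside a browser.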
You can look at some open source rich-text editors for clues and hints, but expect to find an enormous amount of code handling these seemingly simple tasks.