
I need a way to map a position in the text of a div element (that is, the index of a character) to the corresponding position in the HTML code. This is necessary because I must be able to insert an element at that position without losing the formatting.

For example I have the following:

HTML

<p>Lorem <strong>ipsum</strong> dolor sit...</p>

which is interpreted to:

Text

Lorem ipsum dolor sit...

Now I would like to insert a string at a specific position inside the text:

Lorem ipsum d<insertion>olor sit...

This is at string-index: 13

Accordingly, the position of the insertion in my HTML should be 33 (the d itself sits at index 32), because the HTML tags <p>, <strong> and </strong> must also be counted to find the correct position inside the HTML.
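The numbers in this example can be checked with plain string slicing. This is only a quick sanity check, and the variable names are made up for illustration. Note that the `d` of "dolor" sits at index 32 of the HTML, so slicing at offset 33 places the insertion directly after it.

```python
# Sanity check of the example above; all variable names are illustrative.
text = "Lorem ipsum dolor sit..."
html = "<p>Lorem <strong>ipsum</strong> dolor sit...</p>"

text_index = 13
before_text = text[:text_index]      # the 13 characters before the insertion

# Locate the 'd' of "dolor" in the HTML, then step one past it.
d_index = html.index(" dolor") + 1   # index of 'd' in the HTML
html_index = d_index + 1             # slice offset for the insertion

marked = html[:html_index] + "<insertion>" + html[html_index:]
print(before_text)
print(d_index, html_index)
print(marked)
```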

I only have the following information:

  • text as string (that means without any tags)
  • text as HTML
  • index of text-string where the insertion has to be placed (it's the 13 in my example)

The solution should be in Python. I played around with the BeautifulSoup module, but didn't find a way to insert text at a specific index inside an element.

Hope someone can help me with this. Many thanks in advance!

noplacetoh1de

1 Answer

As I understand your question, you want to insert something into the HTML code after a character whose index you know in the plain text. If that is the case, I think the easiest solution is to ignore all the HTML tags and count only the characters outside them. You could do it like this:

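def insertInHtml(string, insstr, position):
    ctr = 0             # visible (non-tag) characters seen so far
    insidetag = False   # True while scanning between '<' and '>'
    HTMLIndex = None
    for ci in range(len(string)):
        if string[ci] == '<':
            insidetag = True
        elif string[ci] == '>':
            insidetag = False
        elif not insidetag:
            ctr += 1
        # ctr reaches position+1 on the character right after the insertion point
        if ctr == position + 1:
            HTMLIndex = ci
            break
    if HTMLIndex is None:
        # the loop ran out of text before reaching the requested position
        raise IndexError('position is not smaller than the length of the text')
    return string[:HTMLIndex] + insstr + string[HTMLIndex:]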

The function walks through the HTML string passed as the 'string' argument and counts only the characters that are not inside HTML tags. Once the count passes the number given as the 'position' argument, the loop breaks and the string is split at that point, so that exactly 'position' visible characters come before the split. The function then inserts the insstr string between the two parts and returns the new string. It will raise an error if the provided index is not smaller than the length of the text.
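To make the counting concrete, here is a minimal sketch of the same tag-skipping idea that only returns the HTML offset instead of building the new string. The name `html_offset` is hypothetical, and like the function above it assumes there are no comments and no `<` or `>` inside attribute values.

```python
def html_offset(html, position):
    """Return the index in `html` of visible character number `position` (0-based)."""
    visible = 0          # characters seen outside of tags so far
    inside_tag = False
    for i, ch in enumerate(html):
        if ch == '<':
            inside_tag = True
        elif ch == '>':
            inside_tag = False
        elif not inside_tag:
            if visible == position:
                return i
            visible += 1
    raise IndexError("position is not smaller than the length of the text")


html = "<p>Lorem <strong>ipsum</strong> dolor sit...</p>"
offset = html_offset(html, 13)
print(offset)  # 33
print(html[:offset] + "<insertion>" + html[offset:])
```

Returning the offset separately makes it easy to check the mapping for a few positions before actually modifying the HTML.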

EDIT: As J. F. Sebastian noted, this will fail if the HTML contains comments (sections starting with <, an exclamation point and two dashes) or a literal > inside an attribute value. Here is a function that handles both cases:

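def insertInHtml(string, insstr, position):
    ctr = 0                # visible characters seen so far
    insidetag = False      # between '<' and the closing '>' of a tag
    insidecomment = False  # between '<!--' and '-->'
    quote = ''             # quote character of the attribute value being scanned
    HTMLIndex = None
    for ci in range(len(string)):
        ch = string[ci]
        if insidecomment:
            # a comment only ends at '-->'; everything inside is skipped
            if string[ci-2:ci+1] == '-->':
                insidecomment = False
        elif quote:
            # inside a quoted attribute value, '<' and '>' are literal
            if ch == quote:
                quote = ''
        elif insidetag:
            if ch == '>':
                insidetag = False
            elif ch in '"\'':
                quote = ch
        elif string.startswith('<!--', ci):
            insidecomment = True
        elif ch == '<':
            insidetag = True
        else:
            ctr += 1
            if ctr == position + 1:
                HTMLIndex = ci
                break
    if HTMLIndex is None:
        raise IndexError('position is not smaller than the length of the text')
    return string[:HTMLIndex] + insstr + string[HTMLIndex:]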

Not very clean code, but should be understandable enough.
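As an alternative to scanning by hand, the tokenizing can be left to the standard library's `html.parser.HTMLParser`, which already copes with comments and quoted attribute values. The following is a rough sketch under stated assumptions, not a drop-in replacement: the names `_TextSpans` and `insert_at_text_index` are made up, the plain text is assumed to be the text nodes concatenated exactly as they appear in the source (no whitespace collapsing), and each character reference such as `&amp;` is assumed to count as one text character. Note that the contents of script and style elements are reported as text by this parser as well.

```python
from html.parser import HTMLParser


class _TextSpans(HTMLParser):
    """Record where every run of text starts in the HTML source."""

    def __init__(self):
        # convert_charrefs=False reports '&amp;' etc. as separate events,
        # so the recorded lengths match the raw source.
        super().__init__(convert_charrefs=False)
        self.spans = []  # (line number, column, length in text characters)

    def _mark(self, length):
        lineno, col = self.getpos()  # where the current token starts
        self.spans.append((lineno, col, length))

    def handle_data(self, data):
        self._mark(len(data))

    def handle_entityref(self, name):
        self._mark(1)  # assumption: one reference is one text character

    def handle_charref(self, name):
        self._mark(1)


def insert_at_text_index(html, insertion, position):
    parser = _TextSpans()
    parser.feed(html)
    parser.close()

    # Absolute offset of the start of each line, to convert (line, column).
    line_starts = [0]
    for i, ch in enumerate(html):
        if ch == '\n':
            line_starts.append(i + 1)

    seen = 0  # text characters covered by the spans visited so far
    for lineno, col, length in parser.spans:
        if position < seen + length:
            offset = line_starts[lineno - 1] + col + (position - seen)
            return html[:offset] + insertion + html[offset:]
        seen += length
    raise IndexError("position is not smaller than the length of the text")


print(insert_at_text_index(
    "<p>Lorem <strong>ipsum</strong> dolor sit...</p>", "<insertion>", 13))
```

Comments are delivered to a separate handler that this sketch does not override, so their contents never count towards the text index.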

Fran Borcic
  • it fails if an attribute has literal `>` in it or if there is a comment. See [Is “>” (U+003E GREATER-THAN SIGN) allowed inside an html-element attribute value?](http://stackoverflow.com/questions/94528/is-u003e-greater-than-sign-allowed-inside-an-html-element-attribute-value). – jfs Nov 15 '12 at 16:10
  • OK, I didn't really think a case like that was possible. I'll edit the function to ignore that literal if it is inside quotation marks or followed by an exclamation point and two dashes – Fran Borcic Nov 15 '12 at 16:17