I have some text files generated from PDFs and want to add some XML tags using a small Python script and regex patterns. It mostly works fine, but sometimes an expression does not match all the characters it should. In a regex testing tool it works correctly.

Here's the Python code:

matchs = re.finditer("<UTop>[^<]+", string)
for m in matchs:
    tagend = m.end()
    string = string[:tagend] + "</UTop>" + string[tagend:]

The original string...

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>

... should be transformed to:

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </UTop></Top>

but it returns

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Krets</UTop>chmann </Top>

instead.

I would be glad to get a reply to that question. Jan

Jan Seipel
  • In case you are trying to parse HTML with regex, [see this](http://stackoverflow.com/a/1732454/4464702) – randers Jan 11 '16 at 17:29

2 Answers

Use the Unicode flag:

matchs = re.finditer("<UTop>[^<]+",string,re.UNICODE)

For HTML consider using BeautifulSoup instead.

Jan
  • +1 for BeautifulSoup. See also [docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) for details regarding the handling of broken HTML input. – PeterE Jan 11 '16 at 17:44
  • Thanks for your answer. Unfortunately unicode flags didn't solve the problem. – Jan Seipel Jan 12 '16 at 21:09

I tested it using re.sub() and the result looks correct.

# coding: utf-8
import re

text = "<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>"
# Close each <UTop> run in a single pass; \g<1> re-inserts the matched text
print(re.sub(r"(<UTop>[^<]+)", r"\g<1></UTop>", text))

As you said, regex testing tools handle it properly too.
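
A possible explanation for the difference, assuming the real input contains more than one <UTop> entry (the question shows only a single line): in the loop, finditer() keeps reporting offsets in the original string, while every inserted "</UTop>" makes the rebuilt string 7 characters longer. From the second match on, the tag lands 7 characters too early for each earlier insertion, which fits the output in the question, where the 7 characters "chmann " end up after the tag. The sketch below reproduces this with a made-up two-entry sample; the function names and the sample text are only for illustration.

```python
import re

PATTERN = r"<UTop>[^<]+"
CLOSE_TAG = "</UTop>"


def close_naive(text):
    """Same approach as the loop in the question: offsets refer to the original string."""
    original = text
    for m in re.finditer(PATTERN, original):
        end = m.end()  # position in the ORIGINAL string
        text = text[:end] + CLOSE_TAG + text[end:]  # rebuilt string keeps growing
    return text


def close_with_sub(text):
    """Let re.sub() build the result in one pass, so no offsets can go stale."""
    return re.sub(r"(<UTop>[^<]+)", r"\g<1></UTop>", text)


def close_reversed(text):
    """Keep the loop, but insert from the last match to the first."""
    for m in reversed(list(re.finditer(PATTERN, text))):
        end = m.end()
        text = text[:end] + CLOSE_TAG + text[end:]
    return text


# Made-up sample with two entries (an assumption, not data from the question)
sample = "<Top>1. A<UTop>First </Top><Top>2. B<UTop>Second speaker </Top>"
expected = "<Top>1. A<UTop>First </UTop></Top><Top>2. B<UTop>Second speaker </UTop></Top>"

print(close_naive(sample))     # second closing tag is misplaced
print(close_with_sub(sample))
print(close_reversed(sample))
```

Inserting from the end works because text before an insertion point never moves, so the earlier offsets stay valid.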

Felipe