Python regex does not match all the characters wanted

Question

I have some text files made from PDFs and want to add some XML tags using a small Python script and regex patterns. It mostly works fine, but sometimes an expression does not match all the characters I want. In the testing tool here it works correctly.

Here's the python-code:

matchs = re.finditer("<UTop>[^<]+",string)
for m in matchs:
    tagend = m.end()
    string = string[:tagend] + "</UTop>" + string[tagend:]

The original string...

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>

... should be transformed to:

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </UTop></Top>

but it returns

<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Krets</UTop>chmann </Top>

instead.

I would be glad to get a reply to that question. Jan

In case you are trying to parse HTML with regex, [see this](http://stackoverflow.com/a/1732454/4464702) — randers, Jan 11 '16 at 17:29

Jan · Answer 1 · 2016-01-11T18:04:08.750

1

Use the Unicode flag:

matchs = re.finditer("<UTop>[^<]+",string,re.UNICODE)

For HTML consider using BeautifulSoup instead.

edited Jan 11 '16 at 18:04

answered Jan 11 '16 at 17:39


1

+1 for BeautifulSoup. See also [docs](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers) for details regarding the handling of broken HTML input. – PeterE Jan 11 '16 at 17:44
Thanks for your answer. Unfortunately unicode flags didn't solve the problem. – Jan Seipel Jan 12 '16 at 21:09

Felipe · Accepted Answer · 2016-01-11T18:57:33.340

1

I tested it using re.sub() and the result seems to be right.

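 # coding: utf-8
 import re

 # Sample line from the question ("text" avoids shadowing the built-in input())
 text = "<Top>1. Regierungserklärung des Ministerpräsidenten<UTop>Ministerpräsident Winfried Kretschmann </Top>"
 # Append the closing tag directly after each <UTop> text run
 print(re.sub(r"(<UTop>[^<]+)", r"\g<1></UTop>", text))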

As you said, it also works properly in regex testing tools, for example here.
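A likely reason the loop in the question misplaces the tag while re.sub() does not: re.finditer() reports offsets into the original string, but every insertion makes the working string 7 characters longer (the length of "</UTop>"). If an earlier <UTop> has already been closed, the next tag lands 7 characters too early, and "chmann " is exactly 7 characters. The sketch below assumes the real file contains more than one <UTop> entry; the first line of the sample and the function names are invented for illustration.

```python
import re

CLOSE = "</UTop>"
PATTERN = re.compile(r"<UTop>[^<]+")


def close_utop_buggy(text):
    # Mirrors the loop from the question: offsets come from the ORIGINAL
    # text, while each insertion lengthens the working copy by 7 characters.
    result = text
    for m in PATTERN.finditer(text):
        end = m.end()
        result = result[:end] + CLOSE + result[end:]
    return result


def close_utop_fixed(text):
    # Inserting from the last match backwards keeps the offsets of the
    # earlier matches valid.
    result = text
    for m in reversed(list(PATTERN.finditer(text))):
        end = m.end()
        result = result[:end] + CLOSE + result[end:]
    return result


# Hypothetical two-entry sample; the first entry is an assumption,
# the second is the line from the question.
sample = (
    "<Top>0. Eröffnung<UTop>Präsident </Top>\n"
    "<Top>1. Regierungserklärung des Ministerpräsidenten"
    "<UTop>Ministerpräsident Winfried Kretschmann </Top>"
)

print(close_utop_buggy(sample))
print(close_utop_fixed(sample))
```

Iterating in reverse is only one way around the shifting offsets. The re.sub() call above avoids the issue entirely because it builds the new string in a single pass.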

edited Jan 11 '16 at 18:57

answered Jan 11 '16 at 18:22

