Regular expression in Python and re.split splitting the wrong thing

Question

I've been trying to organize a text using Python, but my attempt at using re.split is not working, even though my regular expression seems correct (I tested it in Notepad++).

I need to split my text on the regular expression (and keep what was matched), but the text is being split character by character.

texttag holds the contents of a txt file that looks like this:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>

I'm trying to split on the chapter heading:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>

and to tag the result like this:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Here is my whole code for now:

Dumas_XML=open("D:/cours/M1/S2/PTpython/GitHub/test2.txt","a") #C:/Users/super/Desktop/PTpython/GitHub/
# opening of the XML header
Dumas_XML.write('<?xml version="1.0" encoding="UTF-8"?>\n')
Dumas_XML.write('<Doc name="DUMAS" path="C:/Users/super/Desktop/PTpython/GitHub/textes"></Doc>\n')
Dumas_XML.write('<Document num="1" taille= "nombre de mots int()"/> </Document> \n ')

filetag = open("D:/cours/M1/S2/PTpython/GitHub/wordtag.txt")

import re
texttag= filetag.read()

regextag ="(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n<word>'.'</word><pos> 'PUNCT'</pos>\r\n<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"

xx=re.split(regextag, texttag)

compteurchap=0
for chap in xx :
    if re.search(regextag, chap) : 
        compteurchap=compteurchap+1
        Dumas_XML.write("<chap"+str(compteurchap)+">\n")
        print("<head>"+chap+"</head>")
        Dumas_XML.write("<head>"+chap+"</head>")
    #else:
        Dumas_XML.write(chap)
        Dumas_XML.write("</chap>\n")

How can I do this correctly?

Please show a [mre]. We don't know what `texttag` is, and we don't know what exactly you expect to get as result. — mkrieger1, May 01 '22 at 16:56
Why oh why must we [constantly invoke ZA̡͊͠͝LGΌ](https://stackoverflow.com/a/1732454/364696)? (An unserious way to say "Use an XML parser for XML, stop trying to use regex for things they're not meant for", with all the attendant difficulty and brittleness.) — ShadowRanger, May 01 '22 at 17:12
Huh, OK, so I understand I can't use regex on HTML, but this is XML. I'm a newbie, but they're different... right? Anyway, thank you, I learned something at least. — Gluelle, May 01 '22 at 17:14
@Gluelle: Neither XML nor HTML is something you should be parsing with regex. With the ridiculous way-more-than-just-regex engines some languages have (*cough* Perl *cough*), you *can* parse XML with them (it's more regular in certain key ways), but the regex you'd write to do it would be insanely verbose, brittle in the extreme, and essentially impossible to maintain. Regex is fine for hacking on simple data formats with no existing parsers, but for more complex formats that have proper parsers, please use them. — ShadowRanger, May 02 '22 at 05:14
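
To make the parser suggestion from the comments concrete, here is a minimal sketch using the standard library's xml.etree.ElementTree. It rests on assumptions: the sample lines are copied from the question, the `<doc>` wrapper is a hypothetical root element added only because the file has none, and the text is assumed to contain no raw `&` or `<` characters (those would need escaping first).

```python
import xml.etree.ElementTree as ET

# Sample lines assumed to match the tagged file from the question.
texttag = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
    "<word>'Marseille'</word><pos> 'PROPN'</pos>\n"
)

# The file has no single root element, so wrap it in a hypothetical <doc>
# root to make it well-formed before parsing.
root = ET.fromstring("<doc>" + texttag + "</doc>")

# Collect the <word> and <pos> values in document order, stripping the
# surrounding quotes (and the leading space inside <pos>), then pair them up.
words = [w.text.strip("'") for w in root.iter("word")]
tags = [p.text.strip(" '") for p in root.iter("pos")]
pairs = list(zip(words, tags))
print(pairs)

# Chapter headings can then be located by token value instead of raw text.
chapter_starts = [i for i, (word, _) in enumerate(pairs) if word == "CHAP"]
print(chapter_starts)
```

Working on (word, tag) pairs avoids any dependence on line endings or exact spacing, which is where text-level patterns tend to break.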

score 0 · Accepted Answer · answered May 01 '22 at 18:23

If you must use regex then this could be an option:

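import re

# texttag and Dumas_XML are the variables defined in the question's code.
# Note: this handles only the first heading in the text.

# Heading: from the first <word> up to the first 'NOUN' tag (non-greedy).
pattern1 = re.compile(r"<word>.*?'NOUN'</pos>", re.MULTILINE | re.DOTALL)
# Body: everything after the first 'NOUN' tag, up to the end of the text.
pattern2 = re.compile(r"'NOUN'</pos>(.*)$", re.MULTILINE | re.DOTALL)

reobj = pattern1.search(texttag)

text = "<chap1>\n<head>"
text += reobj.group() + "\n</head>\n"
text += pattern2.findall(texttag)[0]
text += "\n</chap>\n"
print(text)
Dumas_XML.write(text)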

output:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Is that close to what you are looking for?
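
For a file with several chapters, the same idea can be pushed a bit further with re.split and a capturing group. Two details seem likely to explain why the pattern from the question matches in Notepad++ but not in Python: a file opened in text mode has its `\r\n` line endings translated to `\n` by default, and `[A-Z]{1,7}` cannot match `Ier` because of the lowercase letters. The sketch below is only an illustration under assumptions: the sample text (including the second chapter), the `tag_chapters` helper and the `[A-Za-z]` numeral class are made up here, and the closing tag is written as `</chapN>` to mirror the opening one rather than the `</chap>` shown in the question.

```python
import re

# Sample input assumed to mirror the tagged file from the question.
# The second chapter is invented here for illustration.
texttag = (
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'Ier'</word><pos> 'NOUN'</pos>\n"
    "<word>'Marseille'</word><pos> 'PROPN'</pos>\n"
    "<word>'CHAP'</word><pos> 'ADJ'</pos>\n"
    "<word>'.'</word><pos> 'PUNCT'</pos>\n"
    "<word>'II'</word><pos> 'NOUN'</pos>\n"
    "<word>'Le'</word><pos> 'DET'</pos>\n"
)

# \r?\n accepts both Unix and Windows line endings; \. matches a literal dot.
# The numeral class [A-Za-z]{1,7} is an assumption ("Ier", "II", ...).
HEADING = re.compile(
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)


def tag_chapters(text):
    """Hypothetical helper: wrap each heading and its body in chapter tags."""
    # With one capturing group, re.split returns
    # [before, heading1, body1, heading2, body2, ...].
    parts = HEADING.split(text)
    out = []
    for number, i in enumerate(range(1, len(parts), 2), start=1):
        heading, body = parts[i], parts[i + 1]
        out.append(f"<chap{number}>\n<head>{heading}\n</head>{body}</chap{number}>\n")
    return "".join(out)


result = tag_chapters(texttag)
print(result)
```

With one capturing group, the list returned by re.split alternates between plain text and matched headings, so the odd indexes are headings and the element after each one is that chapter's body. Any text before the first heading is dropped in this sketch.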

@Gluelle You're welcome. If that solved your problem, please mark the question as answered/completed. — Alexander, May 02 '22 at 21:05
