
I've been trying to organize a text using Python, but my attempt at using re.split is not working, even though my regular expression is correct (I tested it in Notepad++).

I need to split my text on the regular expression (and keep the matched text), but the text is being split character by character.

texttag holds the content of a .txt file that looks like this:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>

I'm trying to split on the chapter heading:

<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>

and tag the result like this:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Here is my whole code so far:

Dumas_XML=open("D:/cours/M1/S2/PTpython/GitHub/test2.txt","a") #C:/Users/super/Desktop/PTpython/GitHub/
# opening of the XML header
Dumas_XML.write('<?xml version="1.0" encoding="UTF-8"?>\n')
Dumas_XML.write('<Doc name="DUMAS" path="C:/Users/super/Desktop/PTpython/GitHub/textes"></Doc>\n')
Dumas_XML.write('<Document num="1" taille= "nombre de mots int()"/> </Document> \n ')

filetag = open("D:/cours/M1/S2/PTpython/GitHub/wordtag.txt")

import re
texttag= filetag.read()

regextag ="(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r\n<word>'.'</word><pos> 'PUNCT'</pos>\r\n<word>'[A-Z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"

xx=re.split(regextag, texttag)

compteurchap=0
for chap in xx :
    if re.search(regextag, chap) : 
        compteurchap=compteurchap+1
        Dumas_XML.write("<chap"+str(compteurchap)+">\n")
        print("<head>"+chap+"</head>")
        Dumas_XML.write("<head>"+chap+"</head>")
    #else:
        Dumas_XML.write(chap)
        Dumas_XML.write("</chap>\n")

How can I do this correctly?

Gluelle
  • Please show a [mre]. We don't know what `texttag` is, and we don't know what exactly you expect to get as result. – mkrieger1 May 01 '22 at 16:56
  • OK, should I try editing the question or do it in the comments? – Gluelle May 01 '22 at 16:58
  • Update your code on the question –  May 01 '22 at 17:02
  • so you are splitting HTML? –  May 01 '22 at 17:02
  • this is XML it looks very close – Gluelle May 01 '22 at 17:09
  • Why oh why must we [constantly invoke ZA̡͊͠͝LGΌ](https://stackoverflow.com/a/1732454/364696)? (An unserious way to say "Use an XML parser for XML, and stop trying to use regex for things it's not meant for", with all the attendant difficulty and brittleness.) – ShadowRanger May 01 '22 at 17:12
  • Huh, OK, so I understand I can't use regex on HTML, but this is XML. I'm a newbie, but they're different... right? Anyway, thank you, I learned something at least. – Gluelle May 01 '22 at 17:14
  • Is it a requirement that you use regex? – Alexander May 01 '22 at 17:47
  • @Gluelle: Neither XML nor HTML is something you should be parsing with regex. With the ridiculous way-more-than-just-regex engines some languages have (*cough* Perl *cough*), you *can* parse XML with them (it's more regular in certain key ways), but the regex you'd write to do it would be insanely verbose, brittle in the extreme, and essentially impossible to maintain. Regex is fine for hacking on simple data formats with no existing parsers, but for more complex formats with proper parsers, please use them. – ShadowRanger May 02 '22 at 05:14
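
As a rough illustration of the parser route suggested in the comments above, here is a minimal sketch using the standard library's xml.etree.ElementTree. It assumes the file contains only <word>/<pos> pairs with no stray "&" or "<" characters, and it wraps the text in a hypothetical <doc> root because the file has no single root element. The names SAMPLE, read_tokens and split_chapters are made up for this example.

```python
import xml.etree.ElementTree as ET

# A short sample in the same shape as the question's file.
SAMPLE = """<word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
"""


def read_tokens(text):
    """Return a list of (word, pos) pairs, quotes included, in file order."""
    # Assumption: wrapping in <doc> is enough to make the text well-formed.
    root = ET.fromstring("<doc>" + text + "</doc>")
    words = [w.text for w in root.findall("word")]
    tags = [p.text.strip() for p in root.findall("pos")]
    return list(zip(words, tags))


def split_chapters(tokens):
    """Start a new chapter each time the token 'CHAP' appears."""
    chapters = []
    for word, pos in tokens:
        if word == "'CHAP'":
            chapters.append([])
        # Tokens that appear before the first 'CHAP' are skipped in this sketch.
        if chapters:
            chapters[-1].append((word, pos))
    return chapters


if __name__ == "__main__":
    for number, chapter in enumerate(split_chapters(read_tokens(SAMPLE)), start=1):
        print(number, chapter[:3])
```

Working on (word, pos) pairs instead of raw text means the chapter test is a plain string comparison, so line endings and regex escaping no longer matter.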

1 Answer

If you must use regex then this could be an option:

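import re

# texttag and Dumas_XML are the variables defined in the question's code.

# Heading: from the first <word> up to the first 'NOUN' tag (non-greedy).
pattern1 = re.compile(r"<word>.*?'NOUN'</pos>", re.DOTALL)
# Body: everything after that first 'NOUN' tag, up to the end of the text.
pattern2 = re.compile(r"'NOUN'</pos>(.*)$", re.DOTALL)

head = pattern1.search(texttag).group()
# Trim the line breaks around the body so no blank lines are written.
body = pattern2.search(texttag).group(1).strip("\r\n")

text = "<chap1>\n<head>" + head + "\n</head>\n"
text += body
text += "\n</chap>\n"
print(text)
Dumas_XML.write(text)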

output:

<chap1>
<head><word>'CHAP'</word><pos> 'ADJ'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'Ier'</word><pos> 'NOUN'</pos>
</head>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'                '</word><pos> 'SPACE'</pos>
<word>'Marseille'</word><pos> 'PROPN'</pos>
<word>'.'</word><pos> 'PUNCT'</pos>
<word>'L’'</word><pos> 'PROPN'</pos>
<word>'arrivée'</word><pos> 'NOUN'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'\n'</word><pos> 'SPACE'</pos>
<word>'    '</word><pos> 'SPACE'</pos>
<word>'Le'</word><pos> 'DET'</pos>
<word>'24'</word><pos> 'NUM'</pos>
<word>'février'</word><pos> 'NOUN'</pos>
<word>'1815'</word><pos> 'NUM'</pos>
</chap>

Is that close to what you are looking for?
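
If the file holds several chapters, one possible extension is re.split with a capturing group, which keeps each matched heading in the result list. The sketch below is only that, a sketch. It differs from the question's pattern in three ways: line breaks are matched with \r?\n because Python's text mode converts "\r\n" to "\n" when reading, the dot is escaped, and the numeral accepts lowercase letters so that 'Ier' matches. SAMPLE, HEADING and tag_chapters are illustrative names, and the second chapter in SAMPLE is invented for the demonstration.

```python
import re

# Sample with two chapter headings, in the same shape as the question's file.
SAMPLE = "\n".join([
    "<word>'CHAP'</word><pos> 'ADJ'</pos>",
    "<word>'.'</word><pos> 'PUNCT'</pos>",
    "<word>'Ier'</word><pos> 'NOUN'</pos>",
    "<word>'Marseille'</word><pos> 'PROPN'</pos>",
    "<word>'CHAP'</word><pos> 'ADJ'</pos>",
    "<word>'.'</word><pos> 'PUNCT'</pos>",
    "<word>'II'</word><pos> 'NOUN'</pos>",
    "<word>'Le'</word><pos> 'DET'</pos>",
]) + "\n"

# The outer parentheses form a capturing group, so re.split keeps the headings.
HEADING = re.compile(
    r"(<word>'CHAP'</word><pos> '[A-Z]{2,5}'</pos>\r?\n"
    r"<word>'\.'</word><pos> 'PUNCT'</pos>\r?\n"
    r"<word>'[A-Za-z]{1,7}'</word><pos> '[A-Z]{1,7}'</pos>)"
)


def tag_chapters(text):
    """Wrap each heading in <head> and each chapter in <chapN> ... </chap>."""
    # parts = [text before first heading, heading1, body1, heading2, body2, ...]
    parts = HEADING.split(text)
    out = []
    # Text before the first heading (parts[0]) is ignored in this sketch.
    for number, (head, body) in enumerate(zip(parts[1::2], parts[2::2]), start=1):
        out.append(f"<chap{number}>\n<head>{head}\n</head>")
        out.append(body.strip("\r\n"))
        # Closing tag kept as </chap> to match the layout asked for in the question.
        out.append("</chap>")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(tag_chapters(SAMPLE))
```

Pairing parts[1::2] with parts[2::2] relies on the pattern having exactly one capturing group, since re.split inserts one list item per group between the surrounding pieces of text.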

Alexander