I have lots of text files that I need to convert to .xml so that I can work with them more efficiently (I am supposed to build a couple of language models to analyze English dialects).
The files look like this:
<I> <IFL-IDN W2C-001 #1:1> <#> <h> <bold> Some Statement that I can edit </bold> <bold> followed by another </bold> </h>
<IFL-IDN W2C-001 #2:1> <p> <#> more and more text that is not very relevant . </p></I>
There are about 500 words per file. What I want to do is identify the tags and close the unclosed ones, such as <#>, at the end of the sentence.
Then I'd like to convert the whole .txt files to valid XML files, with a tag before and after every word. I could have separated the words with .split(), but the problem is that these kinds of tags have spaces in them.
The best code I could come up with is to call .splitlines(), then .split() on each sentence, then try to identify the pieces of tags that got split apart.
Here is the code for that:
# Read the whole file and split it into lines.
with open("w2c-001.txt") as f:
    Korpus = f.read().splitlines()

for i in Korpus:
    Sentence = []
    for token in i.split():
        # If the previous token opened a tag ("<...") but did not close it,
        # this token is part of the same tag, e.g. "<IFL-IDN", "W2C-001", "#1:1>".
        if Sentence and Sentence[-1].startswith("<") and not Sentence[-1].endswith(">"):
            Sentence[-1] = Sentence[-1] + " " + token
        else:
            Sentence.append(token)
    print(Sentence)
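One alternative to re-joining tokens after .split() might be a regular expression that treats everything between < and > as a single token. The following is only a rough sketch: the names `TOKEN_RE` and `tokenize` are made up for illustration, and it assumes that no tag contains a literal > inside it and that a bare < never appears in the running text.

```python
import re

# Either a whole tag (everything from "<" to the next ">"),
# or a run of characters that contains neither whitespace nor "<".
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")

def tokenize(line):
    """Split one corpus line into tags and words, keeping tags with spaces whole."""
    return TOKEN_RE.findall(line)

line = "<IFL-IDN W2C-001 #2:1> <p> <#> more and more text . </p></I>"
tokens = tokenize(line)
print(tokens)
```

With this approach, tags that are glued together without a space, such as </p></I>, should also come out as separate tokens.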
My initial thought was that if I can write something that saves valid XML to a .txt file, converting that file to .xml shouldn't be a big problem. I can't find a Python library that can do this: the ElementTree library can create XML, but I found nothing that can identify the tags in existing text and convert it.
Thank you in advance, any help would be greatly appreciated.
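To make the goal more concrete, here is a minimal sketch of the kind of output I have in mind, built with xml.etree.ElementTree from the standard library. The element names `text`, `s` and `w` and the helper `line_to_element` are placeholders I made up, not something the corpus prescribes. The sketch simply drops the corpus tags, because names like <#> or <IFL-IDN W2C-001 #1:1> are not well-formed XML as they stand and would have to be mapped to something else first.

```python
import re
import xml.etree.ElementTree as ET

# A whole tag, or a run of non-space characters that is not part of a tag.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")

def line_to_element(line):
    """Build a placeholder <s> element with one <w> child per word."""
    sentence = ET.Element("s")
    for token in TOKEN_RE.findall(line):
        if token.startswith("<"):
            continue  # corpus tags are skipped in this sketch
        word = ET.SubElement(sentence, "w")
        word.text = token  # ElementTree escapes &, < and > when serializing
    return sentence

root = ET.Element("text")
for line in ["<p> <#> more & more text . </p>"]:
    root.append(line_to_element(line))

xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

Because the tree is built element by element instead of by string concatenation, the serialized result should be well-formed and can be parsed back with ET.fromstring.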