0

I have lots of text files that I need to convert to .xml in order to be able to work with more efficiently (I am supposed to be doing a couple of language models to analyze English dialects)

the files go like this:

<I> <IFL-IDN W2C-001 #1:1> <#> <h> <bold> Some Statement that I can edit </bold> <bold> followed by another </bold> </h> 
    <IFL-IDN W2C-001 #2:1> <p> <#> more and more text that is not very relevant . </p></I>

There are about 500 words per file, what I want to do is to identify the tags, and close the unclosed ones like <#> and at the end of the sentence.

then I'd like to convert the whole .txt files to valid xml files with before and after every word. I could have separated that with .split() but the problem is those kind of tags have spaces in them.

The best code I could come up with is to splilines(), then .split() on a sentence, then try to Identify the

here is the code for that

Korpus = open("w2c-001.txt").read().splitlines()

for i in Korpus:
    Sentence = i.split()
    for j in  range(0,len(Sentence)-2):
        if((Sentence[j][0]=='<' and Sentence[j][len(Sentence[j])-1]!='>') or( Sentence[j][0]!='<' and Sentence[j][len(Sentence[j])-1]=='>')):
            Sentence[j] = Sentence[j] + " " + Sentence[j+1] +" " + Sentence[j+2]
            Sentence.remove(Sentence[j+1])
            Sentence.remove(Sentence[j+2])
            #print(Sentence[j])
        print(Sentence[j])    

My intial thought was If I can write something even to save a valid xml in a .txt file, converting that file to a .xml shouldn't be a big porblem. I can't find a python library that can do this, eltree library can create xml, but I found nothing to identify it and convert it.

Thank you in advance, any help would be very appreciated.

Darwin
  • 3
  • 3
  • I suggest using a parser http://stackoverflow.com/questions/1912434/how-do-i-parse-xml-in-python – Alexis Benichoux Jun 15 '16 at 16:22
  • I can use those libraries to parse to xml provided that my txt files have valid xml in them, the problem is that they don't, there is no root element for example, and one element has spaces in it and is unclosed. once I get valid xml in my txt files I could then use the parser, but now, it just gives me xml syntacts errors – Darwin Jun 18 '16 at 11:53
  • XML is not a very specific thing. It's a data representation format - so you need to include in your question I think the format of the xml you want out - and for some background, also why you need it. – John Deverall Jun 19 '16 at 16:07

1 Answers1

0

First, you don't have to load the file and split lines, you can iterate over the lines. An xml parser can be applied to each line separatly.

Korpus = open("w2c-001.txt")
for line in Korpus:
    ...

If you want to parse it yourself, use regular expressions to find the tags

import re
re.findall(r'<[a-z]*>','<h> <bold> Some Statement that I can edit </bold> <bold> followed by another </bold> </h>')

XML is not a file format, it is a langage, just write plain text to .xml file and you're done.

Alexis Benichoux
  • 790
  • 4
  • 13