0

I am trying to convert my text documents into an XML structure in which each sentence gets an id. The documents contain unstructured sentences, and I would like to split them on the '.' delimiter and write the result to XML. Here is my code:

    import re

    #Read the file
    with open ('C:\\Users\\ngwak\\Documents\\test.txt') as f:
        content = [f]
        split_content = []
        for element in content:
            split_content += re.split("(.)\s+", element)

        print(split_content, sep='\n\n')

But I am already getting this error, and I can't interpret it:

    TypeError: expected string or buffer

How can I split my sentences and write them to XML? Thanks a lot. This is what my txt file looks like:

In a formal sense, the germ of national consciousness can be traced back to the Peace Treaty of Hoachanas signed in 13–June-1858 between soldiers, all the chiefs except those of the Bondelswarts (who had not been involved in the previous fighting), as well as by Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the Triku people. There is ample epistolary as well as oral evidence for this view. The most poignant statement is to be found in the now famous and oft-quoted letter of Onag to Bonagha written on May 13, 1890 in which, amongst other things, he says that on June 13 there are people coming. Again on the 01.02.2015 till the 01.05 there are some coming.

And I would like the sentences to look like this in XML:

    <sentence id=01>In a formal sense, the germ of national consciousness 
    can be traced back to the Peace Treaty of Hoachanas signed in 13–June-
    1858 between soldiers, all the  chiefs except those of the Bondelswarts 
    (who had not been involved in the previous fighting), as well as by 
    Muewuta, two sons of  amuaha, formerly a Commandant of Chief Onag of the 
    Triku people. </sentence>
Sumner Evans

2 Answers

3
    # Read the whole file, turn line breaks into spaces, and split on periods
    with open('C:\\Users\\ngwak\\Documents\\test.txt', "r") as text_file:
        textLinesFromFile = text_file.read().replace("\n", " ").split('.')

    for sentence in textLinesFromFile:
        # Skip the empty piece left after the final period
        if sentence.strip():
            print(sentence.strip())
            # Or write each sentence to your XML
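
Note that splitting on every '.' also cuts dates such as 01.02.2015 in the sample text into pieces. Below is a minimal sketch of one alternative, assuming a sentence ends at a period that is followed by whitespace. The split_sentences helper name is made up for illustration, and abbreviations like "e.g. this" would still be split.

```python
import re

def split_sentences(text):
    """Split text after each period that is followed by whitespace."""
    # Collapse line breaks and repeated spaces so wrapped sentences stay whole.
    flat = " ".join(text.split())
    # The lookbehind keeps the period with its sentence; the dots inside
    # "01.02.2015" are not followed by whitespace, so the date survives.
    parts = re.split(r"(?<=\.)\s+", flat)
    return [part for part in parts if part]

sample = ("There are people coming. Again on the 01.02.2015 "
          "till the 01.05 there are some coming.")
for sentence in split_sentences(sample):
    print(sentence)
```

With the sample above this yields two sentences, and the date stays intact.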
Sachin Patel
2

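You don't need the content = [f] line. It puts the file object itself into a list, so re.split is handed a file rather than a string, which is what raises the TypeError.

    import re

    with open('C:\\Users\\ngwak\\Documents\\test.txt') as file:
        split_content = []
        for line in file:
            # Split after each period that is followed by whitespace
            split_content += re.split(r"(?<=\.)\s+", line.strip())

        print(*split_content, sep='\n\n')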

File objects are iterable. Using them in a for loop will iterate over each line.
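
To get from a list of sentences to the XML the question asks for, one option is the standard library's xml.etree.ElementTree. This is a minimal sketch, not a definitive implementation: the root tag name "document", the two-digit id format, and the sentences_to_xml helper are all assumptions. Well-formed XML needs the attribute value quoted, so the output has id="01" rather than id=01.

```python
import xml.etree.ElementTree as ET

def sentences_to_xml(sentences):
    """Wrap each sentence in a <sentence> element with a zero-padded id."""
    root = ET.Element("document")  # root tag name is an assumption
    for number, sentence in enumerate(sentences, start=1):
        node = ET.SubElement(root, "sentence", id="%02d" % number)
        node.text = sentence
    return root

root = sentences_to_xml(["First sentence.", "Second sentence."])
print(ET.tostring(root, encoding="unicode"))

# To save to a file instead of printing (the file name is hypothetical):
# ET.ElementTree(root).write("sentences.xml", encoding="utf-8", xml_declaration=True)
```

ElementTree escapes characters such as & and < in the sentence text, which plain string concatenation would not do.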


Sumner Evans
  • @Sumner Evans thanks for your correction. I noticed I didn't need the content = [f], but after changing it I still cannot print the split sentences. – Nampa Gwakondo Jul 26 '17 at 15:45
  • @NampaGwakondo, please update your question and add a bit more description on what you are seeing currently and what you want to see. – Sumner Evans Jul 26 '17 at 15:51
  • I am actually not seeing anything. I have edited my question as to what I want to have at the end of the day. – Nampa Gwakondo Jul 26 '17 at 16:05