How to split texts into sentences and write them to xml

Question

I am trying to convert my text document into an XML structure in which each sentence gets an id. I have text documents with unstructured sentences, and I would like to split them into sentences using '.' as the delimiter and write them to XML. Here is my code:

    import re

    #Read the file
    with open ('C:\\Users\\ngwak\\Documents\\test.txt') as f:
        content = [f]
        split_content = []
        for element in content:
            split_content += re.split("(.)\s+", element)

        print(split_content, sep='\n\n')

But I am already getting this error and I can't interpret it:

    TypeError: expected string or buffer

How can I split my sentences and write them to XML? Thanks a lot. This is what my txt file looks like:

In a formal sense, the germ of national consciousness can be traced back to the Peace Treaty of Hoachanas signed in 13–June-1858 between soldiers, all the chiefs except those of the Bondelswarts (who had not been involved in the previous fighting), as well as by Muewuta, two sons of amuaha, formerly a Commandant of Chief Onag of the Triku people. There is ample epistolary as well as oral evidence for this view. The most poignant statement is to be found in the now famous and oft-quoted letter of Onag to Bonagha written on May 13, 1890 in which, amongst other things, he says that on June 13 there are people coming. Again on the 01.02.2015 till the 01.05 there are some coming.

And I would like the sentences to be like this in xml:

    <sentence id=01>In a formal sense, the germ of national consciousness 
    can be traced back to the Peace Treaty of Hoachanas signed in 13–June-
    1858 between soldiers, all the  chiefs except those of the Bondelswarts 
    (who had not been involved in the previous fighting), as well as by 
    Muewuta, two sons of  amuaha, formerly a Commandant of Chief Onag of the 
    Triku people. </sentence>

There is only one element in `content` and it's a file object. I'm not sure why you're doing `content = [f]` — roganjosh, Jul 26 '17 at 15:23
@Sumner Evans It is on the line `split_content += re.split("(.)\s+", element)`. I got that; I was trying to parse the content over... Anyway, I changed it, but it is not printing anything for me to see how it split the sentences. – Nampa Gwakondo, Jul 26 '17 at 15:41

score 3 · Accepted Answer · answered Jul 26 '17 at 16:11


    # Read the whole file, join wrapped lines with a space, and split on '.'
    with open('C:\\Users\\ngwak\\Documents\\test.txt', "r") as text_file:
        sentences = text_file.read().replace("\n", " ").split('.')

    for sentence in sentences:
        sentence = sentence.strip()
        if sentence:  # skip the empty piece left after the final '.'
            print(sentence)
            # Or write each sentence to your XML
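
The code above leaves the XML step open. Here is a minimal sketch of one way to do it with the standard library's `xml.etree.ElementTree`, not necessarily what the answer's author had in mind. The function name `sentences_to_xml`, the root element name `document`, and the output file name `sentences.xml` are assumptions made for illustration. Note that XML requires quoted attribute values, so the result is `id="01"` rather than `id=01`.

```python
import xml.etree.ElementTree as ET


def sentences_to_xml(text):
    """Build a <document> element with one <sentence id="NN"> per sentence."""
    root = ET.Element("document")  # hypothetical root element name
    # Naive split on '.', as in the answer; empty pieces are dropped
    pieces = [p.strip() for p in text.replace("\n", " ").split(".")]
    for number, piece in enumerate((p for p in pieces if p), start=1):
        sentence = ET.SubElement(root, "sentence", id=f"{number:02d}")
        sentence.text = piece + "."  # restore the period removed by split
    return root


sample = "There is ample evidence for this view. The letter was written in 1890."
root = sentences_to_xml(sample)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# To save to disk instead (hypothetical file name):
# ET.ElementTree(root).write("sentences.xml", encoding="utf-8", xml_declaration=True)
```

Because this still splits on every '.', dates such as 01.02.2015 in the sample text would be cut into several "sentences". Building the tree with ElementTree instead of concatenating strings also means characters like `&` and `<` in the text get escaped automatically.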


Sachin Patel


Does exactly what I need for splitting the sentences. Thank you. Do you know how I can write each sentence to XML? I will try too. – Nampa Gwakondo Jul 26 '17 at 16:15
@NampaGwakondo, search for Python XML libraries. (Also, this should be the accepted answer.) – Sumner Evans Jul 26 '17 at 16:42
@SumnerEvans could you please have a look at my update. I am trying to save the results to a new textfile but I have an error. – Nampa Gwakondo Jul 28 '17 at 19:20

Sumner Evans · Answer 2 · 2017-07-26T15:28:59.147

2

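You don't need the `content = [f]` line. It wraps the file object in a list, so `re.split` receives the file object itself rather than a string, which is what raises the `TypeError`.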

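    import re

    with open('C:\\Users\\ngwak\\Documents\\test.txt') as file:
        split_content = []
        for line in file:
            # Split after a literal '.' that is followed by whitespace
            split_content += re.split(r"(?<=\.)\s+", line.strip())

    # Unpack the list so that sep is applied between the sentences
    print(*split_content, sep='\n\n')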

File objects are iterable. Using them in a for loop will iterate over each line.
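
Because iteration yields one line at a time, a sentence that wraps across two lines is seen as two separate pieces. The sketch below joins the lines first and then splits with a heuristic: a sentence ends at a '.' followed by whitespace and a capital letter. That rule is an assumption for illustration, not part of the original answer. `io.StringIO` stands in for the opened file so the example runs without the file on disk.

```python
import io
import re

# Stand-in for the opened file; io.StringIO iterates line by line like a file object
fake_file = io.StringIO(
    "There is ample evidence for this view. Again on the 01.02.2015\n"
    "till the 01.05 there are some coming. They arrive on June 13.\n"
)

# Join the lines first so a sentence wrapped across lines stays in one piece
text = " ".join(line.strip() for line in fake_file)

# Heuristic: a sentence ends at '.' followed by whitespace and a capital letter
sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text)

for sentence in sentences:
    print(sentence)
```

Dates such as 01.02.2015 survive because their periods are not followed by whitespace. The heuristic still breaks on abbreviations followed by a capitalized word (for example "Dr. Smith"), so text with many abbreviations may need a dedicated sentence tokenizer.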

Further Reading

Methods on File objects in the Python Docs
The example in this SO answer: Iterating on a file using Python

edited Jul 26 '17 at 15:28

answered Jul 26 '17 at 15:23

Sumner Evans


@Sumner Evans thanks for your correction. I noticed I didn't need the content = [f], but when I changed it I still cannot print to see the split sentences. – Nampa Gwakondo Jul 26 '17 at 15:45
@NampaGwakondo, please update your question and add a bit more description on what you are seeing currently and what you want to see. – Sumner Evans Jul 26 '17 at 15:51
I am actually not seeing anything. I have edited my question as to what I want to have at the end of the day. – Nampa Gwakondo Jul 26 '17 at 16:05
