Basic numbering of sentence splits?

Question

Possible Duplicate:
Numbering the sentences inside a <P> in a .xml file?

I'm just starting out in programming, so this problem is probably trivial for everyone but me. I have a .xml file containing content like:

<p> sentence1. sentence2. sentence3.</p>
<p> sentence1. </p>

Now I have written a script with BeautifulSoup that appends a STRING to the end of each paragraph, so it looks like:

<p> sentence1. sentence2. sentence3. STRING</p>
<p> sentence1. STRING </p>

For a <p> that contains only 1 sentence, that is all I want to do. But if a <p> contains more than 1 sentence, I want to append the STRING plus the sentence number to the end of each sentence. For example, the first paragraph above would become:

<p> sentence1. STRING1 sentence2. STRING2 sentence3. STRING3 </p>

Here is my working script for 1 sentence with the .append method, but I couldn't get it to work for multiple sentences. Any help would be appreciated!

soup = BeautifulSoup(xmlfile)
p = norm.findAll("p")

for i in p:
    dotsplit = re.compile(r'\. \w')
    sentences = dotsplit.split(i.text)

    if len(sentences) == 1:
        appendix = "STRING"
        i.append(appendix)
        print i

    if len(sentences) > 1:
        for x in sentences:
            sentencenumber = ???????  
            # Should equal (index of sentences)+1,  meaning sentences[0] = 1
            appendix = sentencenumber + "STRING"
            i.append(appendix)
            print i

How is this different from your previous post: http://stackoverflow.com/questions/12643798/numbering-the-sentences-inside-a-p-in-a-xml-file ? — Jon Clements, Sep 30 '12 at 12:04
On somewhat unrelated notes: I'm not sure where `norm` has come from. Also, there's little point in using `re.compile` inside a loop and re-assigning it every time -- put it outside the loop, or just use `re.split(r'\. \w')` - the library will "intern" the string, and "cache" the regex anyway... — Jon Clements, Sep 30 '12 at 12:53

score 1 · Answer 1 · answered Sep 30 '12 at 12:04

That should be enough:

if len(sentences) > 1:
    for n, x in enumerate(sentences):
        sentencenumber = n + 1
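
To make the counting concrete, here is a minimal sketch of the same loop run over a hypothetical, already-split list of sentences. The sample data and the `numbered` and `result` names are illustrative assumptions, not part of the original script:

```python
# Hypothetical sample data standing in for the split paragraph text.
sentences = ["sentence1.", "sentence2.", "sentence3."]

numbered = []
for n, x in enumerate(sentences):
    sentencenumber = n + 1  # enumerate counts from 0, so shift by one
    # Build "sentence STRING<number>" for each sentence.
    numbered.append("%s STRING%d" % (x, sentencenumber))

# Join the pieces back into a single paragraph string.
result = " ".join(numbered)
```

Collecting the pieces in a list and joining them once keeps each number next to its own sentence, rather than piling everything up at the end of the paragraph.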

ILJICH

You can pass a start number to enumerate(), so you don't need to do n + 1. Just change the loop to "for sentence_number, x in enumerate(sentences, 1):" – monkut Sep 30 '12 at 12:06
Thank you for your answer! I have managed to get the count, but I don't know how to append it after each sentence. At the moment, all the STRINGs are added to the end of the paragraph. If I use x.append I get an error, because this method does not apply to unicode? – Elip Sep 30 '12 at 12:50

kalgasnik · Accepted Answer · 2012-09-30T13:44:53.367

If I understand you correctly:

if len(sentences) == 1:
    print sentences[0] + 'STRING'
elif len(sentences) > 1:
    isentences = ('%s%s%d' % (s, 'STRING', i) for i, s in enumerate(sentences, 1))
    print ' '.join(isentences)

"I don't know how to append it after each sentence"

The BeautifulSoup documentation says to use the tag.string.replace_with method instead of tag.append:

    isentences = ('%s%s%d' % (s, 'STRING', i) for i, s in enumerate(sentences, 1))
    i.string.replace_with(' '.join(isentences))
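
One caveat with the question's pattern: splitting on r'\. \w' consumes the period and the first letter of the following sentence, so 'sentence1. sentence2.' comes back as ['sentence1', 'entence2.']. The sketch below is one possible way around that, using only the standard library. The function name `number_sentences`, the `marker` parameter, and the lookbehind pattern are assumptions made for illustration, not something taken from the original script or from BeautifulSoup:

```python
import re

# Split on whitespace that follows a period. The lookbehind keeps the
# period attached to its sentence and does not consume the next letter.
SENTENCE_BOUNDARY = re.compile(r'(?<=\.)\s+')


def number_sentences(text, marker='STRING'):
    """Append marker to a lone sentence, or marker plus index to each of several."""
    stripped = text.strip()
    sentences = [s for s in SENTENCE_BOUNDARY.split(stripped) if s]
    if len(sentences) <= 1:
        # Single sentence: plain marker, no number.
        return '%s %s' % (stripped, marker)
    # Several sentences: number them starting from 1.
    return ' '.join('%s %s%d' % (s, marker, n)
                    for n, s in enumerate(sentences, 1))
```

The returned string could then be passed to i.string.replace_with(...) as shown above. This simple boundary rule will still mis-split abbreviations such as "e.g. this", so treat it as a starting point only.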
