4

I want to open a file and get sentences. The sentences in the file span multiple lines, like this:

"He said, 'I'll pay you five pounds a week if I can have it on my own
terms.'  I'm a poor woman, sir, and Mr. Warren earns little, and the
money meant much to me.  He took out a ten-pound note, and he held it
out to me then and there. 

Currently I'm using this code:

import re

text = ' '.join(file_to_open.readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

readlines cuts through the sentences. Is there a good way to solve this so I get only the sentences (without NLTK)?

Thanks for your attention.

The current problem:

import re

file_to_read = 'test.txt'

with open(file_to_read) as f:
    text = f.read()

word_list = ['Mrs.', 'Mr.']

for i in word_list:
    text = re.sub(i, i[:-1], text)    # try to strip the trailing dot from each abbreviation

What I get back (in the test case) is that 'Mrs.' is changed to 'Mr', while 'Mr.' is just changed to 'Mr'. I tried several other things, but they don't seem to work. The answer is probably easy, but I'm missing it.

user3119123

2 Answers

3

Your regex works on the text above if you do this:

import re

with open(filename) as f:
    text = f.read()    # read the whole file as a single string instead of joining readlines()

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The only problem is that the regex also splits on the dot in "Mr." in your text above, so you need to fix/change that.
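
For example, splitting a single line from your text shows the unwanted break on the abbreviation:

>>> import re
>>> re.split(r' *[\.\?!][\'"\)\]]* *', "Mr. Warren earns little.")
['Mr', 'Warren earns little', '']

The 'Mr' gets cut off at the dot, which is not what you want.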

One solution to this, though not perfect, is to take out all occurrences of the dot after 'Mr' or 'Mrs':

text = re.sub(r'(M\w{1,2})\.', r'\1', text) # no for loop needed for this, like there was before

This matches an 'M' followed by a minimum of 1 and a maximum of 2 word characters (\w{1,2}), followed by a dot. The parenthesised part of the pattern is grouped and captured, and it's referenced in the replacement as '\1' (group 1; you could have more parenthesised groups). So essentially, 'Mr.' or 'Mrs.' is matched, but only the 'Mr' or 'Mrs' part is captured, and 'Mr.' or 'Mrs.' is then replaced by the captured part, which excludes the dot.
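
A quick interactive check of that substitution, on a made-up snippet just for illustration:

>>> import re
>>> re.sub(r'(M\w{1,2})\.', r'\1', "Mr. Warren and Mrs. Warren")
'Mr Warren and Mrs Warren'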

and then:

sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

will work the way you want.
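
Putting the two steps together, something along these lines should do it (a minimal sketch, assuming the test.txt file name from your question):

import re

with open('test.txt') as f:
    text = f.read()    # whole file as one string, so sentences can run across lines

# drop the dot after abbreviations like Mr./Mrs. so it isn't treated as a sentence end
text = re.sub(r'(M\w{1,2})\.', r'\1', text)

# split on ., ? or ! followed by optional closing quotes/brackets and spaces
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)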

Totem
  • Thanks, this seems to be working! I'm pretty new at this, so this helps a lot. Yes, the regex isn't perfect; I'm still working on it. If you have tips, I'd appreciate it! Thanks again – user3119123 Dec 21 '13 at 13:24
  • No harm in a good regex cheat sheet: http://cloud.github.com/downloads/tartley/python-regex-cheatsheet/cheatsheet.pdf – Totem Dec 21 '13 at 13:26
  • I edited again. Let me know if this works for you and consider accepting it as the answer if it helps. – Totem Dec 21 '13 at 13:39
  • @Totem Great! :) I was already wrestling with a way to take out certain elements of the text. You really helped me out on that one, thanks a lot! – user3119123 Dec 21 '13 at 13:40
  • Thanks again for the help. Just a small question: I tried the for loop and it seems that it adjusts only the last word in The_list. (I tried alternatives as well, with the same result.) Do you have any ideas? – user3119123 Dec 21 '13 at 14:54
  • 1
    ok, so first off, The_list as mentioned in my loop should consist of the words(like 'Mr.', 'Mrs.' or whatever else) that you want to change. make sure that the text you are replacing within is indeed a block of text acquired with f.read()(or whatever name f might have). I just tried it again myself, and it worked fine. If it won't work for you, maybe update your post with the code you have now, so I can have a look. – Totem Dec 21 '13 at 14:59
  • 1
    Also, IMPORTANT, make sure that within the for loop, you have 'text = re.sub(i, i[:-1], text)' and not just 're.sub(i, i[:-1], text)' – Totem Dec 21 '13 at 15:02
  • 1
    Ok I have edited my post with what I hope should fix your problem, It works for me.. please let me know. – Totem Dec 21 '13 at 18:16
  • Thanks so much for your help! Seems to be working now. I'm going to spend some more time learning regex; as you showed, it can really help a person out. Thanks again! – user3119123 Dec 21 '13 at 21:33
1

You may want to try out the text-sentence tokenizer module.

From their example code:

>>> from text_sentence import Tokenizer
>>> t = Tokenizer()
>>> list(t.tokenize("This is first sentence. This is second one!And this is third, is it?"))
[T('this'/sent_start), T('is'), T('first'), T('sentence'), T('.'/sent_end),
 T('this'/sent_start), T('is'), T('second'), T('one'), T('!'/sent_end),
 T('and'/sent_start), T('this'), T('is'), T('third'), T(','/inner_sep),
 T('is'), T('it'), T('?'/sent_end)]

I've never actually tried it, though; I'd prefer using NLTK/punkt.
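
Since the question asked for a solution without NLTK this is only an aside, but for comparison a minimal punkt-based version would look roughly like this (it assumes NLTK is installed and downloads the punkt model on first use):

import nltk
nltk.download('punkt')                       # one-time download of the punkt sentence model
from nltk.tokenize import sent_tokenize

with open('test.txt') as f:
    text = f.read()

sentences = sent_tokenize(text)              # punkt copes with abbreviations such as "Mr."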

Elias Dorneles