For example, I have three sentences like the ones below, where the middle sentence contains a citation (Warren and Pereira, 1982). The citation is always in parentheses, in this format: (~string~comma(,)~space~number~)

He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits.

I'm using a regex to extract only the middle sentence, but it keeps printing all three sentences. The result should look like this:

The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).

Dharman
gameon67
  • Is it always the middle sentence, or is the citation always in brackets? – A H Bensiali Aug 13 '17 at 08:35
  • It's not always in the middle sentence. The most important thing is that the citation is always in brackets with this format: (~string~comma(,)~space~number~) – gameon67 Aug 13 '17 at 08:37

2 Answers

The setup: two sample texts representing the cases of interest.

text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

t2 = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural. CHAT-80 was a state of the art natural language system that was impressive on its own merits."

First, to match in the case where the citation is at the end of a sentence:

p1 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"

To match when the citation is not at the end of a sentence:

p2 = r"\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)"

Combining both cases with the '|' regex operator:

import re

p_main = re.compile(r"\. (.*\([A-Za-z]+ .* [0-9]+\)\.+?)"
                    r"|\. (.*\([A-Za-z]+ .* [0-9]+\)[^\.]+\.+?)")

Running:

>>> print(re.findall(p_main, text))
[('The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982).', '')]

>>> print(re.findall(p_main, t2))
[('', 'The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982) fgbhdr was a state of the art natural.')]

In both cases you get the sentence with the citation.
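
As a possible simplification, the two alternatives can be folded into a single pattern. The sketch below is not the pattern used above. It assumes that sentences end with a period, that the only periods in a cited sentence are inside the citation or at the very end (so abbreviations such as "Mr." would break it), and that a citation looks like "(authors, number)". The names CITED_SENTENCE and find_cited_sentences are made up for illustration.

```python
import re

# Assumed citation shape: "(<authors>, <digits>)", e.g. "(Warren and Pereira, 1982)".
# [^.]* keeps the match inside one sentence, so earlier sentences are not swallowed.
CITED_SENTENCE = re.compile(r"[^.]*\([^()]+, \d+\)[^.]*\.")


def find_cited_sentences(text):
    """Return every sentence that contains a parenthesised citation."""
    # The pattern has no capturing groups, so findall() returns whole matches.
    return [match.strip() for match in CITED_SENTENCE.findall(text)]


sample = ("First one. Second one. The system is built upon CHAT-80 "
          "(Warren and Pereira, 1982). Last one.")
print(find_cited_sentences(sample))
```

Because there are no capturing groups, the result is a flat list of sentences rather than tuples with an empty slot, and the same call covers a citation at the end of a sentence and one in the middle.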

A good resource is the Python regular expressions documentation (the re module) and the accompanying Regular Expression HOWTO page.

Cheers

Xero Smith
text = "He lives in Nidarvoll and tonight i must reach a train to Oslo at 6 oclock. The system, called BusTUC is built upon the classical system CHAT-80 (Warren and Pereira, 1982). CHAT-80 was a state of the art natural language system that was impressive on its own merits."

You can split the text into a list of sentences and then pick the ones that end with ")".

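sentences = text.split(".")[:-1]  # drop the empty string after the final "."

for sentence in sentences:
    # endswith() is also safe for empty strings, unlike sentence[-1]
    if sentence.endswith(")"):
        print(sentence.strip() + ".")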
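
If the citation can appear in the middle of a sentence, or the text contains abbreviations such as "Mr.", splitting on "." is not enough. One rough way around both problems is sketched below. It assumes a new sentence starts with an uppercase letter after ".", "!" or "?", and it relies on a small, hypothetical abbreviation list that you would need to extend for real text. The names CITATION, ABBREVIATIONS, split_sentences and sentences_with_citation are illustrative only.

```python
import re

# Assumed citation shape: "(<authors>, <digits>)", e.g. "(Warren and Pereira, 1982)".
CITATION = re.compile(r"\([^()]+, \d+\)")
# Hypothetical abbreviation list; extend it for your own text.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}


def split_sentences(text):
    # Split after ., ! or ? when followed by whitespace and an uppercase letter.
    pieces = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    sentences = []
    for piece in pieces:
        # Re-join a piece that was cut off right after an abbreviation.
        if sentences and sentences[-1].split()[-1] in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


def sentences_with_citation(text):
    # Keep only the sentences that contain a citation anywhere inside them.
    return [s for s in split_sentences(text) if CITATION.search(s)]


example = ("Mr. John lives in Nidarvoll. The system is built upon CHAT-80 "
           "(Warren and Pereira, 1982) and it works. CHAT-80 was impressive.")
for sentence in sentences_with_citation(example):
    print(sentence)
```

For anything beyond simple text, a dedicated sentence tokenizer is likely to be more reliable than a hand-written split.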
Eren Tantekin
  • Thanks, so I don't always have to use a regex. But what if the citation is not at the end of the sentence? And what if the surrounding sentences contain a string like "Mr. John" (which has a dot), so we can't split the sentences on '.'? – gameon67 Aug 13 '17 at 08:55