I am working on a project where one of the steps is to separate the text of scientific articles into sentences. For this, I am using textrank, which, as I understand it, looks for ".", "?", "!", etc. to identify the end of a sentence during tokenization.
The problem I am running into is sentences that end with a period followed directly by a reference number (that also might be in brackets). The examples below represent the patterns I identified and collected so far.
xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world", "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")
I did some searching, and it looks like "sentence boundary detection" is a field in its own right that can get complex and domain-specific.
The only way I can think of to fix this problem (in my case at least) is to write a regex that adds a space after the period, so that textrank can identify the sentence boundary using its usual pattern.
Any suggestions on how to do that with a regex in R? I tried my best to search online, but I could not find an answer.
This question explains how to add a space between a lowercase letter followed by an uppercase letter: "Add space between two letters in a string in R". In my case, I believe I will need to add a space between a letter followed by a period and a number or bracket.
My expected output is something like:
("hello. 1 World", "hello. 1,2 World", "hello. (1) world", "hello. (1,2) World", "hello. [1,2] World", "hello. [1] World")
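For reference, here is a minimal sketch of the kind of substitution described above. It assumes base R's gsub with perl = TRUE (for lookarounds), and the helper name add_space_after_period is made up for illustration. It only targets a period that sits between a letter and a digit or opening bracket, so it is tuned to the six patterns listed here and may not cover other citation styles.

```r
# Example inputs copied from the question.
xx <- c("hello.1 World", "hello.1,2 World", "hello.(1) world",
        "hello.(1,2) World", "hello.[1,2] World", "hello.[1] World")

# Hypothetical helper: insert a space after a period that is preceded by a
# letter and followed by a digit, "(" or "[".
# The lookbehind/lookahead require perl = TRUE.
add_space_after_period <- function(x) {
  gsub("(?<=[[:alpha:]])\\.(?=[0-9(\\[])", ". ", x, perl = TRUE)
}

result <- add_space_after_period(xx)
print(result)
```

Because the lookbehind requires a letter before the period, decimals such as "3.14" should be left alone; a token like "v.2" would still be split, so the pattern may need tightening for real articles.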
Thank you