
The text in question is http://pastebin.com/gD65sS22, an abstract from a paper about viral media. I use the example code from http://www.nltk.org/api/nltk.tokenize.html, i.e. I read the text file, load the punkt/english.pickle tokenizer, and print the resulting sentences.

The output is bad, essentially. Almost none of the 'e.g.'s are handled correctly, and several citations get split in the middle.

Is this just a general NLTK weakness, or am I doing something wrong? Should I investigate using regexes instead?

2 Answers

First, your text is a little noisy: if nltk.sent_tokenize sees a newline (\r\n), it can break there and treat it as a sentence boundary. Second, sent_tokenize isn't very good with text that contains in-sentence full stops. E.g.:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

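# Fetch the raw paste referenced in the question.
request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

# Join the lines so that line breaks are not mistaken for sentence boundaries.
text = " ".join(response.read().decode('utf8').split('\r\n'))

for sent in sent_tokenize(text):
    print(sent)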

[out]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.
boom goes the dynamite) or quirky Web productions (eg.
nyan cat).
Others show extraordinary events caught on video (eg.
battle at Kruger) or contain political messages (eg.
kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.
1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.
1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al.
(2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al.
(2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al.
2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al.
2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al.
(2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al.
(2013) or Lazer et al.
(2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

Now let's try some hacks:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

def hack(text):
    # Glue 'et al.' to the next token so the tokenizer sees no boundary there.
    return text.replace('et al. ', 'et_al._')

def unhack(text):
    # Undo the placeholder after tokenization.
    return text.replace('et_al._', 'et al. ')


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[out]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg.
boom goes the dynamite) or quirky Web productions (eg.
nyan cat).
Others show extraordinary events caught on video (eg.
battle at Kruger) or contain political messages (eg.
kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig.
1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig.
1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

Now that looks better, but there are still problems with (eg. ... and Fig. ... Let's continue hacking:

from urllib.request import urlopen, Request
from nltk import sent_tokenize

def hack(text):
    text = text.replace('et al. ', 'et_al._')
    text = text.replace('eg. ', 'eg._')
    text = text.replace('Fig. ', 'Fig._')
    return text

def unhack(text):
    # Restore the trailing space that hack() replaced with '_'.
    text = text.replace('et_al._', 'et al. ')
    text = text.replace('eg._', 'eg. ')
    text = text.replace('Fig._', 'Fig. ')
    return text


request = Request("http://pastebin.com/raw.php?i=gD65sS22")
response = urlopen(request)

text = " ".join(response.read().decode('utf8').split('\r\n'))
text = hack(text)

for sent in sent_tokenize(text):
    print(unhack(sent))

[out]:

Testing text / paper abstract taken from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM15/paper/view/10505  Viral videos have become a staple of the social Web.
The term refers to videos that are uploaded to video sharing sites such as YouTube, Vimeo, or Blip.tv and more or less quickly gain the attention of millions of people.
Viral videos mainly contain humorous content such as bloopers in television shows (eg. boom goes the dynamite) or quirky Web productions (eg. nyan cat).
Others show extraordinary events caught on video (eg. battle at Kruger) or contain political messages (eg. kony 2012).
The arguably most prominent example, however, is the music video Gangnam style by PSY which, as of January 2015, has been viewed over 2 billion times on YouTube.
Yet, while the recent surge in viral videos has been attributed to the availability of affordable digital cameras and video sharing sites (Grossman 2006), viral Web videos predate modern social media.
An example is the dancing baby which appeared in 1996 and was mainly shared via email.
The fact that videos became Internet phenomena already before the first video sharing sites appeared suggests that collective attention to viral videos may spread in form of a contact process.
Put differently, it seems reasonable to surmise that attention to viral videos spreads through the Web very much as viruses spread through the world.
Indeed, the times series shown in Fig. 1 support this intuition.
They show exemplary developments of YouTube view counts and Google searches related to recent viral videos and closely resemble the progress of infection counts often observed in epidemic outbreaks.
However, although viral videos attract growing research efforts, the suitability of the viral metaphor was apparently not studied systematically yet.
In this paper, we therefore ask to what extend the dynamics in Fig. 1 can be explained in terms of the dynamics of epidemics?
This question extends existing viral video research which, so far, can be distinguished into two broad categories: On the one hand, researchers especially in the humanities and in marketing, ask for what it is that draws attention to viral videos (Burgess 2008; Southgate, Westoby, and Page 2010).
In a recent study, Shifman (2012) looked at attributes common to viral videos and, based on a corpus of 30 prominent examples, identified six predominant features, namely: focus on ordinary people, flawed masculinity, humor, simplicity, repetitiveness, and whimsical content.
However, while he argues that these attributes mark a video as incomplete or flawed and therefore invoke further attention or creative dialogue, the presence of these key signifiers does not im- ply virality.
After all, there are millions of videos that show these attributes but never attract significant viewership Another popular line of research, especially among data scientists, therefore consists in analyzing viewing patterns of viral videos.
For instance, Figueiredo et al. (2011) found that the temporal dynamics of view counts of YouTube videos seem to depend on whether or not the material is copyrighted.
While copyrighted videos (typically music videos) were observed to reach peak popularity early in their lifetime, other viral videos had been available for quite some time  before  they  experienced  sudden  significant  bursts  in popularity.
In addition, the authors observed that these bursts depended  on  external  factors  such  as  being  listed  on  the YouTube front page.
The importance of external effects for the viral success of a video was also noted by Broxton et al. (2013) who found that viewership patterns of YouTube videos strongly depend on referrals from sites such as Face- book or Twitter.
In particular, they observed that ‘social’ videos with many outside referrals rise to and fall from peak popularity much quicker than ‘less social’ ones.
Sudden bursts in view counts seem to be suitable predictors of a video’s future popularity (Crane and Sornette 2008; Pinto, Almeida, and Goncalves 2013; Jiang et al. 2014).
In fact, it appears that initial view count statistics combined with additional information as to, say, video related sharing activities in other social media, allow for predicting whether or not a video will ’go viral’ soon (Shamma et al. 2011; Jain, Manweiler, and Choudhury 2014).
Yet, Broxton et al. (2013) point out that not all ‘social’ videos go viral and not all viral videos are indeed ‘social’.
Given this interest in video related time series analysis, it is surprising that the viral metaphor has not been scrutinized from this angle.
To the best of our knowledge, the most closely related work is found in a recent report by CintroArias (2014) who attempted to match an intricate infectious disease model to view count data for the video Gangnam Style.
We, too, investigate the attention dynamics of viral videos from the point of view of mathematical epidemiology and present results based on a data set of more than 800 time series.
Our contributions are of theoretical and empirical nature, namely: 1) we introduce a simple yet expressive probabilistic model of the dynamics of epidemics; in contrast to traditional approaches, our model admits a closed form expression for the evolution of infected counts and we show that it amounts to the convolution of two geometric distributions 2) we introduce a time continuous characterization of this result; major advantages of this continuous model are that it is analytically tractable and allows for the use of highly robust maximum likelihood techniques in model fitting as well as for easily interpretable results 3) we fit our model to YouTube view count data and Google Trends time series which reflect collective attention to prominent viral videos and find it to fit well.
Our work therefore constitutes a data scientific approach towards viral video research.
However, it is model- rather than data driven.
This way, we follow arguments brought forth, for instance, by Bauckhage et al. (2013) or Lazer et al. (2014) who criticized the lack of interpretability and the ‘big data hubris’ of purely data driven approaches for their potential of over-fitting and misleading results.
Our presentation proceeds as follows: Next, we review concepts from mathematical epidemiology, briefly discuss approaches based on systems of differential equations, and introduce the probabilistic model that forms the basis for our study; mathematical details behind this model are deferred to the Appendix.
Then, we present the data we analyzed and discuss our empirical results.
We conclude by summarizing our approach, results, and implications of our findings.

But yeah, cleaning up the text before sentence tokenization is not that difficult: just look for the generic patterns that break the tokenizer. I hope you get the general idea from the above examples.
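The same idea can be written once for a whole list of abbreviations instead of one replace() call per pattern. This is only a sketch: the ABBREVIATIONS list, the protect/restore names, and the private-use placeholder character are my own assumptions, not anything NLTK provides.

```python
import re

# Hypothetical abbreviation list; extend it for your own corpus.
ABBREVIATIONS = ["et al.", "e.g.", "eg.", "i.e.", "Fig."]

# Placeholder for a masked period. Assumption: this private-use
# character never occurs in the input text.
DOT_PLACEHOLDER = "\ue000"

# Match any listed abbreviation that does not start in the middle of a word.
_ABBREV_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)

def protect(text):
    """Mask the periods inside known abbreviations before tokenizing."""
    return _ABBREV_RE.sub(
        lambda m: m.group(0).replace(".", DOT_PLACEHOLDER), text
    )

def restore(text):
    """Turn the placeholders back into periods after tokenizing."""
    return text.replace(DOT_PLACEHOLDER, ".")
```

You would then call sent_tokenize(protect(text)) and run restore() on each resulting sentence, just like hack() and unhack() above.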

So the hacks work, but they only work for this dataset. How do I generalize the hack, then? A more general solution is to retrain a Punkt tokenizer so that you get a sentence tokenizer specific to academic texts; see "training data format for nltk punkt".

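But do note that you will need a sample of text from your domain to train the tokenizer. Punkt learns its abbreviations unsupervised from raw text, so the sample does not have to be split into sentences by hand. Have fun!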
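A minimal sketch of such a retraining step, assuming you have a plain-text sample of academic writing in a string. The function name train_sentence_tokenizer is made up for illustration; PunktTrainer and PunktSentenceTokenizer come from nltk.tokenize.punkt. How well the result works depends heavily on how much text you feed it, so treat this as a starting point rather than a recipe.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

def train_sentence_tokenizer(raw_text):
    """Train a Punkt model on raw domain text and return a tokenizer.

    raw_text is assumed to be a large plain-text sample from the target
    domain (e.g. many paper abstracts concatenated together).
    """
    trainer = PunktTrainer()
    # Collect statistics first; train() can be called again with more text.
    trainer.train(raw_text, finalize=False)
    # Derive abbreviations, collocations and sentence starters.
    trainer.finalize_training()
    return PunktSentenceTokenizer(trainer.get_params())
```

After training, tokenizer.tokenize(text) is used the same way as sent_tokenize(text) in the examples above.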

alvas

Why does SO not send me mails when people answer these... Anyway. I kept investigating the issue and eventually stumbled upon an answer via Google: use the tokenizer's parameters to add more "known" abbrev_types:

tokenizer._params.abbrev_types.update(extra_abbreviations)

extra_abbreviations is language- and context-specific. But even using ['e.g', 'al', 'i.e'] dramatically improves the result in my case, which really makes me wonder why the "trained" English pickle does not appear to contain these.
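For completeness, here is a small self-contained sketch of that approach. To avoid needing the downloaded English model, it uses an untrained PunktSentenceTokenizer(), which is my own simplification; with the pretrained tokenizer the update() call is the same. As far as I can tell from the Punkt source, entries are compared against the lowercased token without its final period, hence 'fig' rather than 'Fig.'. The sample sentence is made up.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Untrained tokenizer (assumption: used here only to avoid a model download;
# in practice you would load the pretrained English tokenizer instead).
tokenizer = PunktSentenceTokenizer()

# Lowercase, without the trailing period.
extra_abbreviations = {"e.g", "i.e", "al", "fig"}

# _params is a private attribute, so this may change between NLTK versions.
tokenizer._params.abbrev_types.update(extra_abbreviations)

sample = (
    "Viewership depends on referrals (Broxton et al. 2013). "
    "Others, e.g. Shifman, disagree."
)

for sentence in tokenizer.tokenize(sample):
    print(sentence)
```

Since this reaches into a private attribute, it is more of a pragmatic workaround than a supported API.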