
I need to add begin- and end-of-sentence markers to texts that I analyze using Quanteda.
I would like to add these markers using Quanteda, but I do not see an explicit way to do that "out of the box".
Searching for an answer, I found a different question about Quanteda and these markers here. Another question about markers here strengthens my guess that this task is done "manually".

This is to ask what is currently the best way to add such markers using Quanteda, and what advantages ("NLP intelligence"?) and disadvantages (lower speed, higher memory use) it would have compared to doing that in custom code.

I am mostly interested in the general answer; any additional advice about the specifics of my case is most welcome. They are:

  • Text size: very large. For instance, when I tried to segment the texts into sentences, Quanteda was still running after 2-3 hours and I always had to kill the session.

  • I would like to use Quanteda, but not at all costs. I am comfortable coding in R, Python and Java and with regexes, and if other non-huge packages bring relevant advantages I have no problem learning and using them for this task (text2vec?).


    Sample of input and desired output.
    Using "sss" and "eee" as begin and end sentence markers:
    input:
    CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC). Outbreak of influenza A in a nursing home - New York, Dec. 1991-Jan. 1992. MMWR Morb Mortal Wkly Rep 1992; 18: 129-31.
    desired output:
    sss CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC) eee sss Outbreak of influenza A in a nursing home - New York, Dec. 1991-Jan. 1992 eee sss MMWR Morb Mortal Wkly Rep 1992; 18: 129-31 eee
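For reference, the custom-code route I am comparing against could look like the following minimal Python sketch. The regex heuristic is an assumption of mine, not an established splitter: it splits on a period followed by whitespace and a capital letter, which happens to leave abbreviations like "Dec. 1991" intact (the next character is a digit). It reproduces the desired output above on this sample, but would obviously misfire on other inputs (e.g. an abbreviation followed by a capitalized word).

```python
import re

# Marker strings taken from the sample above.
BOS, EOS = "sss", "eee"

def add_sentence_markers(text):
    """Split on a period followed by whitespace and an uppercase letter
    (a crude heuristic: 'Dec. 1991' survives because the next character
    is a digit), then wrap each sentence in begin/end markers and drop
    the sentence-final period."""
    sentences = re.split(r"(?<=\.)\s+(?=[A-Z])", text.strip())
    return " ".join(f"{BOS} {s.rstrip('.').strip()} {EOS}" for s in sentences)

text = ("CENTERS FOR DISEASE CONTROL AND PREVENTION (CDC). "
        "Outbreak of influenza A in a nursing home - New York, "
        "Dec. 1991-Jan. 1992. MMWR Morb Mortal Wkly Rep 1992; 18: 129-31.")
print(add_sentence_markers(text))
```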
user778806
  • Could you provide a simple example of your input and desired output? Regarding markers, usually a simple dot is used as a marker to separate sentences. The downstream tool you use is independent of how you split sentences. You might have a look at the examples provided in the `tokenizers` package (which uses `stringi` under the hood for splitting etc.). Another question is: do you really need sentences as such, or are you going to set up a document-term matrix? In the latter case you can process the text files in chunks and grow the dtm step by step (e.g. with text2vec). – Manuel Bickel Aug 02 '18 at 10:39
  • My goal is to generate, as correctly as possible, ngrams for next-word prediction. Very large texts, each on a single line (some lines longer than 10,000 chars). With more ordinary text sizes I think the ideal and normal thing would be to use Quanteda to segment into sentences, but due to the size of the texts the segmentation never finished on my laptop (16 GB RAM, Intel i7, 4 cores/8 threads). Writing this, I realize my wild and probably wrong assumption that Quanteda would not manage ngram generation correctly without segmenting into sentences. Will verify now. – user778806 Aug 02 '18 at 11:29
  • How are you approaching the next-word prediction? Depending on your approach you might not need to extract the sentences but only the ngrams, and use statistics on the ngrams. Furthermore, please always provide a data example in your question, not in the comments, so that examples are more visible to all users reading your question. – Manuel Bickel Aug 02 '18 at 11:43
  • Would like to use only ngrams. Though on a large training text it might matter less, it seems to me that awareness of sentence structure, via markers or other means, helps generate ngrams more correctly (i.e. not building ngrams across sentence boundaries). Moving the sample into the question. – user778806 Aug 02 '18 at 12:10
  • You are right on the one hand, but ngrams over sentence boundaries will be the ngrams with very low likelihood anyway, and will thus be irrelevant for prediction. In any case, splitting text into sentences using the dot as a marker will suffice from my perspective. Then simply build your vocabulary from the sentences. – Manuel Bickel Aug 02 '18 at 12:21
  • Ok, thanks for your advice. – user778806 Aug 02 '18 at 12:41
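The boundary-ngram point from the comments can be illustrated with a small Python sketch. This is my own illustration, not code from the discussion, and the toy token lists are made up: it compares ngrams built per sentence (none cross a boundary) with naive ngrams over the concatenated tokens (which include a spurious cross-boundary pair).

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy example: two already-segmented sentences.
sentences = [["the", "cat", "sat"], ["it", "slept"]]

# Per-sentence bigrams: no bigram spans the sentence boundary.
per_sentence = Counter(g for s in sentences for g in ngrams(s, 2))

# Naive bigrams over the concatenated text include the spurious
# boundary bigram ("sat", "it").
flat = [t for s in sentences for t in s]
naive = Counter(ngrams(flat, 2))

print(("sat", "it") in per_sentence)  # boundary bigram absent
print(("sat", "it") in naive)         # boundary bigram present
```

With begin/end markers added instead, the boundary would surface as explicit ngrams containing the marker tokens, which can then be kept or filtered at will.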
