
I am helping a professor of mine with a research project that involves pulling one thousand sentences randomly from a set of 20 text files. This is all data from the Corpus of Contemporary American English, if anyone is familiar with working with that. In these text files, the data is arranged like so:

##4000348 I must begin by saying this : In preparation for this lecture , I read ( or in some cases reread ) a number of the writings of Sidney Hook . I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook . But instead I found myself infused with a set of ideas that were relevant to a different setting , a different occasion .

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College . That was the reason news of my appointment appeared in the Wall Street Journal and the National Review , which does n't usually happen to deans of Yale College , and does n't help them much when it does .


So, there are hundreds of paragraphs, each starting with an ID number preceded by "##". That number corresponds to the source the sentences were drawn from. I need to pull random sentences from these files and also get the "##" number identifying their source along with them. So ideally, I would get something like:

##4000348 I read them solely to give me the right starting point for a lecture given in honor of Sidney Hook

##4000349 I would like to think I am best known for my wisdom and learning , but in truth such fame as I have derives from my being a reputed conservative who is also dean of Yale College .

I have succeeded in getting random sentences from the files (with some help from the kind souls here at Stack Overflow), but I don't know how to attach the number to them (for example, if I pull a sentence from the middle of a paragraph, how would I get the number from the start of that paragraph?). Can anyone help me think of a way to do this? This is the code I have so far, which successfully extracts sentences.

# -*- coding: utf-8 -*-

import re
from random import sample

sentences = []
for i in range(1990, 2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        # grab every run of text ending in ., ! or ? as a "sentence"
        sentences += re.findall(r".*?[\.\!\?]+", f.read())

selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    f.write('\n'.join(selected))
K. Swan

2 Answers


Perhaps you could use a regex to extract each paragraph along with its source ID, and then extract sentences from the paragraph, similarly to how you're doing it at the moment. This should help you capture the paragraphs:

# with open... etc.
for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
    sentences += [(source_id, sentence) for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]

Now, sentences should be a list of tuples like ('##123', 'A sentence.'), from which you can sample just as before.
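For example, plugged into the loop from the question, it could look something like this (a sketch that reuses the filenames and the sample size of 2000 from the original code):

import re
from random import sample

sentences = []
for i in range(1990, 2013):
    with open('w_acad_{}.txt'.format(i)) as f:
        # each match is a (##-number, paragraph text) pair
        for source_id, paragraph in re.findall(r"(##\d+)([^#]+)", f.read()):
            sentences += [(source_id, sentence.strip())
                          for sentence in re.findall(r".*?[\.\!\?]+", paragraph)]

selected = sample(sentences, 2000)
with open('out.txt', 'w') as f:
    # write each sampled sentence prefixed with its ##-number
    f.write('\n'.join('{} {}'.format(source_id, sentence)
                      for source_id, sentence in selected))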

pzelasko

In general, to avoid loading (potentially large) files into memory all at once, you could use a reservoir sampling algorithm -- just pass it an iterator that yields labeled (with the ##-numbers) sentences:

#!/usr/bin/env python
import re
import nltk  # $ pip install nltk

def paragraphs(file):
    """Yield blank-line separated paragraphs labeled with ##-numbers."""
    lines = []
    for line in file:
        if line.strip():
            lines.append(line)
        elif lines:  # blank line, the end of a non-empty paragraph
            paragraph = ''.join(lines)
            numbers = re.findall(r'##([0-9]+)', paragraph)  # only ASCII-digits
            assert len(numbers) == 1  # only one ##-number per paragraph
            yield int(numbers[0]), paragraph
            del lines[:]

def sentences(filenames):
    for filename in filenames:
        with open(filename) as file:
            for number, paragraph in paragraphs(file):
                for sentence in nltk.sent_tokenize(paragraph):
                    yield number, sentence

filenames = ('w_acad_%d.txt' % n for n in range(1990, 2013))
print(reservoir_sample(sentences(filenames), 2000))

where reservoir_sample() is defined here.
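If that link is unavailable, a minimal sketch of reservoir sampling (Algorithm R) could look like the following; it is not necessarily the same as the linked implementation:

import random

def reservoir_sample(iterable, k):
    """Return up to k items chosen uniformly at random from iterable."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)  # uniform over 0..i inclusive
            if j < k:
                reservoir[j] = item  # keep each new item with probability k/(i+1)
    return reservoir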

nltk.sent_tokenize() may be a more robust solution than the r".*?[\.\!\?]+" regular expression.
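Note that sent_tokenize() depends on NLTK's Punkt models, which usually need a one-time download (assuming a default NLTK installation):

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models
print(nltk.sent_tokenize("I read them solely. But instead I found a set of new ideas."))
# prints the text split into a list of sentences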

jfs
  • Hey, this answer is awesome! I am so sorry that I took so long to respond; I've been a little busy and haven't been working on this project. I used your code, including the reservoir_sample() definition you gave me, and the code runs, but all it prints is this: '[]'. Do you know why this is? – K. Swan Mar 30 '16 at 19:01
  • @K.Swan the input format that the code expects might differ from the actual one. You should modify the `paragraphs()` function to match the actual format of your files. If you can't, create a minimal example input, describe the assumptions (e.g., is it correct that there is a blank line between paragraphs? Is it correct that there is exactly one ##-number per paragraph?), and post it as a new question if you are interested (in practice, a simple regex-based solution such as the one in @pzelasko's answer should be enough). – jfs Mar 30 '16 at 19:21