I have downloaded the BLLIP corpus and would like to import it into NLTK. One way of doing this is described in an answer to the question "How to read corpus of parsed sentences using NLTK in python?". That answer handles a single data file; I want to do the same for a whole collection of files.
The BLLIP corpus comes as a collection of a few million files, each containing a couple of parsed sentences or so. The main folder that contains the data is named bllip_87_89_wsj and it contains 3 subfolders, 1987, 1988, 1989 (one for each year). Each year folder, such as 1987, has sub-subfolders, each containing a number of files of parsed sentences. A sub-subfolder is named something like w7_001 (for folder 1987), and its file names are w7_001.000, w7_001.001, and so on.
With all this at hand, my task is the following: Read all files sequentially using NLTK parsers. Then, convert the corpus to a list of lists, where each sublist is a sentence.
The second part is easy; it's done with the command corpus_name.sents(). It is the first part of the task that I don't know how to approach.
All suggestions are welcome. I would also especially welcome suggestions that propose alternative, more efficient, approaches to the one I have in mind.
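To make the first part of the task concrete, here is a minimal sketch of one possible approach: pointing a single NLTK BracketParseCorpusReader at the corpus root with a regular expression for the file ids, instead of opening files one by one. The helper name load_bllip and the constant BLLIP_FILE_PATTERN are made up for illustration, and the pattern assumes the year/sub-subfolder/file layout described above; it would need adjusting if the real file names differ.

```python
from nltk.corpus.reader import BracketParseCorpusReader

# Assumed layout, taken from the description above:
#   bllip_87_89_wsj/1987/w7_001/w7_001.000
# NLTK matches the pattern against each file path relative to the root,
# using "/" as the separator.
BLLIP_FILE_PATTERN = r"\d{4}/w\d_\d+/w\d_\d+\.\d+"


def load_bllip(root, pattern=BLLIP_FILE_PATTERN):
    """Return one reader that covers every matching file under `root`.

    `load_bllip` is a hypothetical helper, not part of NLTK.
    """
    return BracketParseCorpusReader(root, pattern)
```

Under those assumptions, reader = load_bllip("bllip_87_89_wsj") followed by reader.sents() should give the sentences of all files in sorted file order. The result is a lazy view, so wrapping it in list(...) is probably only sensible if the whole corpus fits in memory.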
UPDATE:
The parsed sentences of the BLLIP corpus are of the following form:
(S (NP (DT the) (JJ little) (NN dog)) (VP (VBD barked)))
In a number of sentences there is a syntactic category of the form (-NONE- *-0), so when I read the corpus, *-0 is considered a word. Is there a way to ignore the syntactic category -NONE-? For example, if I had the sentence
(S (NP-SBJ (-NONE- *-0))
(VP (TO to)
(VP (VB sell)
(NP (NP (PRP$#0 its) (NN TV) (NN station))
(NN advertising)
(NN representation)
(NN operation)
(CC and)
(NN program)
(NN production)
(NN unit))
I would like it to become:
to sell its TV station advertising representation operation and program production unit
and NOT
*-0 to sell its TV station advertising representation operation and program production unit
which is what I currently get.
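As an illustration of the desired behaviour, here is a small sketch that drops every leaf whose part-of-speech label is -NONE-. It parses the example with nltk.Tree.fromstring and filters the (word, tag) pairs returned by Tree.pos(). The function name words_without_traces is hypothetical, and three closing brackets have been appended to the example so that the string is a complete tree.

```python
from nltk.tree import Tree


def words_without_traces(tree, empty_label="-NONE-"):
    """Return the leaves of `tree`, skipping those tagged `empty_label`."""
    # Tree.pos() yields (word, tag) pairs, where tag is the label of the
    # node directly above the word.
    return [word for word, tag in tree.pos() if tag != empty_label]


# The example sentence from above, with closing brackets added.
example = Tree.fromstring("""
(S (NP-SBJ (-NONE- *-0))
   (VP (TO to)
       (VP (VB sell)
           (NP (NP (PRP$#0 its) (NN TV) (NN station))
               (NN advertising)
               (NN representation)
               (NN operation)
               (CC and)
               (NN program)
               (NN production)
               (NN unit)))))
""")

sentence = " ".join(words_without_traces(example))
print(sentence)
```

The same filter could presumably be applied to a whole corpus reader by iterating over its tagged_sents() and keeping only the words whose tag is not -NONE-, assuming the reader leaves the tags unmapped.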