I'm learning to use the NLTK package in Python. In particular, I need the Penn Treebank dataset in NLTK. As far as I know, calling nltk.download('treebank') gives me only about 5% of the dataset.
However, I have the complete dataset as a tar.gz file and I want to use that instead. The documentation here says:
If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb package, and in the directory nltk_data/corpora/ptb place the BROWN and WSJ directories of the Treebank installation (symlinks work as well). Then use the ptb module instead of treebank:
So I opened Python from the terminal, imported nltk, and ran nltk.download('ptb'). This created a ptb directory under my ~/nltk_data directory, so I now have ~/nltk_data/corpora/ptb. As suggested in the link above, I put my dataset folder inside it. This is the resulting directory hierarchy:
$: pwd
~/nltk_data/corpora/ptb/WSJ
$: ls
00 02 04 06 08 10 12 14 16 18 20 22 24
01 03 05 07 09 11 13 15 17 19 21 23 merge.log
Each of the folders from 00 to 24 contains many .mrg files, such as wsj_0001.mrg, wsj_0002.mrg, and so on.
Now, back to my question. According to the same documentation, I should be able to obtain the file IDs by writing the following:
>>> from nltk.corpus import ptb
>>> print(ptb.fileids()) # doctest: +SKIP
['BROWN/CF/CF01.MRG', 'BROWN/CF/CF02.MRG', 'BROWN/CF/CF03.MRG', 'BROWN/CF/CF04.MRG', ...]
Unfortunately, when I run print(ptb.fileids()), I get an empty list:
>>> print(ptb.fileids())
[]
Can anyone help me?
EDIT: here is the content of my ptb directory and the beginning of the allcats.txt file:
$: pwd
~/nltk_data/corpora/ptb
$: ls
allcats.txt WSJ
$: cat allcats.txt
WSJ/00/WSJ_0001.MRG news
WSJ/00/WSJ_0002.MRG news
WSJ/00/WSJ_0003.MRG news
WSJ/00/WSJ_0004.MRG news
WSJ/00/WSJ_0005.MRG news
and so on ..
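
One thing I notice is that allcats.txt lists upper-case names (WSJ_0001.MRG) while my files on disk are lower-case (wsj_0001.mrg). I don't know whether NLTK matches these names case-sensitively, so the following is only a diagnostic sketch, not a fix. It lists the allcats.txt entries that have no exact, case-sensitive match among the files on disk. The helper name unmatched_allcats_entries is my own, and the path assumes the layout shown above.

```python
import os


def unmatched_allcats_entries(ptb_root):
    """Return file IDs from allcats.txt with no exact (case-sensitive) match on disk."""
    # Collect every file under ptb_root as a '/'-separated relative path.
    on_disk = set()
    for dirpath, _dirnames, filenames in os.walk(ptb_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), ptb_root)
            on_disk.add(rel.replace(os.sep, "/"))

    # Each non-empty line of allcats.txt is "<file id> <category>".
    unmatched = []
    with open(os.path.join(ptb_root, "allcats.txt"), encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] not in on_disk:
                unmatched.append(parts[0])
    return unmatched


# Assumed location, taken from the directory listing in this question.
root = os.path.expanduser("~/nltk_data/corpora/ptb")
if os.path.isfile(os.path.join(root, "allcats.txt")):
    unmatched = unmatched_allcats_entries(root)
    print(len(unmatched), "allcats.txt entries have no exact match on disk")
    print(unmatched[:5])
```

If this reports every entry as unmatched, the difference in case between the listed names and my actual file names would be the first thing to look into.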