Questions tagged [penn-treebank]

The Penn Treebank Project annotates text for linguistic structure using Treebank II bracketing.

The Penn Treebank Project is located at the University of Pennsylvania.

The Penn Treebank Project annotates naturally occurring text for linguistic structure. Most notably, we produce skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. We also annotate text with part-of-speech tags, and for the Switchboard corpus of telephone conversations, dysfluency annotation.

15 questions
10
votes
1 answer

How can I use the complete Penn Treebank dataset in Python/NLTK?

I'm trying to learn to use the NLTK package in Python. In particular, I need to use the Penn Treebank dataset in NLTK. As far as I know, if I call nltk.download('treebank') I get 5% of the dataset. However, I have the complete dataset in a tar.gz file…
zwlayer
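One possible first step for the question above is unpacking only the bracketed parse files from the archive. This is a minimal sketch using the standard library; the `.mrg` extension, the archive name, and the target directory are assumptions about how the asker's copy is laid out, not details given in the question.

```python
import tarfile
from pathlib import Path


def extract_mrg_files(archive_path, target_dir):
    """Extract only the bracketed parse files (*.mrg) from a .tar.gz archive.

    Returns the sorted member names that were extracted.
    Assumption: parse files end in ".mrg", as in common Treebank releases.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = [
            member
            for member in archive.getmembers()
            if member.isfile() and member.name.lower().endswith(".mrg")
        ]
        archive.extractall(path=target, members=members)
    return sorted(member.name for member in members)


# Hypothetical usage (both paths are placeholders):
# names = extract_mrg_files("treebank.tar.gz", "ptb_extracted")
```

Once the files are on disk, the directory can be handed to a bracketed-corpus reader; NLTK's `BracketParseCorpusReader(root, fileids)` is one candidate, though the exact file pattern depends on the release.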
9
votes
2 answers

Calculating perplexity when training an LSTM on the Penn Treebank

I'm implementing language model training on the Penn Treebank. I'm adding the loss for each timestep and then calculating perplexity. This gives me a nonsensically high perplexity of hundreds of billions even after training for a while. Loss itself decreases…
ytrewq
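The excerpt does not show the asker's code, so the cause is unknown, but one common source of astronomically large values is exponentiating the summed loss instead of the per-token average. A minimal sketch, assuming the per-timestep losses are negative log-likelihoods in nats (the function name and sample values are illustrative):

```python
import math


def perplexity(token_losses):
    """Perplexity from per-token negative log-likelihoods (natural log).

    Perplexity is exp of the MEAN loss per token, not exp of the sum.
    """
    if not token_losses:
        raise ValueError("need at least one token loss")
    mean_nll = sum(token_losses) / len(token_losses)
    return math.exp(mean_nll)


# 35 timesteps, each with a loss of ln(100) nats.
losses = [math.log(100.0)] * 35
print(perplexity(losses))      # roughly 100
print(math.exp(sum(losses)))   # exp of the summed loss: vastly larger
```

If the losses are in bits rather than nats, the base changes from e to 2.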
3
votes
0 answers

Determine what tree bank type can come next

I am using Apache OpenNLP and its POSTaggerME. I have it breaking down words into their Penn Treebank tag set values. Is there any functionality out there (doesn't have to be in Apache OpenNLP) that lets you know what kind of word can come next using the…
user489041
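One simple way to approach the question above is to count which tags follow which in an already tagged corpus. This is a sketch of tag-bigram counting, not a feature of any particular toolkit; the function names and the toy sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def tag_transitions(tagged_sentences):
    """Count how often each tag is immediately followed by each other tag."""
    transitions = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        for current, following in zip(tags, tags[1:]):
            transitions[current][following] += 1
    return transitions


def likely_next_tags(transitions, tag, top_n=3):
    """Return up to top_n tags observed after `tag`, most frequent first."""
    return [t for t, _count in transitions.get(tag, Counter()).most_common(top_n)]


# Toy corpus of (word, tag) pairs; a real one would be far larger.
corpus = [
    [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("A", "DT"), ("big", "JJ"), ("dog", "NN"), ("barks", "VBZ")],
    [("The", "DT"), ("dog", "NN"), ("runs", "VBZ")],
]
transitions = tag_transitions(corpus)
print(likely_next_tags(transitions, "DT"))
```

Counts like these only say what was observed in the data; a tag pair that never occurs in the corpus is not necessarily ungrammatical.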
2
votes
0 answers

Extracting Function Tags from Parsed Sentence (using Stanford Parser)

Looking at the Penn Treebank tagset (http://web.mit.edu/6.863/www/PennTreebankTags.html#RB) there is a section called "Function Tags" that would be extremely helpful for a project I am working on. I know the Stanford Parser uses the Penn Treebank…
jdsto
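In Treebank-style bracketing, function tags are attached to the category with hyphens, as in `NP-SBJ`. A minimal sketch of splitting such a label, assuming the labels in hand actually carry function tags (parser output may not include them); `split_label` is a hypothetical helper, not part of any parser's API.

```python
def split_label(label):
    """Split a Treebank node label into (category, function_tags).

    Numeric suffixes (co-indexing such as "-1" or "=2") are dropped.
    Labels that start with "-" (e.g. -NONE-, -LRB-) are returned unchanged.
    """
    if label.startswith("-"):
        return label, []
    base = label.split("=")[0]
    category, *rest = base.split("-")
    function_tags = [part for part in rest if not part.isdigit()]
    return category, function_tags


for example in ["NP-SBJ-1", "PP-LOC-CLR", "NP", "-NONE-"]:
    print(example, split_label(example))
```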
2
votes
4 answers

How to reduce the number of POS tags in Penn Treebank? - NLTK (Python)

I used NLTK for part-of-speech tagging. It uses the 36 Penn Treebank tags. I want to reduce the number of tags to 6: "noun, verb, adjective, adverb, preposition, conjunction". How should I do so? Is there a specific function, attribute, or command?
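One way to collapse the tag set is a prefix-based mapping applied after tagging. The grouping below is an assumption made for illustration (for example, IN also covers subordinating conjunctions in the Penn Treebank, and anything outside the six classes falls into "other"); it is a sketch, not a built-in NLTK feature.

```python
# Prefix -> coarse class. Every Penn Treebank tag that starts with one of
# these prefixes is folded into the class (NN, NNS, NNP, NNPS -> noun, ...).
PREFIX_TO_COARSE = [
    ("NN", "noun"),
    ("VB", "verb"),
    ("JJ", "adjective"),
    ("RB", "adverb"),
    ("IN", "preposition"),   # note: IN also marks subordinating conjunctions
    ("CC", "conjunction"),
]


def coarse_tag(ptb_tag):
    """Map a Penn Treebank POS tag to one of six coarse classes or 'other'."""
    for prefix, coarse in PREFIX_TO_COARSE:
        if ptb_tag.startswith(prefix):
            return coarse
    return "other"


tagged = [("The", "DT"), ("quick", "JJ"), ("foxes", "NNS"),
          ("jumped", "VBD"), ("over", "IN"), ("fences", "NNS"),
          ("and", "CC"), ("ran", "VBD"), ("quickly", "RB")]
reduced = [(word, coarse_tag(tag)) for word, tag in tagged]
print(reduced)
```

The input here is a plain list of (word, tag) pairs, which is the shape a tagger such as `nltk.pos_tag` returns.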
1
vote
1 answer

Syntactical error when yacc file is called

I am trying to build an XTAG parser from source. The relevant files can be fetched from ftp://ftp.cis.upenn.edu/pub/xtag/lem. I understand that this particular TAG parser is decades old and there are plenty of newer options, but I need this specific…
aram10
1
vote
0 answers

How to extract the keywords on which Universal Sentence Encoder was trained?

I am using Universal Sentence Encoder to encode some documents into 512-dimensional embeddings. These are then used to find similar items to a search query which is also encoded using USE. USE works pretty well on general English words in search…
1
vote
1 answer

How to convert from column-based CoNLL format to the Penn Treebank annotation style?

Does anybody know about any tool, script, etc. to convert from column-based CoNLL format to the Penn Treebank annotation style?
Tropin
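Whether a conversion is possible depends on which CoNLL variant is meant. The sketch below covers only one case and assumes it loudly: each token row carries a word, a POS tag, and a CoNLL-2005/2012-style "parse bit" column (such as `(S(NP*`), where `*` marks the token's position. Dependency-only CoNLL files contain no constituency brackets and would need a different approach. The function name and the row layout are hypothetical.

```python
import re


def conll_to_bracketed(rows):
    """Rebuild a bracketed tree string from (word, pos, parse_bit) rows.

    Assumption: parse_bit uses "*" as the placeholder for the token,
    e.g. "(S(NP*", "*)", "(VP*))".
    """
    pieces = []
    for word, pos, parse_bit in rows:
        # Add a space after every opening label: "(S(NP*" -> "(S (NP *"
        spaced = re.sub(r"\(([^()*\s]+)", r"(\1 ", parse_bit)
        # Substitute the token's preterminal for the placeholder.
        pieces.append(spaced.replace("*", f"({pos} {word})"))
    return " ".join(pieces)


rows = [
    ("The", "DT", "(S(NP*"),
    ("cat", "NN", "*)"),
    ("sleeps", "VBZ", "(VP*))"),
]
print(conll_to_bracketed(rows))
# (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
```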
1
vote
1 answer

How to generate sentiment treebank in Stanford NLP

I'm using the Stanford NLP sentiment library for sentiment analysis. Now I want to generate a treebank from a sentence. Input sentence: "Effective but too-tepid biopic". Output treebank: (2 (3 (3 Effective) (2 but)) (1 (1 too-tepid) (2 biopic))). Can…
lknguyen
1
vote
1 answer

Read the complete Penn Treebank dataset from a local directory

I have a complete Penn Treebank dataset and I want to read it using ptb from nltk.corpus. But here it is said that: If you have access to a full installation of the Penn Treebank, NLTK can be configured to load it as well. Download the ptb…
Wasi Ahmad
0
votes
1 answer

Part-of-Speech tagging: what is the difference between known words and unknown words?

I am trying to understand the result evaluation table (table 1) of this paper. Three accuracies are reported: overall, unknown words (UW), and known words (KW), along with the percentage of unknown words (% unk.). Are the known words the data that…
AziZ
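In POS-tagging evaluations, "unknown words" usually means test tokens whose word form never appeared in the training data. The sketch below computes the four figures under that assumption; the paper referenced in the question is not shown here, so its exact definition may differ, and all names and toy data are illustrative.

```python
def accuracy_by_word_type(train_words, test_items):
    """Split tagging accuracy into known and unknown words.

    train_words: iterable of word forms seen in training.
    test_items:  list of (word, gold_tag, predicted_tag) triples.
    A ratio is None when its group is empty.
    """
    vocabulary = set(train_words)
    correct = {"known": 0, "unknown": 0}
    total = {"known": 0, "unknown": 0}
    for word, gold, predicted in test_items:
        group = "known" if word in vocabulary else "unknown"
        total[group] += 1
        correct[group] += int(gold == predicted)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    n_tokens = total["known"] + total["unknown"]
    pct = ratio(total["unknown"], n_tokens)
    return {
        "overall": ratio(correct["known"] + correct["unknown"], n_tokens),
        "known": ratio(correct["known"], total["known"]),
        "unknown": ratio(correct["unknown"], total["unknown"]),
        "pct_unknown": None if pct is None else 100.0 * pct,
    }


train = ["the", "cat", "sleeps"]
test = [("the", "DT", "DT"), ("cat", "NN", "NN"),
        ("zebra", "NN", "JJ"), ("sleeps", "VBZ", "VBZ")]
print(accuracy_by_word_type(train, test))
```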
0
votes
1 answer

Hebrew Stanford NLP tag set

I am trying to find the exact tag set used in the Hebrew treebank used by Stanford NLP. Finding this tag set seems to be harder than finding a POS tagger :) Are there any tools for reading the tag set used for training a (Penn?) treebank?
rubmz
0
votes
1 answer

Entities containing underscore character are split into multiple entities by TokensAnnotation in CoreNLP

I am observing that CoreNLP 3.9.2 has started splitting enti_ties into multiple ones like 'enti', '_', 'ties' while tokenizing. I have tried tokenize.whitespace, which solves this problem. But I think this will stop splitting tokens for…
0
votes
1 answer

How do I train a language model?

I'm trying to train a language model with an LSTM based on the Penn Treebank (PTB) corpus. I was thinking that I should simply train with every bigram in the corpus so that it could predict the next word given previous words, but then it wouldn't be able…
ytrewq
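A common alternative to training on isolated bigrams is to feed the network longer windows of tokens and ask it to predict the next token at every position, so the target sequence is the input shifted by one. The sketch below only prepares such windows; it is an assumed setup for illustration, not the answer given on the original page, and `seq_len` and the toy sentence are arbitrary.

```python
def next_word_examples(tokens, seq_len):
    """Cut a token stream into (inputs, targets) windows.

    targets is inputs shifted left by one token, so position i of the
    target is the word that follows position i of the input.
    """
    examples = []
    for start in range(0, len(tokens) - seq_len, seq_len):
        inputs = tokens[start:start + seq_len]
        targets = tokens[start + 1:start + seq_len + 1]
        examples.append((inputs, targets))
    return examples


tokens = "the cat sat on the mat <eos>".split()
for inputs, targets in next_word_examples(tokens, seq_len=3):
    print(inputs, "->", targets)
```

With windows like these, a recurrent model sees the whole preceding context within the window at each step rather than a single previous word.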
-1
votes
1 answer

Finding the span of each node in an NLTK tree

I am new to NLTK and finding it hard to deal with NLTK trees. Given an NLTK parse tree from the Penn Treebank, I want to be able to count the span of each node recursively from the bottom up. The span of leaf nodes is 1. And the span of non-terminal nodes is…
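The excerpt is cut off before it defines the span of a non-terminal, so the sketch below assumes the usual reading: the number of leaves a node covers. To stay self-contained it uses plain `(label, children)` tuples as a stand-in for an NLTK tree; that representation and the function name are assumptions.

```python
def node_spans(tree):
    """Return (label, span) for every non-leaf node, children before parents.

    A tree is either a string (a leaf, span 1) or a (label, children) tuple.
    The span of a non-leaf node is the sum of its children's spans.
    """
    spans = []

    def visit(node):
        if isinstance(node, str):
            return 1                       # a leaf covers exactly one word
        label, children = node
        width = sum(visit(child) for child in children)
        spans.append((label, width))       # recorded after its children
        return width

    visit(tree)
    return spans


tree = ("S", [
    ("NP", [("DT", ["The"]), ("NN", ["cat"])]),
    ("VP", [("VBZ", ["sleeps"])]),
])
print(node_spans(tree))
# [('DT', 1), ('NN', 1), ('NP', 2), ('VBZ', 1), ('VP', 1), ('S', 3)]
```

With an actual `nltk.Tree`, the same numbers should be obtainable by taking `len(subtree.leaves())` for each subtree yielded by `tree.subtrees()`.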