Questions tagged [language-model]
266 questions
21
votes
4 answers
word2vec - what is best? add, concatenate or average word vectors?
I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model.
After training, the word2vec model holds two vectors for each word in the vocabulary: the…

Lemon
- 1,394
- 3
- 14
- 24
20
votes
5 answers
How to compute skipgrams in python?
A k-skipgram is an n-gram that is a superset of all n-grams and of each (k-i)-skipgram until (k-i)==0 (which includes 0-skipgrams). So how can these skipgrams be computed efficiently in Python?
Following is the code I tried, but it is not doing as…
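For readers skimming this excerpt, here is a minimal sketch of one common reading of the definition: a k-skip-n-gram keeps n tokens in their original order while skipping at most k tokens in total. It is not the asker's code (which is cut off above); the function name `skipgrams` and its parameters are assumptions chosen for illustration.

```python
from itertools import combinations


def skipgrams(tokens, n, k):
    """Yield every n-gram that keeps token order and skips at most k tokens in total."""
    tokens = list(tokens)
    for start in range(len(tokens) - n + 1):
        # The other n-1 tokens must come from the next n-1+k positions,
        # so the whole n-gram spans at most n+k tokens.
        window = tokens[start + 1 : start + n + k]
        for rest in combinations(window, n - 1):
            yield (tokens[start],) + rest


if __name__ == "__main__":
    sentence = "insurgents killed in ongoing fighting".split()
    for gram in skipgrams(sentence, n=2, k=2):
        print(gram)
```

With k=0 this reduces to ordinary n-grams. NLTK also ships `nltk.util.skipgrams(sequence, n, k)`, which may be preferable if NLTK is already a dependency.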

stackit
- 3,036
- 9
- 34
- 62
18
votes
2 answers
Character-Word Embeddings from lm_1b in Keras
I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well-known article. They have provided the code to train a new model, as well as the embeddings here.
However, it is not clear…

chase
- 3,592
- 8
- 37
- 58
18
votes
3 answers
ARPA language model documentation
Where can I find documentation on ARPA language model format?
I am developing a simple speech recognition app with the pocket-sphinx STT engine. ARPA is recommended there for performance reasons.
I want to understand how much I can do to adjust my…

Lukasz
- 19,816
- 17
- 83
- 139
17
votes
2 answers
Building openears compatible language model
I am doing some development on speech to text and text to speech and I found the OpenEars API very useful.
This cmu-slm based API works by using a language model to map the speech heard by the iPhone device. So I decided to find a…

harshalb
- 6,012
- 13
- 56
- 92
14
votes
2 answers
Creating ARPA language model file with 50,000 words
I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words?

Vipin
- 4,718
- 12
- 54
- 81
12
votes
1 answer
TensorFlow Embedding Lookup
I am trying to learn how to build an RNN for speech recognition using TensorFlow. As a start, I wanted to try out some example models posted on the TensorFlow page TF-RNN.
As advised, I took some time to understand how word IDs are…

VM_AI
- 1,132
- 4
- 13
- 25
11
votes
2 answers
NLTK package to estimate the (unigram) perplexity
I am trying to calculate the perplexity for the data I have. The code I am using is:
import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")
from nltk.corpus import brown
from nltk.model import NgramModel
from…

Ana_Sam
- 469
- 2
- 4
- 12
10
votes
2 answers
Python interface to ARPA files
I'm looking for a pythonic interface to load ARPA files (back-off language models) and use them to evaluate some text, e.g. get its log-probability, perplexity etc.
I don't need to generate the ARPA file in Python, only to use it for querying.
Does…

Beka
- 725
- 6
- 22
8
votes
1 answer
calculate perplexity in pytorch
I've just trained an LSTM language model using PyTorch. The main body of the class is this:
class LM(nn.Module):
    def __init__(self, n_vocab,
                 seq_size,
                 embedding_size,
                 …

P.Alipoor
- 178
- 1
- 2
- 11
8
votes
5 answers
Which model (GPT2, BERT, XLNet, etc.) would you use for a text classification task? Why?
I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT2 for text classification tasks.…

khemedi
- 774
- 3
- 9
- 19
7
votes
2 answers
Pretraining a language model on a small custom corpus
I was curious if it is possible to use transfer learning in text generation, and re-train/pre-train it on a specific kind of text.
For example, having a pre-trained BERT model and a small corpus of medical (or any "type") text, make a language…

ysig
- 447
- 4
- 18
6
votes
1 answer
Using custom beam scorer in TensorFlow CTC (language model)
Is it possible to customize the beam scorer in the TensorFlow CTC implementation from the Python side? I see this possibility in a comment for the CTCBeamSearchDecoder C++ class constructor, but I wonder how to provide this functionality for Python users.
Specific issue…

Maksym Diachenko
- 552
- 1
- 4
- 11
5
votes
0 answers
Starcoder finetuning - How to select the GPU and how to estimate the time it will take to finetune
I'd like to finetune Starcoder (https://huggingface.co/bigcode/starcoder) on my dataset and on a GCP VM instance.
The documentation says that they used 512 Tesla A100 GPUs to train the model and that it took 24 days.
I also saw the model…

Aadesh
- 403
- 3
- 13
5
votes
0 answers
Is there a particular range for good perplexity value in NLP?
I'm fine-tuning a language model and am calculating training and validation losses along with the training and validation perplexities. In my program, perplexity is calculated by taking the exponential of the loss. I'm aware that lower perplexities represent…
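The "exponential of the loss" relationship mentioned in this excerpt can be sketched as follows. This is a minimal illustration, assuming the loss is a per-token negative log-likelihood measured with natural logarithms (as a typical cross-entropy loss is); the function name `perplexity` is hypothetical and not taken from the asker's program.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood per token), assuming natural logs."""
    if not token_nlls:
        raise ValueError("need at least one per-token loss value")
    return math.exp(sum(token_nlls) / len(token_nlls))


if __name__ == "__main__":
    # A model that assigns probability 1/4 to every token it sees.
    losses = [math.log(4)] * 10
    print(perplexity(losses))
```

Under these assumptions perplexity is never below 1 (a loss of 0), and a model that guesses uniformly over a vocabulary of V words scores V, so what counts as a "good" value depends on the vocabulary and the tokenization.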

Dilrukshi Perera
- 917
- 3
- 17
- 31