Questions tagged [language-model]

266 questions
21
votes
4 answers

word2vec - what is best? add, concatenate or average word vectors?

I am working on a recurrent language model. To learn word embeddings that can be used to initialize my language model, I am using gensim's word2vec model. After training, the word2vec model holds two vectors for each word in the vocabulary: the…
Lemon
  • 1,394
  • 3
  • 14
  • 24
20
votes
5 answers

How to compute skipgrams in python?

A k-skipgram is an n-gram that is a superset of all n-grams and of each (k-i)-skipgram until (k-i)==0 (which includes 0-skipgrams). So how can I efficiently compute these skipgrams in Python? Following is the code I tried, but it is not doing as…
stackit
  • 3,036
  • 9
  • 34
  • 62
18
votes
2 answers

Character-Word Embeddings from lm_1b in Keras

I would like to use some pre-trained word embeddings in a Keras NN model, which have been published by Google in a very well known article. They have provided the code to train a new model, as well as the embeddings here. However, it is not clear…
chase
  • 3,592
  • 8
  • 37
  • 58
18
votes
3 answers

ARPA language model documentation

Where can I find documentation on the ARPA language model format? I am developing a simple speech recognition app with the pocket-sphinx STT engine. ARPA is recommended there for performance reasons. I want to understand how much I can do to adjust my…
Lukasz
  • 19,816
  • 17
  • 83
  • 139
17
votes
2 answers

Building openears compatible language model

I am doing some development on speech to text and text to speech, and I found the OpenEars API very useful. The principle of this cmu-slm based API is that it uses a language model to map the speech heard by the iPhone device. So I decided to find a…
harshalb
  • 6,012
  • 13
  • 56
  • 92
14
votes
2 answers

Creating ARPA language model file with 50,000 words

I want to create an ARPA language model file with nearly 50,000 words. I can't generate the language model by passing my text file to the CMU Language Tool. Is there any other link available where I can get a language model for this many words?
Vipin
  • 4,718
  • 12
  • 54
  • 81
12
votes
1 answer

TensorFlow Embedding Lookup

I am trying to learn how to build an RNN for speech recognition using TensorFlow. As a start, I wanted to try out some example models put up on the TensorFlow page, TF-RNN. As per what was advised, I took some time to understand how word IDs are…
VM_AI
  • 1,132
  • 4
  • 13
  • 25
11
votes
2 answers

NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is: import sys sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk") from nltk.corpus import brown from nltk.model import NgramModel from…
Ana_Sam
  • 469
  • 2
  • 4
  • 12
10
votes
2 answers

Python interface to ARPA files

I'm looking for a pythonic interface to load ARPA files (back-off language models) and use them to evaluate some text, e.g. get its log-probability, perplexity etc. I don't need to generate the ARPA file in Python, only to use it for querying. Does…
Beka
  • 725
  • 6
  • 22
8
votes
1 answer

calculate perplexity in pytorch

I've just trained an LSTM language model using pytorch. The main body of the class is this: class LM(nn.Module): def __init__(self, n_vocab, seq_size, embedding_size, …
P.Alipoor
  • 178
  • 1
  • 2
  • 11
8
votes
5 answers

Which model (GPT2, BERT, XLNet, etc.) would you use for a text classification task? Why?

I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using Bert and GPT2 for text classification tasks.…
khemedi
  • 774
  • 3
  • 9
  • 19
7
votes
2 answers

Pretraining a language model on a small custom corpus

I was curious if it is possible to use transfer learning in text generation, and re-train/pre-train it on a specific kind of text. For example, having a pre-trained BERT model and a small corpus of medical (or any "type") text, make a language…
6
votes
1 answer

Using custom beam scorer in TensorFlow CTC (language model)

Is it possible to customize the beam scorer in the TensorFlow CTC implementation from the Python side? I see this possibility in a comment for the CTCBeamSearchDecoder C++ class constructor, but wonder how to provide this functionality for Python users. Specific issue…
Maksym Diachenko
  • 552
  • 1
  • 4
  • 11
5
votes
0 answers

Starcoder finetuning - How to select the GPU and how to estimate the time it will take to finetune

I'd like to finetune Starcoder (https://huggingface.co/bigcode/starcoder) on my dataset and on a GCP VM instance. It says in the documentation that for training the model, they used 512 Tesla A100 GPUs and it took 24 days. I also saw the model…
5
votes
0 answers

Is there a particular range for good perplexity value in NLP?

I'm fine-tuning a language model and am calculating training and validation losses along with the training and validation perplexities. In my program, perplexity is calculated by taking the exponential of the loss. I'm aware that lower perplexities represent…