Questions tagged [nlp]

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NOTE: If you want to use this tag for a question not directly concerning implementation, then consider posting on Data Science, or Artificial Intelligence instead; otherwise you're probably off-topic. Please choose one site only and do not cross-post to more than one - see Is cross-posting a question on multiple Stack Exchange sites permitted if the question is on-topic for each site? (tl;dr: no).

NLP tasks

Beginner books on Natural Language Processing

Popular software packages

20185 questions
464 votes, 18 answers

How does the Google "Did you mean?" Algorithm work?

I've been developing an internal website for a portfolio management tool. There is a lot of text data, company names, etc. I've been really impressed with some search engines' ability to respond very quickly to queries with "Did you mean: xxxx". I…
Andrew Harry
280 votes, 14 answers

How to compute the similarity between two text documents?

I am looking at working on an NLP project, in any programming language (though Python will be my preference). I want to take two documents and determine how similar they are.
Reily Bourne
202 votes, 14 answers

What is the difference between lemmatization vs stemming?

When do I use each? Also, is the NLTK lemmatization dependent on parts of speech? Wouldn't it be more accurate if it were?
TIMEX
197 votes, 18 answers

googletrans stopped working with error 'NoneType' object has no attribute 'group'

I was trying googletrans and it was working quite well. Since this morning I have been getting the error below. I went through multiple posts on Stack Overflow and other sites and found that my IP is probably banned from using the service for some time. I tried…
steveJ
184 votes, 10 answers

Java Stanford NLP: Part of Speech labels?

The Stanford NLP, demo'd here, gives an output like this: Colorless/JJ green/JJ ideas/NNS sleep/VBP furiously/RB ./. What do the Part of Speech tags mean? I am unable to find an official list. Is it Stanford's own system, or are they using…
Nick Heiner
182 votes, 16 answers

How to determine the language of a piece of text?

I want to get this:
Input text: "ру́сский язы́к" Output text: "Russian"
Input text: "中文" Output text: "Chinese"
Input text: "にほんご" Output text: "Japanese"
Input text: "العَرَبِيَّة" Output text: "Arabic"
How can I do it in Python?
Rita
173 votes, 9 answers

What does tf.nn.embedding_lookup function do?

tf.nn.embedding_lookup(params, ids, partition_strategy='mod', name=None) I cannot understand the purpose of this function. Is it like a lookup table? That is, does it return the parameters corresponding to each id (in ids)? For instance, in the…
Poorya Pzm
162 votes, 12 answers

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also…
lizarisk
156 votes, 17 answers

Detecting syllables in a word

I need to find a fairly efficient way to detect syllables in a word. E.g., Invisible -> in-vi-sib-le. There are some syllabification rules that could be used: V, CV, VC, CVC, CCV, CCCV, CVCC (where V is a vowel and C is a consonant). E.g., Pronunciation…
user50705
141 votes, 4 answers

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?

I'm working on a sentiment analysis problem. The data looks like this:
label  instances
5      1190
4      838
3      239
1      204
2      127
So my data is unbalanced, since 1190 instances are labeled with 5. For the classification Im…
133 votes, 6 answers

How does Apple find dates, times and addresses in emails?

In the iOS email client, when an email contains a date, time or location, the text becomes a hyperlink and it is possible to create an appointment or look at a map simply by tapping the link. It not only works for emails in English, but in other…
Martin
129 votes, 33 answers

spacy Can't find model 'en_core_web_sm' on windows 10 and Python 3.5.3 :: Anaconda custom (64-bit)

What is the difference between spacy.load('en_core_web_sm') and spacy.load('en')? This link explains different model sizes, but I am still not clear how spacy.load('en_core_web_sm') and spacy.load('en') differ. spacy.load('en') runs fine for me. But the…
user2543622
128 votes, 1 answer

Difference between constituency parser and dependency parser

What is the difference between a constituency parser and a dependency parser? What are the different usages of the two?
RAVI
122 votes, 6 answers

Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df for the CountVectorizer instance, what exactly does the min/max document frequency mean? Is it the frequency of a word in its particular text file or is it the…
moeabdol
119 votes, 2 answers

Java or Python for Natural Language Processing

I would like to know which programming language is better for natural language processing: Java or Python? I have found lots of questions and answers about it, but I am still lost in choosing which one to use. And I want to know which NLP…
Jin Ling