Questions tagged [text-analysis]

Text analysis is an area of study that uses linguistic, statistical, and machine learning tools to analyze a text and extract high-quality information from it.

429 questions
81
votes
4 answers

Stemmers vs Lemmatizers

Natural Language Processing (NLP), especially for English, has evolved into the stage where stemming would become an archaic technology if "perfect" lemmatizers exist. It's because stemmers change the surface form of a word/token into some…
alvas
  • 115,346
  • 109
  • 446
  • 738
72
votes
4 answers

How to extract common / significant phrases from a series of text entries

I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally, not enforcing word-for-word matching). My example is any review on Yelp.com,…
arronsky
  • 721
  • 1
  • 6
  • 3
57
votes
6 answers

Training data for sentiment analysis

Where can I get a corpus of documents that have already been classified as positive/negative for sentiment in the corporate domain? I want a large corpus of documents that provide reviews for companies, like reviews of companies provided by analysts…
London guy
  • 27,522
  • 44
  • 121
  • 179
23
votes
1 answer

How to find common phrases in a large body of text

I'm working on a project at the moment where I need to pick out the most common phrases in a huge body of text. For example say we have three sentences like the following: The dog jumped over the woman. The dog jumped into the car. The dog jumped…
benmcredmond
  • 1,702
  • 2
  • 15
  • 22
20
votes
3 answers

How to remove stopwords efficiently from a list of ngram tokens in R

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram triggers removal. I'd very much like to have one…
Ken Benoit
  • 14,454
  • 27
  • 50
17
votes
2 answers

Tag generation from small text content (such as tweets)

I have already asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets such as user tweets to generate tags (keywords). And it seems like the accepted suggestion (point-wise mutual information…
Hellnar
  • 62,315
  • 79
  • 204
  • 279
15
votes
5 answers

Error using langdetect in Python: "No features in text"

Hey, I have a CSV with multilingual text. All I want is a column appended with the language detected. So I coded as below: from langdetect import detect import csv with open('C:\\Users\\dell\\Downloads\\stdlang.csv') as csvinput: with…
user7140275
  • 215
  • 1
  • 3
  • 9
15
votes
3 answers

Use brain.js neural network to do text analysis

I'm trying to do some text analysis to determine if a given string is... talking about politics. I'm thinking I could create a neural network where the input is either a string or a list of words (ordering might matter?) and the output is whether…
Andrew Rasmussen
  • 14,912
  • 10
  • 45
  • 81
15
votes
1 answer

Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result); I've got a directory (which is my working directory) with a couple text…
cforster
  • 577
  • 2
  • 7
  • 19
15
votes
1 answer

Very simple text classification by machine learning?

Possible Duplicate: Text Classification into Categories I am currently working on a solution to get the type of food served in a database with 10k restaurants based on their description. I'm using lists of keywords to decide which kind of food is…
Dieter
  • 441
  • 1
  • 5
  • 15
14
votes
5 answers

Check if a string is a possible abbreviation for a name

I'm trying to develop a Python algorithm to check if a string could be an abbreviation for another word. For example, fck is a match for fc kopenhavn because it matches the first characters of the word. fhk would not match. fco should not match fc…
Björn Lindqvist
  • 19,221
  • 20
  • 87
  • 122
14
votes
4 answers

Extract words from PDF with golang?

I don't understand type conversion. I know this isn't right, all I get is a bunch of hieroglyphs. f, _ := os.Open("test.pdf") defer f.Close() io.Copy(os.Stdout, f) I want to work with the strings....
omgj
  • 1,369
  • 3
  • 12
  • 18
13
votes
2 answers

How to combine TFIDF features with other features

I have a classic NLP problem, I have to classify a news as fake or real. I have created two sets of features: A) Bigram Term Frequency-Inverse Document Frequency B) Approximately 20 Features associated to each document obtained using pattern.en…
Massifox
  • 4,369
  • 11
  • 31
13
votes
2 answers

Wordcloud is cropping text

I am using twitter API to generate sentiments. I am trying to generate a word-cloud based on tweets. Here is my code to generate a wordcloud wordcloud(clean.tweets, random.order=F,max.words=80, col=rainbow(50), scale=c(3.5,1)) Result for this: I…
Harsh Shah
  • 2,162
  • 2
  • 19
  • 39
13
votes
3 answers

Java text analysis libraries

I'm looking for a Java-driven solution to a requirement for analysing sentences to log whether a key word was used positively or negatively. For example, the key word might be 'cabbages' and the sentence: 'I like cabbages but not peas'. And I'd like a Java…
jaseFace
  • 1,415
  • 5
  • 22
  • 34