Text mining with PHP

Question

I'm doing a project for a college class I'm taking.

I'm using PHP to build a simple web app that classifies tweets as "positive" (or happy) and "negative" (or sad) based on a set of dictionaries. The algorithm I'm thinking of right now is a Naive Bayes classifier or a decision tree.

However, I can't find any PHP library that helps me do some serious language processing. Python has NLTK (http://www.nltk.org). Is there anything like that for PHP?

I'm planning to use WEKA as the back end of the web app (by calling WEKA on the command line from within PHP), but that doesn't seem very efficient.

Do you have any idea what I should use for this project? Or should I just switch to Python?

Thanks

Naive Bayesian classifiers are not really difficult to write yourself if you understand the basic principles. You could actually do everything in PHP that way. San Jacinto already covered everything I'd have said about the NLP part. One other thing I can tell you from a similar project I did just a couple of weeks ago is that sentiment classification using the standard bag-of-words approach doesn't really work very well. I didn't try anything like n-grams, though... I do have the feeling that they'd perform better, but of course that would give you tons of additional dimensions... — Jan Krüger, May 07 '10 at 07:33
There is no indication whatsoever in either your post or the one you linked to as to why this is a fitting solution. — San Jacinto, May 06 '10 at 23:38
PEAR's Text_LanguageDetect can identify 52 human languages from text samples and return confidence scores for each. Isn't this an interesting option to take into account? — nuqqsa, May 07 '10 at 16:28
@nuqqsa The question is about sentiment analysis, not language identification, and it asks for PHP, not Python. — jogojapan, Oct 28 '12 at 03:19
Take a look at this link to an article on Bayesian opinion mining on php/ir http://phpir.com/bayesian-opinion-mining It's a site that's well worth bookmarking — Mark Baker, May 07 '10 at 07:43

score 9 · Accepted Answer · answered May 06 '10 at 17:30

If you're going to be using a Naive Bayes classifier, you don't really need a whole ton of NL processing. All you'll need is an algorithm to stem the words in the tweets and if you want, remove stop words.

Stemming algorithms abound and aren't difficult to code. Removing stop words is just a matter of searching a hash map or something similar. I don't see a justification to switch your development platform to accommodate the NLTK, although it is a very nice tool.
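
To make the two steps above concrete, here is a minimal sketch assuming PHP 7 or later. The function names, the tiny stop-word list and the suffix rules are all made up for illustration; the "stemmer" is a crude suffix stripper, not a real Porter stemmer, and the tokenizer only handles ASCII letters.

```php
<?php
// Hypothetical helper names; the stop-word list and suffix rules below are
// illustrative assumptions, not a standard list or a real Porter stemmer.

// Stop words are stored as array keys so each check is a hash lookup via isset().
$stopWords = array_flip(['a', 'an', 'the', 'is', 'am', 'are', 'i', 'to', 'of', 'and', 'this']);

// Very crude suffix stripping; swap in a proper stemmer for real use.
function crudeStem(string $word): string
{
    foreach (['ing', 'ed', 'es', 's'] as $suffix) {
        $stemLength = strlen($word) - strlen($suffix);
        // Only strip when at least three characters of stem remain.
        if ($stemLength >= 3 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, $stemLength);
        }
    }
    return $word;
}

// Lowercase, split on anything that is not an ASCII letter or apostrophe,
// drop stop words, then stem whatever is left.
function preprocess(string $tweet, array $stopWords): array
{
    $tokens = preg_split("/[^a-z']+/", strtolower($tweet), -1, PREG_SPLIT_NO_EMPTY);
    $kept = [];
    foreach ($tokens as $token) {
        if (!isset($stopWords[$token])) {
            $kept[] = crudeStem($token);
        }
    }
    return $kept;
}

print_r(preprocess('I am loving this sunny day', $stopWords));
```

With this toy list, "I", "am" and "this" are dropped and "loving" is cut down to "lov". Real tweets would also need handling for URLs, hashtags and mentions, which this sketch simply splits apart.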

score 5 · Answer 2 · answered Jan 26 '12 at 20:04

I did a very similar project a while ago - only classifying RSS news items instead of tweets - also using PHP for the front-end and WEKA for the back-end. I used PHP/Java Bridge, which was relatively simple to use - a couple of lines added to your Java (WEKA) code and it allows your PHP to call its methods. Here's an example of the PHP-side code from their website:

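<?php
// Load the bridge client from the running JavaBridge server.
// Including a remote file like this requires allow_url_include=On in php.ini.
require_once("http://localhost:8087/JavaBridge/java/Java.inc");

// Instantiate the Java class HelloWorld and call its hello() method from PHP.
$world = new java("HelloWorld");
echo $world->hello(array("from PHP"));
?>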

Then (as someone has already mentioned), you just need to filter out the stop words. Keeping a txt file for this is pretty handy for adding new words (they tend to pile up when you start filtering out irrelevant words and account for typos).
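
As a rough sketch of the text-file idea, assuming PHP 7 or later and a file with one stop word per line: the function name `loadStopWords` is hypothetical, and the temporary file only stands in for a real stop-word file so the example runs on its own.

```php
<?php
// Hypothetical helper: reads one stop word per line into a lookup table.
function loadStopWords(string $path): array
{
    if (!is_readable($path)) {
        return [];
    }
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    $words = [];
    foreach ($lines as $line) {
        // Normalize case and whitespace so hand-edited files stay forgiving.
        $word = strtolower(trim($line));
        if ($word !== '') {
            $words[$word] = true; // word as key, so isset() is a hash lookup
        }
    }
    return $words;
}

// Demo only: write a throwaway file in place of a real stop-word list.
$path = tempnam(sys_get_temp_dir(), 'stop');
file_put_contents($path, "the\nThis\n\n  rt \n");
$stopWords = loadStopWords($path);
unlink($path);

// Keep only the tokens that are not in the stop-word table.
$tokens = ['rt', 'this', 'movie', 'rocks'];
$filtered = array_values(array_filter($tokens, function ($token) use ($stopWords) {
    return !isset($stopWords[$token]);
}));
print_r($filtered);
```

Adding a new stop word is then just a matter of appending a line to the file, with no code change needed.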

The Naive Bayes model makes a strong feature-independence assumption, i.e. it doesn't account for words that are commonly paired (such as an idiom or phrase) and just treats each word as an independent occurrence. However, it can outperform some of the more complex methods (such as word-stemming, IIRC) and should be perfect for a college class without making it needlessly complex.
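
To show how little code the independence assumption leaves you with, here is a minimal multinomial Naive Bayes sketch, assuming PHP 7.1 or later. The class name, method names and training data are invented for illustration, and the add-one (Laplace) smoothing is my choice, not something prescribed above.

```php
<?php
// A minimal multinomial Naive Bayes sketch; all names here are hypothetical.
class NaiveBayes
{
    private $wordCounts = []; // label => [word => count]
    private $totalWords = []; // label => number of tokens seen for that label
    private $docCounts = [];  // label => number of training documents
    private $vocabulary = []; // word => true, across all labels

    public function train(array $tokens, string $label): void
    {
        if (!isset($this->docCounts[$label])) {
            $this->docCounts[$label] = 0;
            $this->totalWords[$label] = 0;
            $this->wordCounts[$label] = [];
        }
        $this->docCounts[$label]++;
        foreach ($tokens as $token) {
            $this->wordCounts[$label][$token] = ($this->wordCounts[$label][$token] ?? 0) + 1;
            $this->totalWords[$label]++;
            $this->vocabulary[$token] = true;
        }
    }

    public function classify(array $tokens): string
    {
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $bestLabel = '';
        $bestScore = -INF;
        foreach ($this->docCounts as $label => $docs) {
            // log P(label) plus the sum of log P(word | label);
            // each word contributes independently of its neighbours.
            $score = log($docs / $totalDocs);
            foreach ($tokens as $token) {
                $count = $this->wordCounts[$label][$token] ?? 0;
                // Add-one smoothing so an unseen word does not zero out the score.
                $score += log(($count + 1) / ($this->totalWords[$label] + $vocabSize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = (string) $label;
            }
        }
        return $bestLabel;
    }
}

// Toy training data, made up for the example.
$classifier = new NaiveBayes();
$classifier->train(['love', 'great', 'day'], 'positive');
$classifier->train(['happy', 'love'], 'positive');
$classifier->train(['hate', 'awful', 'day'], 'negative');
$classifier->train(['sad', 'hate'], 'negative');

echo $classifier->classify(['love', 'day']), "\n";
echo $classifier->classify(['hate', 'sad']), "\n";
```

Scores are summed as logarithms rather than multiplied as raw probabilities, which avoids floating-point underflow on longer texts. Feeding in word pairs (bigrams) as extra tokens would be one way to recover some of the phrase information the model otherwise ignores, at the cost of a much larger vocabulary.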

score 2 · Answer 3 · answered May 07 '10 at 12:06

You can also use the uClassify API to do something similar to Naive Bayes. You basically train a classifier as you would with any algorithm (except here you're doing it via the web interface or by sending XML documents to the API). Then whenever you get a new tweet (or batch of tweets), you call the API to have it classify them. It's fast and you don't have to worry about tuning it. Of course, that means you lose the flexibility you get by controlling the classifier yourself, but that also means less work for you if that in itself is not the goal of the class project.

score 1 · Answer 4 · answered Mar 11 '13 at 22:03

You can check out this library, which is very straightforward: https://github.com/Dachande663/PHP-Classifier

Yehia A.Salam

score 1 · Answer 5 · answered Jan 26 '12 at 07:46

Try OpenCalais - http://viewer.opencalais.com/ . It has an API, PHP classes and more. LingPipe may also suit this task - http://alias-i.com/lingpipe/index.html

Tirumal Rao

The former is a web interface, not a library (_if_ there is a library, too, please provide a link to that). The latter is a library, but for Java, not PHP. – jogojapan Oct 28 '12 at 03:20

score 0 · Answer 6 · answered Jul 21 '11 at 07:54

You can also use Thrift or Gearman to call NLTK from PHP.

Omar
