
I have been working in Java on finding the similarity between two documents. I would prefer to measure semantic similarity, but I haven't made any effort in that direction yet. I am using the following approach:

  1. Extract terms/tokens (I am using JAWS with WordNet to collapse synonyms, which improves the similarity scores)
  2. Build a term-document matrix
  3. Apply LSA
  4. Compute cosine similarity
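
For illustration, here is a minimal sketch of steps 1, 2 and 4 using only the Python standard library. It is not the JAWS/WordNet pipeline described above: the tokenizer is a naive regular expression, synonym handling and the LSA step are left out entirely, and the function names (`tokenize`, `document_term_matrix`, `cosine_similarity`) are hypothetical names chosen for this example. The matrix is stored with one row per document, i.e. the transpose of a term-document matrix.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (naive tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, rows


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock prices fell sharply",
    ]
    vocabulary, rows = document_term_matrix(docs)
    print(cosine_similarity(rows[0], rows[1]))  # roughly 0.75: many shared terms
    print(cosine_similarity(rows[0], rows[2]))  # 0.0: no shared terms
```

Raw term counts only capture word overlap; the LSA step (a truncated SVD of this matrix) would normally be applied before the cosine comparison, typically with a numerical library rather than hand-written code.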

While looking through a few Stack Overflow pages, I came across quite a few links to Python implementations.

I would like to know whether Python is a better language for measuring text similarity, and whether I can compute semantic similarity between two documents in Python.

CTsiddharth
  • Everything you can do in Python, you can also do in Java (with enough work). That said, there exists [Natural Language Toolkit](http://www.nltk.org/) which is a Python library that provides a lot of tools for natural language processing. – Greg Hewgill Feb 13 '12 at 04:57

1 Answer


Assuming no platform restriction constrains your choice of language, choose based on what you're most comfortable with (I prefer Python myself) and on which language has the best libraries for your application. As @GregHewgill pointed out, Python's Natural Language Toolkit is mature and comprehensive.

So while I personally would choose Python, it's really something you have to choose for yourself.

== EDIT ==

This question about Java NLP libraries might help you decide if you can use Java for your analysis; the top answer has a list you can investigate. Without more information about your problem set, I can't provide more specific advice.

ironchefpython
  • Thanks. I have never worked with Python before, but if it offers that much functionality, I thought I should switch to it and take advantage of that. So I wanted to know whether switching would be advantageous, or whether the two languages offer similar functionality. – CTsiddharth Feb 13 '12 at 05:11
  • I find Python as a language to be more natural, and more expressive. **But really, it's about the libraries**. If I had a problem to solve, and the best libraries were Java-based, I'd use a JVM-based language. – ironchefpython Feb 13 '12 at 05:13
  • Thanks for the link. My project aims to rank documents by their similarity to a reference document, in order to find the most relevant document in a local repository. Since it may be used in real time, I want it to be as effective as possible. – CTsiddharth Feb 13 '12 at 05:48
  • If you are curious about libraries, here is a link to another Stackoverflow post outlining a few places to look for either Python or Java based code: http://stackoverflow.com/questions/22904025/java-or-python-for-natural-language-processing/ – Nathaniel Payne Apr 07 '14 at 05:40