
I have recently been trying to create a Python program that takes a word and lists all of its synonyms. Here is the code I'm using:

from urllib import quote_plus
import urllib2
import re

def get_search_result(key):
    # Fetch the synonyms.net page for the word and read the raw HTML.
    page = urllib2.urlopen('http://www.synonyms.net/synonym/%s' % quote_plus(key)).read()
    words_ = []
    words = []
    # Grab the text between "Synonyms:&nbsp;" and "Antonyms", strip the tags,
    # and split the comma-separated entries into individual words.
    for i in [re.sub('<.*?>', '', i) for i in re.findall('Synonyms:&nbsp;(.*?)Antonyms', page)]:
        words_.extend(i.split(', '))
    # Remove duplicates while preserving order.
    for i in words_:
        if i not in words:
            words.append(i)
    return words

if __name__ == '__main__':
    res = get_search_result('sack')
    print res, len(res)

if __name__ == '__main__':
    res = get_search_result('sack')
    print res, len(res)

The thing is, while it works, it is INCREDIBLY slow: it took a minute to give me an answer. My question: is there a better way of doing this? Right now it uses synonyms.net and scrapes the HTML of the page. The problem is that synonyms.net is itself slow.
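
A quick way to confirm where the time goes is to time the download and the parsing separately. The following is only a minimal sketch against the same URL as above, using nothing beyond the standard library:

from urllib import quote_plus
import urllib2
import re
import time

def timed_lookup(key):
    url = 'http://www.synonyms.net/synonym/%s' % quote_plus(key)
    t0 = time.time()
    page = urllib2.urlopen(url).read()  # network round-trip plus download
    t1 = time.time()
    matches = re.findall('Synonyms:&nbsp;(.*?)Antonyms', page)  # local regex parsing
    t2 = time.time()
    print 'fetch: %.2fs  parse: %.4fs  matches: %d' % (t1 - t0, t2 - t1, len(matches))

if __name__ == '__main__':
    timed_lookup('sack')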

I have looked into the synonyms.net API. It seemed to be exactly what I needed, as it was very fast (it returned the list in 0.23 seconds). The only problem is that, at the bottom of the page, in small print, it says 'The Synonyms API service is free to use for up to 1,000 queries per day'. That limit is lifted if you buy the product, but buying something requires money, and I don't really want to pay $10 a month for a program that gives me synonyms.

I have also looked into http://thesaurus.com. Because the code is flexible, I quickly modified it to use that site instead. It was better, taking only 10 seconds to respond, but that is still not suitable. Thesaurus.com does not have an API, as far as a quick search of the site showed. The final solution, the one that would be guaranteed to work, would be to make my own synonym list and have the program parse it. However, that option seems messy and not very favorable. Does anyone have any alternatives that would at least be faster than 10 seconds?
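
For comparison, the "own synonym list" option would look something like the sketch below. It assumes a hypothetical plain-text file, synonyms.txt, with one entry per line in the form word: syn1, syn2, syn3; both the file name and the format are made up for illustration.

def load_synonyms(path='synonyms.txt'):
    # Build a dict mapping each word to its list of synonyms.
    table = {}
    with open(path) as f:
        for line in f:
            if ':' not in line:
                continue  # skip blank or malformed lines
            word, rest = line.split(':', 1)
            table[word.strip().lower()] = [s.strip() for s in rest.split(',') if s.strip()]
    return table

if __name__ == '__main__':
    table = load_synonyms()
    print table.get('sack', [])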

Thanks in advance!

Xyene
  • My first piece of advice: [don't parse HTML with regular expressions](http://stackoverflow.com/a/1732454/960195). There are plenty of Python HTML parsing libraries available. – Adam Mihalcin Mar 19 '12 at 23:43
  • Surely synonyms won't change very quickly in meaningful ways. So why not purchase a text document that contains this info and do all your queries on that single text file? Also, if you need to scale up to perform synonym queries for lots of words, it seems like the perfect thing to write using [MRJob](https://github.com/Yelp/mrjob). – ely Mar 19 '12 at 23:46
  • So, you want the web site to provide a service to your application, which generates no revenue for the site's owners, without charging you anything? I'm not saying that finding synonyms of words is a major task (it's really not), but nothing's free. You should look into using an offline dictionary instead of an online service. – Borealid Mar 19 '12 at 23:49
  • @AdamMihalcin and @EMS Thanks, but the gist of the program (which is for homework) is to do it in under 100kb. I could easily rip a copy of synonyms.net, or use the MS Word DLLs, but that would almost surely require over 100kb. That is also why I am refraining from using any more libraries than are absolutely necessary. – Xyene Mar 19 '12 at 23:49
  • So you're looking for a free dictionary. I don't see how that's a Python question. – Manish Mar 19 '12 at 23:50
  • @Borealid Yes, I know I sound cheap, but an offline dictionary would be too big, and paying $10 for a 'simple' program doesn't sound like such a good deal. – Xyene Mar 19 '12 at 23:51
  • @EMS In what possible way is this task a good fit for the map/reduce paradigm? It's a straight relational join (word list to synonyms ON word). That's the absolute worst thing to do with an MR framework - numerous papers have been written on how to make it work at all, and pretty much every performance-sensitive application does replication joins instead of true MR joins. – Borealid Mar 19 '12 at 23:51
  • @Manish True, but if someone saw the Python code and said "Hey, he's not doing it the most efficient way!", then I would know that the way to fix it would be to fix the Python code. – Xyene Mar 19 '12 at 23:52
  • @Nox If the problem is the inclusion of the dictionary with the program, why don't you put the dictionary on your own server? Then you're paying the exact minimum that it costs to provide the services your application needs. – Borealid Mar 19 '12 at 23:53
  • @EMS Yes, it is hell. And yes, 100kb of RAM, and no additional files except the one Python program. 100kb is easy to achieve if using an online dictionary. – Xyene Mar 19 '12 at 23:53
  • @Nox 100kb is a ridiculously large amount of (compressed) text. You couldn't get a really detailed thesaurus in there, but a basic synonym finder for common English words would definitely fit. – Borealid Mar 19 '12 at 23:55
  • @Borealid Thanks! I hadn't thought of that. I do have a server, and putting the dictionary on it wouldn't be too hard. But there's a drawback: I was exaggerating greatly when I said it was an easy job to rip a copy off of some website. I can't even think of how to begin doing it. – Xyene Mar 19 '12 at 23:55
  • @EMS And what is the reduce job? That is to say, what makes it a good fit for *the map/reduce framework* instead of just splitting the file into chunks and processing them in parallel (which is what a mapper alone does)? – Borealid Mar 19 '12 at 23:56
  • @Borealid That would be a temporary solution, but only, as you said, for common words. As this program is supposed to be "professional", it has to be able to answer synonyms for words such as 'truancy'. Plus, we are only allowed ONE file, unless I were to embed the entire dictionary inside the Python script itself. But then it couldn't be compressed... – Xyene Mar 19 '12 at 23:57
  • @Nox http://en.wikipedia.org/wiki/WordNet will probably do. – Borealid Mar 19 '12 at 23:58
  • @Borealid Interesting... I have looked into that before, but I cannot seem to get it working. When searching for "sack", which is a pretty common word, it says it cannot find an entry, and that the browser is limited to the starting site. – Xyene Mar 20 '12 at 00:00
  • @Nox I'm not sure what you mean by "the browser". You want to download the WordNet database and then use it in your application (doesn't matter if it's on the server or client, either way). See something like http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html (a minimal WordNet sketch follows these comments). – Borealid Mar 20 '12 at 00:02
  • @Borealid I retract my statement, it DOES work. However, it has way too many redirects for HTML parsing to be of any use whatsoever. – Xyene Mar 20 '12 at 00:03
  • I see. But that still violates the no-external-file rule. – Xyene Mar 20 '12 at 00:04
  • On the non-Python side: if you opt for an internet-based solution, thesaurus.com has a mobile version of its site at http://m.dictionary.com/t/. It should be both faster to retrieve and a simpler document to parse. – deinonychusaur Mar 20 '12 at 03:02
  • @deinonychusaur Thanks! You might want to post that as the answer, because I think it works. It retrieves a word in 1 second! Thanks to everyone who commented, you helped me a lot! – Xyene Mar 20 '12 at 03:25
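
To make the WordNet suggestion from the comments concrete: with the NLTK 3.x API (which needs the nltk package and a one-time download of the WordNet data, so it does give up the single-file constraint discussed above), a minimal offline lookup looks roughly like this:

# Requires: pip install nltk, then a one-time nltk.download('wordnet').
from nltk.corpus import wordnet

def get_synonyms(word):
    synonyms = set()
    for synset in wordnet.synsets(word):      # every sense of the word
        for name in synset.lemma_names():     # lemma names within that sense
            synonyms.add(name.replace('_', ' '))
    synonyms.discard(word)                    # drop the word itself
    return sorted(synonyms)

if __name__ == '__main__':
    print get_synonyms('sack')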

1 Answer


Reposting my comment since it seems to fix the issue:

thesaurus.com also has a mobile version at http://m.dictionary.com/t/. Using it should speed up the network traffic, and mobile versions also make the HTML much, much easier to parse.
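
A minimal sketch of fetching that mobile page, in the same style as the question's code. The URL pattern (the word appended to the /t/ path) and the extraction step are assumptions; the tag-stripping below is only a placeholder until you inspect the mobile page's actual markup:

from urllib import quote_plus
import urllib2
import re
import time

def fetch_mobile_thesaurus(word):
    url = 'http://m.dictionary.com/t/%s' % quote_plus(word)  # assumed URL pattern
    t0 = time.time()
    page = urllib2.urlopen(url).read()
    print 'fetched %d bytes in %.2fs' % (len(page), time.time() - t0)
    # Placeholder extraction: strip all tags to get plain text.
    # Replace this with a pattern matched to the mobile page's real markup.
    return re.sub('<.*?>', ' ', page)

if __name__ == '__main__':
    text = fetch_mobile_thesaurus('sack')
    print text[:200]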

deinonychusaur