
I have a list of queries and a list of documents like this:

queries = ['drug dosage form development Society', 'new drugs through activity evaluation of some medicinally used plants', ' Evaluation of drugs available on market for their quality, effectiveness']
docs = ['A Comparison of Urinalysis Technologies for Drugs Testing in Criminal Justice', 'development society and health care', 'buying drugs on the market without prescription may results in death', 'plants and their contribution in pharmacology', 'health care in developing countries']

I want to print a document as related if at least one word appears in both the query and the document. I tried the code below, based on an answer to the post "Python: finding substring within a list", but it did not work.

query = [subquery for subquery in queries]
for i in query:
    sub = i
    for doc in docs:
        if str(i) in docs:
            print docs

Any help is appreciated.

Community

2 Answers


Your code (for i in query:) searches for whole sentences, not words. To search for words, you first have to split each query sentence into words.

for q in queries:
    for word in q.split():  # split() with no argument also ignores extra spaces
        print(word)

Complete code:

for q in queries:
    for word in q.split():  # split() with no argument also ignores extra spaces
        for doc in docs:
            if word in doc:  # substring test: true if word occurs anywhere in doc
                print(doc)

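Note: the code above also matches common words such as "in", "for", "of", and "on". Because "word in doc" is a substring test, it matches inside longer words too (for example "on" inside "contribution"), and a document is printed once for every query word it contains.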
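One way around the problem in the note above is to compare whole words as sets and skip common words, which also keeps each document from being listed more than once per query. This is only a minimal Python 3 sketch: the stopword set is a small hand-picked assumption, and tokenize and related_docs are hypothetical helper names, not something from the original post.

```python
# Hypothetical stopword list; extend it to suit your data.
STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "their", "some", "through"}

def tokenize(text):
    """Lowercase the text, strip surrounding punctuation, and drop stopwords."""
    tokens = (t.strip(",.;:!?") for t in text.lower().split())
    return {t for t in tokens if t and t not in STOPWORDS}

def related_docs(query, docs):
    """Return the docs that share at least one whole word with the query."""
    query_words = tokenize(query)
    # A non-empty set intersection means at least one word is shared.
    return [doc for doc in docs if query_words & tokenize(doc)]

queries = ['drug dosage form development Society',
           'new drugs through activity evaluation of some medicinally used plants',
           ' Evaluation of drugs available on market for their quality, effectiveness']
docs = ['A Comparison of Urinalysis Technologies for Drugs Testing in Criminal Justice',
        'development society and health care',
        'buying drugs on the market without prescription may results in death',
        'plants and their contribution in pharmacology',
        'health care in developing countries']

for q in queries:
    print(q.strip(), "->", related_docs(q, docs))
```

Note that whole-word matching treats "drug" and "drugs" as different words, so the first query only matches 'development society and health care' here.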

Ravi Kumar

An efficient way of doing this is to build an inverted index, which maps each word to the documents that contain it. The one I've implemented below is a quick-and-dirty inverted index.

words = {}  # maps each word to the list of indexes of the docs containing it
for index, doc in enumerate(docs):
    for word in doc.split():  # split() with no argument never yields empty strings
        if word not in words:
            words[word] = [index]
        elif index not in words[word]:
            words[word].append(index)

for query in queries:
    matches = set()
    for word in words:
        if word in query:  # substring test against the whole query string
            matches.update(words[word])
    print(sorted(matches))  # indexes of the matching docs
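To print the documents themselves rather than their indexes, the indexes can be used to look the documents up in docs. The following is a minimal Python 3 sketch, not the code above: it assumes whole-word, case-insensitive matching instead of the substring test, stores the index entries as sets, and uses a hypothetical helper named search.

```python
queries = ['drug dosage form development Society',
           'new drugs through activity evaluation of some medicinally used plants',
           ' Evaluation of drugs available on market for their quality, effectiveness']
docs = ['A Comparison of Urinalysis Technologies for Drugs Testing in Criminal Justice',
        'development society and health care',
        'buying drugs on the market without prescription may results in death',
        'plants and their contribution in pharmacology',
        'health care in developing countries']

# Build the inverted index: word -> set of indexes of the docs containing it.
index = {}
for doc_id, doc in enumerate(docs):
    for word in doc.lower().split():
        index.setdefault(word, set()).add(doc_id)

def search(query):
    """Return the matching documents, each once, in their original order."""
    doc_ids = set()
    for word in query.lower().split():
        doc_ids |= index.get(word, set())  # unknown words contribute nothing
    return [docs[i] for i in sorted(doc_ids)]

for query in queries:
    print(search(query))
```

Because the matches are collected in a set, each document appears at most once per query, and search works the same whether queries holds one query or many.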

In an ideal world, your code would also include:

  • Stopwords - words in the documents that shouldn't be indexed, such as "for" or "the".
  • Stemming - mapping a word to its stem, allowing for alternate grammatical searches. For instance, running --> run, runs --> run, runner --> run. Thus, searching for any of these terms would bring up documents that contain the word run in any of its forms.
  • Synonyms - look up synonyms in WordNet or similar databases. E.g., vehicle would also bring up documents containing the word "car".
  • Relevance Ranking - the retrieved documents can be ranked by the frequency of the search term relative to the total number of words in the document.

All of the above can be added, as needed, as additional modules on top of the index and the search engine you're creating.
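As a rough illustration of how three of these pieces (stopwords, stemming, relevance ranking) might fit together, here is a minimal Python sketch. Everything in it is an assumption for illustration: the stopword set is hand-picked, crude_stem is a naive suffix stripper standing in for a real stemmer, and terms and rank are hypothetical helper names. Synonym lookup is left out because it needs an external resource such as WordNet.

```python
from collections import Counter

STOPWORDS = {"a", "and", "for", "in", "of", "on", "the", "their"}  # hypothetical list

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def terms(text):
    """Lowercase, split, strip punctuation, drop stopwords, and stem the rest."""
    words = (w.strip(",.;:!?") for w in text.lower().split())
    return [crude_stem(w) for w in words if w and w not in STOPWORDS]

def rank(query, docs):
    """Order matching docs by the share of their terms that occur in the query."""
    query_terms = set(terms(query))
    scored = []
    for doc in docs:
        counts = Counter(terms(doc))
        total = sum(counts.values())
        hits = sum(n for term, n in counts.items() if term in query_terms)
        if hits:
            scored.append((hits / float(total), doc))
    # Highest score first; docs with no shared term are left out.
    return [doc for score, doc in sorted(scored, key=lambda pair: -pair[0])]

queries = ['drug dosage form development Society',
           'new drugs through activity evaluation of some medicinally used plants',
           ' Evaluation of drugs available on market for their quality, effectiveness']
docs = ['A Comparison of Urinalysis Technologies for Drugs Testing in Criminal Justice',
        'development society and health care',
        'buying drugs on the market without prescription may results in death',
        'plants and their contribution in pharmacology',
        'health care in developing countries']

for doc in rank(queries[0], docs):
    print(doc)
```

With the crude stemmer, "drug" in the first query now also matches "drugs" in the documents, while "development" and "developing" still end up with different stems; a proper stemming algorithm would handle more cases.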

SashaZd
  • If I'm not mistaken, your code prints the indexes at which similar words exist in the queries (word[0] of query[0] matches word[1] of query[1]), based on the given example. But I want to print each doc in docs that contains a word the query contains. For example, only the doc 'plants and their contribution in pharmacology' is not printed for query 1 (query[0]), since it has no similar word. – Misganu Fekadu Jun 20 '16 at 09:13
  • My code prints the Doc index when the doc matches the query. You can use the index of the doc to retrieve the appropriate doc from the list. – SashaZd Jun 22 '16 at 16:09
  • So can I print it like this? for i in list(set(matches)): print docs[i] – Misganu Fekadu Jun 27 '16 at 08:40
  • I found two problems when I apply this code: for i in list(set(matches)): print docs[i]. 1) It does not return any document if there is only one query; that is, I have to have more than one query in queries. 2) It prints docs repeatedly until it reaches the end of the related docs. Could you check it with the code in this comment and suggest a solution? – Misganu Fekadu Jun 27 '16 at 09:18
  • I've solved the repetition problem by reducing the length of docs[i] to one: if len(docs[i]) > 1: continue – Misganu Fekadu Jun 28 '16 at 08:20