
I have two list objects: wiki_text and corpus. wiki_text is made up of short phrases and corpus is made up of long sentences.

wiki_text = ['never ending song of love - ns.jpg',
 'ecclesiological society',
 "1955-56 michigan wolverines men's basketball team",
 'sphinx strix',
 'petlas',
 '1966 mlb draft',
 ...]

corpus = ['Substantial progress has been made in the last twenty years',
          'Patients are at risk for prostate cancer.',...]

My goal is to create a filter that keeps the elements of wiki_text which appear as a substring of at least one element in corpus. For example, if 'ecclesiological society' occurs as part of a sentence in corpus, it should be kept in the final result. The final result should be a subset of the original wiki_text. The following code is what I used before:

def wiki_filter(wiki_text, corpus):
    result = []
    for i in wiki_text:
        for e in corpus:
            if i in e:
                result.append(i)
                break
    return result

However, given the lengths of wiki_text and corpus (each > 10 million elements), this function takes extremely long to run. Is there a better way to solve this problem?

James Chang
  • Could you split your corpus and run your function on different threads? (https://stackoverflow.com/questions/2846653/how-to-use-threading-in-python#28463266) – aloisdg Sep 07 '18 at 05:58
  • I personally am not an expert on search algorithms, but the reason it takes so long is that you are looking at O(N^2) and dealing with large data sets. I wonder if you could speed it up by implementing some form of Bloom filter algorithm, hashing, etc., i.e. hash every word in the corpus and check whether all words of each wiki_text element exist before doing a brute-force search to see if they occur in the corpus in that particular order. That would hopefully remove many wiki_text elements from the search entirely (based on preclusion of certain words)? – Jesse Sep 07 '18 at 06:02
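For reference, a rough sketch of the word-level prefilter Jesse describes might look like the following. It is only a sketch: a plain split() leaves punctuation attached to corpus words (e.g. "cancer."), which can wrongly reject a phrase, so a real version would need proper tokenization.

corpus_words = set()
for sentence in corpus:
    corpus_words.update(sentence.split())

# Keep only phrases whose every word occurs somewhere in the corpus; the
# exact (and slow) substring check then runs on far fewer candidates.
candidates = [p for p in wiki_text if all(w in corpus_words for w in p.split())]
result = [p for p in candidates if any(p in s for s in corpus)]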

3 Answers


Let's see if flashtext can help here.

First, pip install flashtext. Then build a KeywordProcessor object, add your wiki_text phrases as keywords, and call extract_keywords to filter your corpus strings.

from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor()
for w in wiki_text:
    keyword_processor.add_keyword(w)

filtered_corpus = [c for c in corpus if keyword_processor.extract_keywords(c)]

Unfortunately, the flashtext API doesn't yet have a "has_keyword" method, so you will need to test the truthiness of the temporary list that extract_keywords returns and then discard it. If you're up to it, you can contribute to the project on GitHub.
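Since the question actually asks for a subset of wiki_text rather than of corpus, here is a small follow-up sketch built on the same keyword_processor: extract_keywords returns the keywords it finds, so collecting them recovers the matching phrases.

matched = set()
for c in corpus:
    matched.update(keyword_processor.extract_keywords(c))

# Preserve the original wiki_text order.
filtered_wiki_text = [w for w in wiki_text if w in matched]

Keep in mind that flashtext matches whole, word-bounded keywords (case-insensitively by default), so the result can differ slightly from a raw substring check.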

cs95

To make it really fast, I would suggest a somewhat unorthodox approach, namely using Lucene (PyLucene if you are restricted to Python).

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. PyLucene is a Python extension for accessing Java Lucene™. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python.

This is how I would do it: index the corpus sentences, each sentence in a separate record. Then, using Lucene's search functionality, search for each of the phrases in your wiki_text with a string query.

Now, this approach is not the easiest or most straightforward one to use, but in my opinion it would be one of the fastest. You would probably be able to perform millions of searches (wiki_text phrases) over millions of records (corpus) in a matter of minutes. So if @coldspeed's flashtext solution satisfies your needs, go for it; otherwise, give Lucene a try!
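Purely as an illustration, a rough sketch of that indexing-and-searching flow in PyLucene might look something like the code below. Treat it as an assumption-laden sketch: class names and the TopDocs details shift between Lucene versions, and "corpus_index" and the field name "text" are just placeholders.

import lucene
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig, DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import MMapDirectory

lucene.initVM()
analyzer = StandardAnalyzer()
directory = MMapDirectory(Paths.get("corpus_index"))

# Index each corpus sentence as its own document.
writer = IndexWriter(directory, IndexWriterConfig(analyzer))
for sentence in corpus:
    doc = Document()
    doc.add(TextField("text", sentence, Field.Store.NO))
    writer.addDocument(doc)
writer.close()

# Run each wiki_text entry as a quoted phrase query; keep it if anything matches.
searcher = IndexSearcher(DirectoryReader.open(directory))
parser = QueryParser("text", analyzer)
result = []
for phrase in wiki_text:
    query = parser.parse('"%s"' % QueryParser.escape(phrase))
    if len(searcher.search(query, 1).scoreDocs) > 0:
        result.append(phrase)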


How fast can the regex engine work here? You can try:

import re
re.findall('|'.join(map(re.escape, wiki_text)), '\n'.join(corpus))
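Note that findall returns the matched substrings from corpus rather than a filtered wiki_text, so one extra step is needed to recover the subset. A sketch (with the caveats that a single alternation over 10 million phrases may be extremely slow, and that overlapping phrases can shadow one another within one alternation):

import re

pattern = '|'.join(map(re.escape, wiki_text))
matched = set(re.findall(pattern, '\n'.join(corpus)))

# Every match is a literal wiki_text entry, so membership gives the subset.
filtered_wiki_text = [w for w in wiki_text if w in matched]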
Onyambu