0

I have a bunch of files in a folder. Let's assume I convert all into plain text files.

I want to use python to perform searches like this:

query = '(word1 and word2) or (word3 and not word4)'

The actual logic varies, and multiple words can be used together. Another example:

query = '(shiny and glass and "blue car")'

Also, the words are provided by the users, so they are variables.

I want to display the sentences that matched and the filenames. This really does not need a complex search engine like Whoosh or Haystack, which need to index files with fields. Also, those tools do not seem to support a boolean query like the one I described above. I've come across the pdfquery library, which does exactly what I want for PDFs, but now I need that for text files and XML files.

Any suggestions?

max
  • is the query known to be safe? `eval` would provide an easy out here, but if this is user input it's exceedingly dangerous – Adam Smith Apr 28 '16 at 23:38
  • Is that query supposed to be interpreted with old-school search engine-like semantics, where `word` implicitly means "`word` is in the document"? – user2357112 Apr 28 '16 at 23:38
  • The user would type in the words and the semantics (AND, OR, NOT, parentheses). – max Apr 28 '16 at 23:40
  • 1
    @max then you're gonna have to write a lexer. Good luck! – Adam Smith Apr 28 '16 at 23:40
  • I don't want to re-invent the wheel. I have nothing against using whoosh or haystack, but after running my eyes through whoosh docs, I could not see any info or example of lexer. – max Apr 28 '16 at 23:47
  • Related: https://stackoverflow.com/questions/10281863/in-python-how-can-i-query-a-list-of-words-to-match-a-certain-query-criteria – Daniel F Nov 20 '18 at 17:05

3 Answers

1

There's no easy way to say this, but this is not easy. You're trying to translate unsafe strings into executable code, so you can't take the easy way out and use eval. These aren't literals, so you can't use ast.literal_eval either. You need to write a lexer that recognizes things like AND, NOT, OR, (, and ) and treats them as something other than strings. On top of that, you apparently need to handle compound booleans, so this becomes quite a bit more difficult than you might think.
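To make the lexer-plus-parser idea concrete, here is a minimal stdlib-only sketch. The names (`tokenize`, `parse`, `matches`) and the grammar are my own illustration, not a library API; error handling for malformed queries is omitted to keep it short.

```python
import re

# Tokens: parentheses, quoted phrases, or bare words/operators.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^()\s]+')

def tokenize(query):
    return TOKEN.findall(query)

def parse(tokens):
    """Recursive-descent parser; returns a predicate over a sentence.

    Grammar: expr   := term ('or' term)*
             term   := factor ('and' factor)*
             factor := 'not' factor | '(' expr ')' | word | "phrase"
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        pred = term()
        while peek() and peek().lower() == "or":
            eat()
            pred = (lambda a, b: lambda s: a(s) or b(s))(pred, term())
        return pred

    def term():
        pred = factor()
        while peek() and peek().lower() == "and":
            eat()
            pred = (lambda a, b: lambda s: a(s) and b(s))(pred, factor())
        return pred

    def factor():
        if peek().lower() == "not":
            eat()
            inner = factor()
            return lambda s: not inner(s)
        if peek() == "(":
            eat()
            inner = expr()
            eat()  # the closing ')'
            return inner
        word = eat().strip('"').lower()   # bare word or quoted phrase
        return lambda s: word in s.lower()

    return expr()

def matches(query, sentence):
    return parse(tokenize(query))(sentence)
```

Both of the queries from the question then work as substring tests per sentence, e.g. `matches('(shiny and glass and "blue car")', some_sentence)`.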

Your question asks about searching by sentence, which is not how Python operates. You'd have to write another lexer to get the data by-sentence instead of by-line. You'll need to read heavily into the io module to do this effectively. I don't know how to do it off-hand, but essentially you'll be looping while there is data to read, reading a buffersize each iteration, and yielding each time you reach a sentence-ending pattern like "\.(?=\s+)".

Then you'll have to run the results of your query lexer through a set of list comprehensions, each one running across the results of the file lexer.

Adam Smith
0

I really needed such a solution, so I made a Python package called toned.

I hope it will be useful to others as well.

max
0

Maybe I've answered this question too late, but I think the best way to handle complex boolean search expressions is this implementation built on Pyparsing.

As you can see in its description, all of these cases are covered:

SAMPLE USAGE:

from booleansearchparser import BooleanSearchParser

bsp = BooleanSearchParser()
text = "wildcards at the begining of a search term "
exprs = [
    "*cards and term",  # True
    "wild* and term",   # True
    "not terms",        # True
    "terms or begin",   # False
]
for expr in exprs:
    print(bsp.match(text, expr))

# non-western samples
text = "안녕하세요, 당신은 어떠세요?"
exprs = [
    "*신은 and 어떠세요",  # True
    "not 당신은",          # False
    "당신 or 당",          # False
]
for expr in exprs:
    print(bsp.match(text, expr))

It allows wildcard, literal and not searches nested in as many parentheses as you need.
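If you only need the wildcard part of that behaviour, it can be approximated in plain Python with the stdlib fnmatch module: a term like "wild*" matches a whole word rather than a substring. This is a stand-alone sketch of mine, not BooleanSearchParser's actual code, and `term_matches` is an illustrative name.

```python
from fnmatch import fnmatchcase

def term_matches(term, text):
    """True if any whole word of `text` matches the wildcard `term`.

    fnmatchcase is used (rather than fnmatch) so matching does not
    depend on the OS; both sides are lowercased instead.
    """
    return any(fnmatchcase(word, term.lower()) for word in text.lower().split())
```

So `term_matches("*cards", ...)` is true for a text containing "wildcards", while `term_matches("begin", ...)` stays false against "begining", matching the whole-word semantics shown in the sample above.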

xecgr