I am working on a personal project on data retrieval, for my own interest. I have one text file in the following format.
.I 1
.T
experimental investigation of the aerodynamics of a
wing in a slipstream . 1989
.A
brenckman,m.
.B
experimental investigation of the aerodynamics of a
wing in a slipstream .
.I 2
.T
simple shear flow past a flat plate in an incompressible fluid of small
viscosity .
.A
ting-yili
.B
some texts...
some more text....
.I 3
...
".I 1" indicates the beginning of the chunk of text corresponding to doc ID 1, and ".I 2" indicates the beginning of the chunk of text corresponding to doc ID 2.
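For illustration, the splitting implied by this format could be sketched as below. This is only a minimal sketch under assumptions: the names DocSplitter and split are hypothetical, and it assumes that any line consisting of a dot followed by a single capital letter (.T, .A, .B) is a field marker that can be dropped.

```java
import java.util.*;

final class DocSplitter {
    // Splits the raw collection text into docId -> body text,
    // where docId is the number that follows ".I".
    static Map<Integer, String> split(String raw) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        Integer current = null;
        StringBuilder body = new StringBuilder();
        for (String line : raw.split("\\R")) {
            if (line.startsWith(".I ")) {
                // a new document starts: store the previous one, if any
                if (current != null) docs.put(current, body.toString().trim());
                current = Integer.parseInt(line.substring(3).trim());
                body.setLength(0);
            } else if (current != null && !line.matches("\\.[A-Z]")) {
                // assumption: ".T", ".A", ".B" are field markers, not content
                body.append(line).append('\n');
            }
        }
        if (current != null) docs.put(current, body.toString().trim());
        return docs;
    }
}
```

Keeping the documents in a map instead of separate files is just a convenience for the sketch; writing each entry to its own file would work the same way.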
So far I have:
- split the docs and put them in separate files
- deleted stopwords (and, or, while, is, are, ...)
- stemmed the words to get the root of each (achievement, achieve, achievable, ... are all converted to achiv, and so on)
- and finally created a TreeMultimap which looks like this:
{key: word} {values: a list of [docID, frequency of that word in that docID] pairs}
aerodynam [[Doc_00001,5],[Doc_01344,4],[Doc_00123,3]]
book [[Doc_00562,6],[Doc_01111,1]]
....
....
result [[Doc_00010,5]]
....
....
zzzz [[Doc_01235,1]]
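The structure above could be sketched in plain Java roughly as follows. This is only an illustration: it uses a nested TreeMap from java.util as a stand-in for the TreeMultimap, and the names InvertedIndex, add and docsFor are hypothetical.

```java
import java.util.*;

final class InvertedIndex {
    // term -> (docId -> frequency of the term in that document);
    // the outer TreeMap keeps the terms in sorted order.
    final TreeMap<String, Map<String, Integer>> postings = new TreeMap<>();

    // Adds the terms of one document (assumed to be already stemmed
    // and free of stopwords), counting one occurrence per list entry.
    void add(String docId, List<String> terms) {
        for (String t : terms) {
            postings.computeIfAbsent(t, k -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Returns the ids of the documents containing the term
    // (an empty set if the term is not in the index).
    Set<String> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptyMap()).keySet();
    }
}
```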
Now my question. Suppose the user wants to know:
- which documents contain achieving and book?
- which documents contain achieving and skills but not book nor video?
- which documents include Aerodynamic?
- and some other simple queries like these
(Input) So suppose she enters:
- achieving AND book
- (achieving AND skills) AND (NOT (book AND video))
- Aerodynamic
- .....and some other simple queries
(Output)
- [Doc_00562,6],[Doc_01121,5],[Doc_01151,3],[Doc_00012,2],[Doc_00001,1]
- ....
As you can see, there might be:
- precedence modifiers (parentheses nested to an unknown depth)
- precedence among AND, OR, NOT
- and some other interesting challenges and issues
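One common way to handle nesting and operator precedence like this is a small recursive-descent parser that evaluates the query directly into sets of doc IDs. The sketch below is only one possible approach, under assumptions: the class name BooleanQuery is hypothetical, the index is simplified to term -> set of doc IDs (frequencies are ignored), NOT binds tighter than AND, which binds tighter than OR, and query terms are assumed to be stemmed the same way as the index keys before they reach the parser.

```java
import java.util.*;

final class BooleanQuery {
    private final Map<String, Set<String>> index; // term -> doc ids (simplified index)
    private final Set<String> allDocs;            // every doc id, needed to evaluate NOT
    private List<String> tokens;
    private int pos;

    BooleanQuery(Map<String, Set<String>> index, Set<String> allDocs) {
        this.index = index;
        this.allDocs = allDocs;
    }

    Set<String> eval(String query) {
        // pad parentheses with spaces so that they become separate tokens
        String spaced = query.replace("(", " ( ").replace(")", " ) ").trim();
        tokens = new ArrayList<>(Arrays.asList(spaced.split("\\s+")));
        pos = 0;
        Set<String> result = parseOr();
        if (pos != tokens.size())
            throw new IllegalArgumentException("Unexpected token: " + tokens.get(pos));
        return result;
    }

    // or := and ("OR" and)*
    private Set<String> parseOr() {
        Set<String> left = parseAnd();
        while (peek("OR")) { pos++; left.addAll(parseAnd()); }
        return left;
    }

    // and := not ("AND" not)*
    private Set<String> parseAnd() {
        Set<String> left = parseNot();
        while (peek("AND")) { pos++; left.retainAll(parseNot()); }
        return left;
    }

    // not := "NOT" not | "(" or ")" | term
    private Set<String> parseNot() {
        if (pos >= tokens.size())
            throw new IllegalArgumentException("Unexpected end of query");
        String t = tokens.get(pos++);
        if (t.equals("NOT")) {
            // complement relative to the whole collection
            Set<String> complement = new TreeSet<>(allDocs);
            complement.removeAll(parseNot());
            return complement;
        }
        if (t.equals("(")) {
            Set<String> inner = parseOr();
            if (!peek(")")) throw new IllegalArgumentException("Missing ')'");
            pos++;
            return inner;
        }
        // plain term: copy the posting set so callers can modify the result
        return new TreeSet<>(index.getOrDefault(t, Collections.emptySet()));
    }

    private boolean peek(String s) {
        return pos < tokens.size() && tokens.get(pos).equals(s);
    }
}
```

Because each grammar rule calls the next tighter one, parentheses of any depth are handled by the recursion itself. Ranking the result by frequency, as in the sample output, would be a separate step on top of this.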
So, I would like to run the queries against the TreeMultimap, look up the words (keys), and return the values (lists of docs) to the user.
How should I think about this problem, and how should I design my solution? What articles or algorithms should I read? Any ideas would be appreciated. (Thanks for reading this long post.)