0

I have on multiple occasions met problems where all but one field is defined in a search query (some against databases, some not), but it is not known at the time of indexing which field will be unknown. My usual approach is to create an index for every field and do set operations on them, or in the case of SQL, index each field and then select. For a lot of fields, this may or may not be better than a linear search.

But if I know only one field will be missing at a time, is it possible to do better than this in terms of time complexity, or in terms of disk access? Perhaps a data structure with which I am not familiar?

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
SeanTater
  • 182
  • 1
  • 5
  • Can you post the time-complexity of your current approach? E.g. what type of indexing are you using? What set intersection algorithm is in use? (i.e. http://stackoverflow.com/questions/4642172/computing-set-intersection-in-linear-time) – Peter Dixon-Moses Dec 03 '15 at 18:25

1 Answers1

1

Taking inspiration from Lucene, you could start with an inverted index structure on disk, and then step a leapfrog iterator across all fields in question.

E.g.

Sample Docs

{"id": 1, "network": "NBC", "show": "Wheel of Fortune", "host": "Pat Sajak", "sponsor": "NordicTrack"}

{"id": 2, "network": "NBC", "show": "Jeopardy", "host": "Alex Trebek", "sponsor": "IBM Watson"}

{"id": 3, "network": "NBC", "show": "The Wizard of Odds", "host": "Alex Trebek", "sponsor": "NordicTrack"}

Inverted indices

network := "NBC" (1,2,3)

show := "Jeopardy" (2), "The Wizard of Odds" (3), "Wheel of Fortune" (1)

host := "Alex Trebek" (2,3), "Pat Sajak" (1)

sponsor := "IBM Watson" (2), "NordicTrack" (1,3),

Query issued for:

Network: NBC

Host: Alex Trebek

Sponsor: NordicTrack

Show: Unknown

Query execution

Iteration Step 1

network := "NBC" (1...

  • update consensus id to 1

  • advance to consensus id in all other queried fields (if exists)


Iteration Step 2

network := "NBC" (1...

host := "Alex Trebek" (2...

Fault: id:2 is higher than consensus id ,

  • update consensus id to 2

  • advance to new consensus id in all other queried fields (if exists)


Iteration Step 3

network := "NBC" (1, 2...

host := "Alex Trebek" (2...

sponsor := "NordicTrack" (1, 3)

Fault: id:3 is higher than consensus id,

  • update consensus id to 3

  • advance to new consensus id in all other queried fields (if exists)


Iteration Step 4

network := "NBC" (1, 2, 3)

host := "Alex Trebek" (2, 3)

sponsor := "NordicTrack" (1, 3)

Match: All queried fields concur,

  • add 3 to list of matches

...

With linear iteration, the number of comparisons performed by leapfrog is the sum of the length of the postings lists for all fields queried.

But the number of comparisons can be reduced using Skip Lists. (Though this requires fast random access to postings).

Community
  • 1
  • 1
Peter Dixon-Moses
  • 3,169
  • 14
  • 18