
I have a dataset similar to the one below. It consists of individual pages of Word documents; each record holds the file name, the page number and the full text of that page.

{
  "_id": "4b36u6vwkZH16H5vmc24sBfuZk0CRqfP",
  "_rev": "1-r5WQDAJPPuUP0oLapZrMiMRd6rOaTIz9",
  "FILE_NAME": "sample.doc",
  "PAGE_NUM": 1,
  "PAGE_FULLTEXT": "hello world"
},
{
  "_id": "nDIKw5JUWFWVD8m7HEODMa1vNI5gFEXS",
  "_rev": "1-nEp7zsuaneJj2AInyPpeBWDNP90ZGpWQ",
  "FILE_NAME": "sample.doc",
  "PAGE_NUM": 2,
  "PAGE_FULLTEXT": "this is john doe"
},
{
  "_id": "vCTlNbNk3X893FkWSYnn87L9j371taYZ",
  "_rev": "1-oJPspiBHRPeT99m8VPV9qoDTTBoJ9tVK",
  "FILE_NAME": "sample-2.doc",
  "PAGE_NUM": 1,
  "PAGE_FULLTEXT": "this is another document"
},
{
  "_id": "2FSDuaEa5bYtP2l7lEgMnqMnqsZpMJUs",
  "_rev": "1-ZQRkvfMluu0NQWYH2FUATuXy9uNtOGyk",
  "FILE_NAME": "sample-2.doc",
  "PAGE_NUM": 2,
  "PAGE_FULLTEXT": "page 2 of sample-2.doc"
},
{
  "_id": "RET7G6hUU9zSplgW7FIXWKwIVex2NEmI",
  "_rev": "1-mlryGv830RNllPwFT7JDDvJoKXuvxAXD",
  "FILE_NAME": "sample-3.doc",
  "PAGE_NUM": 1,
  "PAGE_FULLTEXT": "hello lionel"
},
{
  "_id": "VBL6BJBevcvUc6EsJ68bAjHuGRJ6zvMt",
  "_rev": "1-fPIJQHKCB2WitR74l1X8I6TOBMhMeCWF",
  "FILE_NAME": "sample-3.doc",
  "PAGE_NUM": 2,
  "PAGE_FULLTEXT": "page hello 2 of sample-3.doc"
}

So far I have managed a query similar to SELECT DISTINCT with COUNT by following the post How do I do the SQL equivalent of "DISTINCT" in CouchDB?

Now the problem is how to search through the dataset and group the results by FILE_NAME (output similar to that of the SQL query SELECT DISTINCT FILE_NAME WHERE PAGE_FULLTEXT like "%hello%").

Gerard Cruz

1 Answer


The usual equivalent of DISTINCT in CouchDB is a MapReduce view queried with group_level=1 or group=true.
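As a rough illustration of that approach, here is a minimal sketch of a map function keyed on FILE_NAME, together with a local simulation of what a _count reduce returns when queried with group=true. The emit stand-in, the groupCount helper and the trimmed sample documents are assumptions made so the snippet runs outside CouchDB; in a real design document only the body of map would be stored, with "_count" as the reduce.

```javascript
// Local stand-in for the emit() function that CouchDB provides at index time.
const rows = [];
function emit(key, value) {
  rows.push({ key, value });
}

// Map function: one row per page, keyed by the file the page belongs to.
function map(doc) {
  if (doc.FILE_NAME) {
    emit(doc.FILE_NAME, 1);
  }
}

// Hypothetical helper that mimics the built-in _count reduce with group=true:
// it collapses the rows to one entry per distinct key.
function groupCount(viewRows) {
  const counts = {};
  for (const row of viewRows) {
    counts[row.key] = (counts[row.key] || 0) + 1;
  }
  return Object.keys(counts).sort().map((key) => ({ key, value: counts[key] }));
}

// Sample documents from the question, reduced to the relevant fields.
const docs = [
  { FILE_NAME: "sample.doc", PAGE_NUM: 1 },
  { FILE_NAME: "sample.doc", PAGE_NUM: 2 },
  { FILE_NAME: "sample-2.doc", PAGE_NUM: 1 },
  { FILE_NAME: "sample-2.doc", PAGE_NUM: 2 },
  { FILE_NAME: "sample-3.doc", PAGE_NUM: 1 },
  { FILE_NAME: "sample-3.doc", PAGE_NUM: 2 },
];

docs.forEach(map);
const grouped = groupCount(rows);
console.log(grouped);
```

Each distinct FILE_NAME appears once in the grouped output, with the number of pages as its value.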

But the bigger part of your problem is the WHERE PAGE_FULLTEXT like "%hello%" bit. MapReduce views are not suited to fuzzy matching of the kind you have indicated.

Luckily, Cloudant has Cloudant Search, which allows full-text indexes to be created. A Cloudant Search index is defined by a JavaScript function (much like a MapReduce map function) that calls the index function to declare the fields to be indexed. At its simplest, using your sample data, an indexing function would be:

function(doc) {
  index("default", doc.PAGE_FULLTEXT);
}

which indexes each document's PAGE_FULLTEXT into the default field.
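Since the question also asks for results per FILE_NAME, one possible extension is to index that field as well, with the store and facet options so it can be returned and counted. The sketch below is an assumption about how the index might be extended, not part of the original answer; the local index stand-in and the name indexPage exist only so the snippet can be run outside Cloudant, where the function would be anonymous.

```javascript
// Local stand-in for the index() function that Cloudant Search provides
// at index time; it simply records each call so the result can be inspected.
const indexed = [];
function index(name, value, options) {
  indexed.push({ name, value, options: options || {} });
}

// Hypothetical extended indexing function.
function indexPage(doc) {
  // Guard against documents that lack the fields (design documents, for example).
  if (typeof doc.PAGE_FULLTEXT === "string") {
    index("default", doc.PAGE_FULLTEXT);
  }
  if (typeof doc.FILE_NAME === "string") {
    // store: return the value with each hit; facet: allow counts per value.
    index("FILE_NAME", doc.FILE_NAME, { store: true, facet: true });
  }
}

indexPage({ FILE_NAME: "sample.doc", PAGE_NUM: 1, PAGE_FULLTEXT: "hello world" });
console.log(indexed);
```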

Once indexed, the search index can be queried with /_design/yourdesigndoc/_search/yourindexname?q=hello+world to return the documents that best match the string "hello world".
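To get something closer to the DISTINCT FILE_NAME output, the query can ask for facet counts through the counts parameter. The sketch below only builds such a URL; it assumes FILE_NAME has been indexed with facet: true, and the host name, the database name pages and the helper buildSearchUrl are hypothetical. Note that Cloudant Search matches on analyzed tokens, so q=hello behaves like a word match and not exactly like the SQL pattern "%hello%".

```javascript
// Hypothetical helper that assembles a Cloudant Search URL.
function buildSearchUrl(baseUrl, db, designDoc, indexName, params) {
  const query = new URLSearchParams(params).toString();
  return (
    `${baseUrl}/${encodeURIComponent(db)}` +
    `/_design/${encodeURIComponent(designDoc)}` +
    `/_search/${encodeURIComponent(indexName)}?${query}`
  );
}

const url = buildSearchUrl(
  "https://example.cloudant.com", // assumed account URL
  "pages",                        // assumed database name
  "yourdesigndoc",
  "yourindexname",
  {
    q: "hello",
    // counts takes a JSON array of faceted field names.
    counts: JSON.stringify(["FILE_NAME"]),
  }
);

console.log(url);
```

The response to such a request would then carry a counts object listing each FILE_NAME that has at least one matching page.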

Glynn Bird