13

I have a few documents in a folder and I want to check whether all of them are indexed. To do so, for each document name in the folder, I would like to loop over the documents indexed in ES and compare. So I want to retrieve all the documents.

There are a few possible duplicates of this question, such as "retrieve all records in a (ElasticSearch) NEST query" and another linked question, but they didn't help me because the documentation has changed since then (there is nothing about scan in the current documentation).

I tried using `client.Search<T>()`, but as per the documentation only a default of 10 results is retrieved. I would like to get all the records without specifying the size (because the size of the index changes).

Or is it possible to get the size of the index first and then pass that number as the size, to get all the documents and loop through them?

ASN
  • Did you try using scroll? https://www.elastic.co/guide/en/elasticsearch/client/net-api/1.x/scroll.html – Russ Cam Jun 14 '16 at 00:56
  • Hi Russ. I tried using it and was able to get the scrollId. Once I have a scrollId, I don't know how to run the search query again (which will generate some more scrollIds, I believe) until I have retrieved the full list of documents. I didn't find any example of this in NEST. (I was checking the 2.x version of the documentation. Anyway, I will try it with the example given in the link you have posted.) Thanks. – ASN Jun 14 '16 at 01:19
  • The link in the first comment has an example - it executes a search specifying search type of `scroll`, then uses the scroll id to get the first page of results. It then loops to get all documents, using the scroll id returned from the last response. You can also use `fields` in conjunction to get say only one field of the document back for each result, rather than returning the whole document – Russ Cam Jun 14 '16 at 01:20
  • Tried it and it's working. Thanks a ton, Russ. But `SearchType(Nest.SearchType.Scan)` doesn't seem to work; I had to use `SearchType(Elasticsearch.Net.SearchType.Scan)`. After using the scrolls, do I have to delete them, or will they be cleared after the specified time? – ASN Jun 14 '16 at 01:24
  • https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context – Russ Cam Jun 14 '16 at 01:26
  • @RussCam Is it possible to get multiple fields of the document for each result? I'm using something like this: `.Fields(f=>f.Field(fi=>fi.FilePath).Field("File.ModifiedDate"))`, but it doesn't seem to be working. Can you please help me solve this? More info at this [link](http://s31.postimg.org/67oh6xlzv/fields.png). I was able to get the complete document and read the fields from there, but retrieving the entire document makes no sense to me as I don't need the other fields. I just want the file path and modified date fields. Did I miss anything in the above code for `Fields`? – ASN Jun 14 '16 at 03:28
  • That's the correct use of fields, but you need to make sure the casing is correct e.g. `file.modifiedDate` (if that is a field that exists in the index) – Russ Cam Jun 14 '16 at 03:45
  • @RussCam Is it possible to retrieve the ids and reindex the documents using these ids? (The ids are auto-generated, so there is no Id field in my Document class.) – ASN Jun 14 '16 at 04:10
  • Yes, the ids are available on `ISearchResponse.Hits` – Russ Cam Jun 14 '16 at 05:00

2 Answers

20

Here is how I solved my problem. Hope this helps. (References https://www.elastic.co/guide/en/elasticsearch/client/net-api/1.x/scroll.html , https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-search-context)

List<string> indexedList = new List<string>();

// Open a scroll; with the scan search type the initial response carries only a scroll id, no hits.
var scanResults = client.Search<ClassName>(s => s
    .From(0)
    .Size(2000)
    .MatchAll()
    .Fields(f => f.Field(fi => fi.propertyName)) // I used Fields to get only the value I needed rather than the whole document
    .SearchType(Elasticsearch.Net.SearchType.Scan)
    .Scroll("5m")
);

// Fetch the first page, then keep scrolling until a page comes back empty.
var results = client.Scroll<ClassName>("10m", scanResults.ScrollId);
while (results.Documents.Any())
{
    foreach (var doc in results.Fields)
    {
        indexedList.Add(doc.Value<string>("propertyName"));
    }

    results = client.Scroll<ClassName>("10m", results.ScrollId);
}
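
For readers on a cluster where the scan search type is not available (it was deprecated in Elasticsearch 2.1 and removed in 5.0), here is a hedged sketch of the same loop without it. In that case the initial search response already contains the first page of hits, which is the "missing first page" issue raised in the comments under this answer. The sketch assumes a NEST 2.x-style client with a default index configured; `IndexedFile`, `FilePath`, the page size of 1000 and the "2m" keep-alive are hypothetical placeholders, and the `ClearScroll` call shape should be checked against the NEST version in use.

```csharp
using System.Collections.Generic;
using System.Linq;
using Nest;

// Hypothetical document type; only FilePath is read in this sketch.
public class IndexedFile
{
    public string FilePath { get; set; }
}

public static class ScrollSketch
{
    // Collects FilePath from every document by paging through a scroll.
    // Assumes the client was created with a default index.
    public static List<string> GetAllFilePaths(IElasticClient client)
    {
        var paths = new List<string>();
        string scrollId = null;

        try
        {
            // Without the scan search type, this response already holds page one.
            var response = client.Search<IndexedFile>(s => s
                .Size(1000)      // documents per page, not a total limit
                .MatchAll()
                .Scroll("2m"));  // keep the search context alive between pages

            while (response.IsValid && response.Documents.Any())
            {
                // Consume the current page before asking for the next one.
                paths.AddRange(response.Documents.Select(d => d.FilePath));

                scrollId = response.ScrollId;
                response = client.Scroll<IndexedFile>("2m", scrollId);
            }
        }
        finally
        {
            // Release the search context early instead of waiting for the timeout.
            if (scrollId != null)
            {
                client.ClearScroll(c => c.ScrollId(scrollId));
            }
        }

        return paths;
    }
}
```

The keep-alive only needs to cover the time spent processing one page, since each scroll call renews it, so the same value can be used on every request.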

EDIT

var response = client.Search<Document>(s => s
                         .From(fromNum)
                         .Size(PageSize)
                         .Query(q => q ....
ASN
  • I don't quite understand your logic. Your first MatchAll query will return only 2000 documents. Are you scrolling over only 2000 docs? What if I have 5000 docs? – Emil Aug 18 '16 at 15:03
  • @batmaci The number 2000 is not the total number of records. It is the number of records fetched on each request, and the scroll stays valid for the time given in the scroll parameter. (E.g. first I fetch records 0-10, then records 11-20, and so on; 2000 is just an example.) – ASN Aug 19 '16 at 02:03
  • Do you know whether it is OK to set this to a high number, or could that cause performance issues? As far as I know, 10 000 is the maximum count for a search query. – Emil Aug 19 '16 at 07:23
  • I'm not so sure about the performance, but I think it will be slow for sure, and you will also need to increase the scroll time significantly. – ASN Aug 19 '16 at 07:41
  • @ASN What's the difference in the scroll times where you first use "5m", then use "10m"? What do they each do? – wnbates Jul 05 '17 at 10:30
  • @wnbates The scroll parameter tells Elasticsearch to keep the search context open for another 5m or 10m. – ASN Jul 06 '17 at 02:40
  • @asn That doesn't explain why there are two different times? – wnbates Jul 07 '17 at 09:58
  • But we need to save the first page too, don't we? Right now we lose the first page. Fix: ```var results = client.Scroll("10m", scanResults.ScrollId);``` -> ```var results = scanResults.Documents;``` – razon Mar 29 '18 at 13:50
  • Wouldn't it be safer to invoke `client.ClearScroll(results.ScrollId)` at the end of the `while` loop? – remio Jan 28 '20 at 07:47
  • The first elements are missing. Take a look at https://stackoverflow.com/a/56261657/1966464 – GermanSniper Oct 07 '22 at 14:18
-4

You can easily perform the following to get all records in index:

var searchResponse = client.Search<T>(s => s
    .Index("IndexName")
    .Query(q => q.MatchAll())
);

var documents = searchResponse.Documents.Select(f => f.fieldName).ToList();
TheZerg