By documents, I mean each line item in the Azure Search index.

I frequently need to see what is present in my search index and modify or delete some line items. I have been trying to find Python scripts or methods that make this easy, but there seems to be no straightforward way to do it.

Ask: I basically want to export all rows, or some rows (based on filters on filterable fields), from the search index I already have in place. I just want to visualize everything in the index as a dataframe or JSON.

I have already seen the below resources:

  1. SearchIndexClient Class documentation - no method to achieve this
  2. SearchClient Class documentation - no method to achieve this
  3. Export data from an Azure Cognitive Search index - Azure Samples - is not in Python

I have already looked into this answer from 2019, but it's too complicated because:

  1. It has variables like facet_value and facet_fields. The page size is set to 1000 (is this a limitation?).
  2. I don't understand what page_size means in relation to the Document object, which is what Azure Search calls each row in the index.

No other answer on Stack Overflow really addresses this.

newbie101

1 Answer

Based on the information provided, to export documents from a search index to a dataframe, you can first call the search method from the azure.search.documents library with a filter and then convert the results to a dataframe using pandas.

I created an index with the schema below:

index_definition = {
    "name": index_name,
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True, "searchable": False, "facetable": False, "filterable": False},
        {"name": "category", "type": "Edm.String", "searchable": True, "facetable": True, "filterable": True},
        {"name": "brand", "type": "Edm.String", "searchable": True, "facetable": True, "filterable": True},
        {"name": "price", "type": "Edm.Double", "searchable": False, "facetable": True, "filterable": True},
        {"name": "rating", "type": "Edm.Int32", "searchable": False, "facetable": True, "filterable": True}
    ]
}

I uploaded the following data to the index:

[
  {
    "id": "1",
    "category": "Laptops",
    "brand": "Brand A",
    "price": 800,
    "rating": 4
  },
  {
    "id": "2",
    "category": "Laptops",
    "brand": "Brand B",
    "price": 1000,
    "rating": 5
  },
  {
    "id": "3",
    "category": "Smartphones",
    "brand": "Brand C",
    "price": 600,
    "rating": 4
  }
]

With the code below, I was able to create a dataframe from the search results:

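from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
import pandas as pd

# service_name, index_name and api_key must be defined for your own search service
endpoint = f"https://{service_name}.search.windows.net/"
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=AzureKeyCredential(api_key))

# "*" matches every document; the filter restricts the results to one category
search_text = "*"
results = search_client.search(search_text=search_text, filter="category eq 'Laptops'", include_total_count=True)

# Each result is dict-like, so the results can be loaded straight into a dataframe
df = pd.DataFrame(results)
df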

Output: (image of the resulting dataframe)

With filter: (image of the filtered dataframe)

Because the dataframe is built from search results, it contains some extra columns, which can be removed as needed.
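As a rough sketch of removing those extra columns: in my experience the metadata keys on each result start with "@search." (for example "@search.score"), so one option is to drop every key with that prefix before building the dataframe. The helper name strip_search_metadata and the sample_results list are made up for illustration; sample_results only stands in for list(results).

```python
def strip_search_metadata(docs):
    """Return copies of the result dicts without keys starting with '@search.'."""
    return [
        {key: value for key, value in doc.items() if not key.startswith("@search.")}
        for doc in docs
    ]


# Hypothetical stand-in for list(results); real results carry keys such as '@search.score'.
sample_results = [
    {"id": "1", "category": "Laptops", "brand": "Brand A", "@search.score": 1.0},
    {"id": "2", "category": "Laptops", "brand": "Brand B", "@search.score": 1.0},
]

clean_docs = strip_search_metadata(sample_results)
```

Passing clean_docs to pd.DataFrame should then give only the index fields as columns.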

To answer your other question,

In a search query, the number of results retrieved defaults to 50, and the maximum per page is 1000.

For more details, please check the Search Documents reference (Query Parameters -> $top).

RishabhM
  • Thanks. Is there no way to extract all documents without knowing how many documents the filter will return? Defaulting to 50 is a problem, right? What if my filter matches 237 documents: how do I extract all 237 in that case? – newbie101 Aug 02 '23 at 12:02
  • You can try passing the "top" parameter with a higher value to the search method. [link](https://learn.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.searchclient?view=azure-python#azure-search-documents-searchclient-search) – RishabhM Aug 02 '23 at 12:47