How do I denormalize my relational data for AWS CloudSearch documents?

Question

AWS CloudSearch expects you will send them flattened documents of your data to index for search which look something like:

[
 {"type": "add",
  "id":   "123456",
  "fields": {
     "account_id": "123456",
     "name": "foo",
     "addresses": []
  }
 }
]

Let's assume I have a database with an accounts table and an addresses table.

There are many addresses for each account. The addresses table has the fields:

address_1
address_2
city
state
zip
account_id (reference field)

How would I denormalize addresses in the CloudSearch document structure so that I can search across all of the columns in accounts and addresses?

Or should I be creating a separate search domain for each table?

alexroussos · Answer 1 · 2015-05-05T15:03:19.970

I'm assuming your use cases to be:

Retrieving addresses by their account_id
Retrieving account_ids by an address
Finding accounts in a particular city/state/zip

I would recommend the following two things:

Index each address as a separate document

I would index each address as a separate document. Having a separate doc for each address will enable you to keep the relationships between different fields (which you would lose if you had an array of cities and an array of states for each account).
Index each field separately

I would index each field (city, state, etc) separately. Breaking out each field will enable you to search them independently (eg get all the addresses in Cleveland, OH), use them as facets, boost scores based on them, etc.

Here's an example of some documents in my proposed schema:

[
 {"type": "add",
  "id":   "<see below>",
  "fields": {
     "account_id": "123456",
     "name": "John Smith",
     "address_1": "1 Main St",
     "address_2": "Apt 1",
     "city": "Davenport",
     "state": IA,
     "zip": 52081
  }
 },
 {"type": "add",
  "id":   "<see below>",
  "fields": {
     "account_id": "123456",
     "name": "John Smith",
     "address_1": "2 Elm St",
     "city": "Lincoln",
     "state": NE,
     "zip": 23452
  }
 }
]

Generating Document IDs:

Note that you'd need some non-random way to construct unique document_ids (unique per account+address, not just per account). Something like the account_id plus a hash of the address,city,state,zip would work, or you could add another column to your table to uniquely identify them (I prefer the latter).

I should also mention that in my case, there are three other tables that are related to the account (account, users, addresses, orders). In order to do what you suggest, it would mean creating a single document structure for all possible fields that need to be indexed. This would mean that for an "address" document, the fields for the other tables could be null and vice versa. See this post: http://stackoverflow.com/questions/15797190/aws-cloudsearch-different-documents-in-1-domain?rq=1 It mentions how nulls are not good and that it's better to put separate doc types in diff domains. — doremi, May 05 '15 at 16:09
That might have been good to mention in your question but I'm not sure it affects anything. It's impossible to say without knowing more about your use case and your requirements around joining data from across domains. I would generally say that Users, Addresses and Orders are three conceptually different things that should each be given their own search domains. You should be looking the data up afterwords in your data store based on search results rather than treating CloudSearch as a datastore, so don't index data that isn't needed for search purposes. — alexroussos, May 05 '15 at 17:59

How do I denormalize my relational data for AWS CloudSearch documents?

1 Answers1