If I have a data frame

import pandas as pd
df = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]})

I can use Great Expectations to check the names and dtypes of the columns like so:

import great_expectations as ge

df_asset = ge.from_pandas(df)

# List of expectations
df_asset.expect_column_to_exist('A')
df_asset.expect_column_to_exist('B')
df_asset.expect_column_values_to_be_of_type('A', 'float')
df_asset.expect_column_values_to_be_of_type('B', 'float')

if df_asset.validate()["success"]:
    print("Validation passed")
else:
    print("Validation failed")

But how can I do a similar thing to check the index of the data frame? I.e. if the data frame was instead

df = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]}).set_index('A')

I am looking for something like

df_asset.expect_index_to_exist('idx')
df_asset.expect_index_values_to_be_of_type('idx', 'float')

that I could use in the list of expectations instead.

SultanOrazbayev
Elis
    It doesn't use Great Expectations, but you could use `assert` statements, provided the DF fits in memory: https://realpython.com/python-assert-statement/ and https://pandas.pydata.org/docs/reference/frame.html have starting points – Sarah Messer Jan 18 '23 at 15:01
    NB: `gx` is the standard import alias for `great_expectations`: https://docs.greatexpectations.io/docs/guides/connecting_to_your_data/in_memory/pandas/#2-instantiate-your-projects-datacontext – tdy Jan 25 '23 at 02:38

1 Answer

One quick hack is to use .reset_index to convert the index into a regular column:

import great_expectations as ge

df_asset = ge.from_pandas(df.reset_index())

# List of expectations
df_asset.expect_column_to_exist('A')
df_asset.expect_column_to_exist('B')
df_asset.expect_column_values_to_be_of_type('A', 'float')
df_asset.expect_column_values_to_be_of_type('B', 'float')

# index-related expectations
df_asset.expect_column_to_exist('index')
df_asset.expect_column_values_to_be_of_type('index', 'int')

if df_asset.validate()["success"]:
    print("Validation passed")
else:
    print("Validation failed")
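
To see which column names the index-related expectations should target, it can help to inspect what reset_index actually produces. The following is a minimal pandas-only sketch (Great Expectations is not involved); the variable names df_default, df_indexed, flat_default and flat_indexed are illustrative and do not come from the question.

```python
import pandas as pd

# Frame from the question with the default (unnamed) integer index.
df_default = pd.DataFrame({'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]})
# Frame from the question with column 'A' promoted to the index.
df_indexed = df_default.set_index('A')

# An unnamed index is turned into a column called 'index'.
flat_default = df_default.reset_index()
# A named index keeps its name, so 'A' comes back as a regular column.
flat_indexed = df_indexed.reset_index()

# Print column names and the dtype of the former index for both frames.
print(list(flat_default.columns), flat_default['index'].dtype)
print(list(flat_indexed.columns), flat_indexed['A'].dtype)
```

For the frame built with set_index('A'), the existing column expectations on 'A' would therefore already cover the index once it has been reset.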

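Note that the default name for an unnamed index is 'index', while a named index keeps its own name (after set_index('A'), the reset column is called 'A'). You can also set the name yourself with the names keyword argument of reset_index (this requires pandas>=1.5.0). Here is an example: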

df_asset = ge.from_pandas(df.reset_index(names='custom_index_name'))

This is useful when you want to avoid clashes with existing column names. The same approach works for a MultiIndex: pass a list of custom names, one per index level.
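
As a rough illustration of the multi-level case, here is a minimal pandas-only sketch. The frame df_multi, its index values and the level names 'group' and 'position' are made up for this example. A list is used for names because the pandas documentation describes the parameter as a str or 1-dimensional list, and the call assumes pandas>=1.5.0.

```python
import pandas as pd

# Hypothetical frame with an unnamed two-level MultiIndex.
df_multi = pd.DataFrame(
    {'A': [1.1, 2.2, 3.3], 'B': [4.4, 5.5, 6.6]},
    index=pd.MultiIndex.from_tuples([('x', 1), ('x', 2), ('y', 1)]),
)

# One custom name per index level (the names keyword needs pandas>=1.5.0).
flat_multi = df_multi.reset_index(names=['group', 'position'])

# The former index levels now appear as ordinary, named columns.
print(list(flat_multi.columns))
```

The resulting columns 'group' and 'position' could then be checked with the same column expectations used for 'A' and 'B'.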

SultanOrazbayev