
I want to use the great expectations package to validate that a column in a .csv file only contains integers.

The file I am using has only integers in the age column except for one row which has a '`' character instead. This is what I want the expectation to catch. I have also checked the .csv file in a text editor and can confirm that the ages in the age column are not enclosed in quotes.

However, the expectation fails on 100% of the data. I think this is because pandas reads the column in as the object type (i.e. as strings) due to the one incorrect row. I can't preprocess with something like .astype(int) because it will fail on that row, and wrapping .astype(int) in a try block would completely defeat the purpose of using great expectations for this.
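
To illustrate what I mean, here is a quick check in plain pandas (the inline CSV mirrors bad.csv below):

import io
import pandas as pd

# One stray non-numeric value makes pandas fall back to dtype 'object'
# for the whole column, so every value is read in as a string.
csv = "age,name\n34,Fred\n`,Bob\n54,Mary\n"
df = pd.read_csv(io.StringIO(csv))
print(df["age"].dtype)     # object
print(type(df["age"][0]))  # <class 'str'> -- even the valid ages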

Here is a minimal working example:

good.csv:

age,name
34,Fred
22,Bob
54,Mary

bad.csv:

age,name
34,Fred
`,Bob
54,Mary

Code:

import great_expectations as ge

my_df = ge.read_csv("./good.csv")
my_df.expect_column_values_to_be_of_type('age', 'int')

my_df = ge.read_csv("./bad.csv")
my_df.expect_column_values_to_be_of_type('age', 'int')

The first case returns

{'success': True,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'unexpected_percent_nonmissing': 0.0,
  'partial_unexpected_list': []}}

So all the ages are ints and it succeeds on every row. I expect the second case to fail, but only on the second row. However, it fails on all the rows:

{'success': False,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 3,
  'unexpected_percent': 1.0,
  'unexpected_percent_nonmissing': 1.0,
  'partial_unexpected_list': ['34', '`', '54']}}

So I guess I expect something like this:

{'success': False,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 1,
  'unexpected_percent': 0.33,
  'unexpected_percent_nonmissing': 0.33,
  'partial_unexpected_list': ['`']}}

Is there something I am doing wrong, or is the package just not capable of this?

  • In a pandas dataframe there is only one type. Hence in your second dataframe, where you have a non-numeric value, the entire column gets cast as a string. – Scott Boston Oct 31 '18 at 13:51
  • Just my suggestion: kindly remove the pandas tag, since this is nothing to do with pandas. – BENY Oct 31 '18 at 13:55
  • @ScottBoston that's not true. A column can contain multiple types, but when you read a file into a `DataFrame`, `pandas` will choose only one (or treat your column values as `object`). – user3471881 Oct 31 '18 at 14:09
  • `df["age"] = [34, "`", 54]` would result in `type(df["age"][0]) == int`. – user3471881 Oct 31 '18 at 14:12
  • @user3471881 Type object is a string... your 34 and 54 are type object which is a string. – Scott Boston Oct 31 '18 at 14:18
  • `object` can contain multiple types. `df["age"] = [34, "`", 54]` would result in `type(df["age"][0]) == int`, but `df["age"].dtype == object`. – user3471881 Oct 31 '18 at 14:23
  • @W-B it is pandas under the hood that is casting everything as a string (i.e. this just uses pandas read_csv function which is where the issue lies) – Dan Oct 31 '18 at 14:57
  • @ScottBoston no, that's not true. pandas read_csv will make everything a string, but if you create the dataframe yourself from a list with mixed types, it will keep the mixed types. – Dan Oct 31 '18 at 14:59
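
A quick sketch of the distinction drawn in these comments (plain pandas, using the backtick value from bad.csv):

import pandas as pd

# A column built directly from a mixed Python list keeps the original
# types, even though the column's dtype is reported as 'object'.
df = pd.DataFrame({"age": [34, "`", 54]})
print(df["age"].dtype)     # object
print(type(df["age"][0]))  # <class 'int'> -- values keep their types
# read_csv, by contrast, parses the whole column as strings when any
# value fails numeric conversion.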

4 Answers


This appears to be an error that is known but not yet resolved (as of 9/2018):

https://github.com/great-expectations/great_expectations/issues/110

Developers are actively working on improving behavior:

  1. expect_column_values_to_be_of_type (the current expectation) has been cleaned up to now be closer to what we think people expect, and we are planning to rename it expect_column_values_to_be_stored_as_type

  2. We plan to introduce a new expectation expect_column_values_to_be_parseable_as_type that accepts: string, float, int, bool (others?) and focuses on a more semantically meaningful understanding of the type (i.e. if a value is a string but the string is "1.0" then it will pass for string, float, and int).

And hopefully will have results soon:

Implementation on 1 and 2 is about to start, but we're still open to suggestions...
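
In the meantime, the "parseable as" idea from point 2 can be approximated in plain pandas. This is a rough sketch; parseable_as_int is a hypothetical helper, not part of the package:

import pandas as pd

# A value counts as "parseable as int" if numeric conversion succeeds
# and the result has no fractional part (so "1.0" passes, "`" fails).
def parseable_as_int(series):
    converted = pd.to_numeric(series, errors="coerce")
    return converted.notnull() & (converted % 1 == 0)

s = pd.Series(["34", "`", "1.0", "54"])
print(parseable_as_int(s).tolist())  # [True, False, True, True]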

ASGM

The problem is that your values aren't actually integers; they are strings. When you read your csv with great_expectations.read_csv(), it uses pandas.read_csv() internally, which will default the data in your age column to strings because of the `.

The method expect_column_values_to_be_of_type() is very basic, meaning that it only checks whether each value is the type you're looking for (via isinstance()), not whether it "could" be that type.
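
A sketch of that behaviour in plain pandas (per the comments below, this is roughly the check ge runs under the hood):

import io
import pandas as pd

# After read_csv even the valid ages are Python strings, so an
# isinstance(x, int) style check rejects every row.
df = pd.read_csv(io.StringIO("age,name\n34,Fred\n`,Bob\n54,Mary\n"))
print(df["age"].map(lambda x: isinstance(x, int)).tolist())
# [False, False, False]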

user3471881
  • Yes I know it is converting them to strings, that's what I was hoping there was a work around for. I think `expect_column_values_to_be_of_type` is just a confusing name since it sounds like it checks the types of the actual values not the whole column. And the values in the csv *are* ints, it is just the read_csv function that messes them up. – Dan Oct 31 '18 at 15:02
  • It does check the types of the actual values but the type of the actual values is `str` not `int` :) I agree with you that the naming is off. – user3471881 Oct 31 '18 at 15:07
  • The actual code that `ge` uses when you call that method is: `df["age"].map(lambda x: isinstance(x, int))` – user3471881 Oct 31 '18 at 15:11
  • The types of the values in the df are strings, yes, but that's because ge made them strings. The 'types' in the .csv, i.e. the data I'm checking, are ints. – Dan Oct 31 '18 at 16:33
  • I guess that is a philosophical question :D – user3471881 Oct 31 '18 at 17:09

As a workaround until the new expect_column_values_to_be_parseable_as_type is implemented, I can achieve the same result using a regex expectation instead:

my_df = ge.read_csv("bad.csv")
pattern = r'^\d+$'
result  = my_df.expect_column_values_to_match_regex('age',
                                                    pattern,
                                                    result_format={'result_format': 'COMPLETE'})

result
# In my case I'm only interested in where it went wrong
# print(result['result']['unexpected_list'])
# print(result['result']['unexpected_index_list'])

which gives:

{'success': False,
 'result': {'element_count': 3,
  'missing_count': 0,
  'missing_percent': 0.0,
  'unexpected_count': 1,
  'unexpected_percent': 0.3333333333333333,
  'unexpected_percent_nonmissing': 0.3333333333333333,
  'partial_unexpected_list': ['`'],
  'partial_unexpected_index_list': [1],
  'partial_unexpected_counts': [{'value': '`', 'count': 1}],
  'unexpected_list': ['`'],
  'unexpected_index_list': [1]}}

Note that if you want to allow a leading + or -, you need to change the pattern to:

pattern = r'^[+-]?\d+$'
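
A quick sanity check of the extended pattern (with some hypothetical sample values):

import re

pattern = r'^[+-]?\d+$'
for value in ['34', '-5', '+10', '`', '1.0']:
    print(value, bool(re.match(pattern, value)))
# 34 True, -5 True, +10 True, ` False, 1.0 False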
Dan

I am not familiar with great_expectations, but you can solve this in plain pandas like so:

import pandas as pd

df = pd.read_csv("bad.csv")
# keep only the rows where the age column cannot be converted to a number
print(df[pd.to_numeric(df['age'].fillna(0), errors='coerce').isnull()]['age'].tolist())

Output for good.csv

[]

Output for bad.csv

['`']
Mohamed Thasin ah
  • Sure, I would probably just do `df['age'].str.isnumeric()` so I can report back the row numbers as well. But I would like to use great expectations for its other types of checks too, including distributions, so that one framework manages all my validation. So I'd rather wait to hear whether this is possible in great expectations, or whether I should raise an issue about it. – Dan Oct 31 '18 at 13:50
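
A sketch of that str.isnumeric() variant, which also reports the offending row indices (plain pandas; the inline CSV mirrors bad.csv):

import io
import pandas as pd

# Rows whose age value is not purely numeric, with their indices.
df = pd.read_csv(io.StringIO("age,name\n34,Fred\n`,Bob\n54,Mary\n"))
bad = df[~df['age'].str.isnumeric()]
print(bad.index.tolist())   # [1]
print(bad['age'].tolist())  # ['`']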