Df1:
  Name    Emp_Id   Design  Salary
0 Rick    F912-1   Engg    4500
1 Monty   56Df1    Doc     6700
2 Chris   J45-12   Arch2   7800.0
3 Julie   Klgi     Engg-A  KL00
4 Mashy1  5670     Law_y   4100.0

Df2:

COL_NAME   DATATYPE
Name       string
Emp_Id     AlphaNum
Design     string
Salary     Floatorint

How can I loop over the dataframe and check whether each value's datatype is valid? I want to validate every column's values against the datatype information in Df2 and return the values whose datatype is wrong. For example, a string column must contain only letters (no numbers), and an AlphaNum value must combine at least one number and one letter.

Output:

Col_name  value    dtype   Row_Num
Name       Mashy1  string     4
Emp_Id     Klgi    AlphaNum   3
Emp_Id     5670    AlphaNum   4
Design     Arch2   string     2
Salary     KL00    Floatorint 3

2 Answers


While this can be solved using regex, I opted for a Python solution, borrowing from here.

First, a helper function to determine if a string contains a digit.

def hasNumbers(inputString):
    # True if any character in the string is a digit
    return any(char.isdigit() for char in inputString)
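
For example, on values from the question:

print(hasNumbers("Mashy1"))  # True
print(hasNumbers("Julie"))   # False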

Second, a function to classify objects into your stated types.

import numpy as np

def determine_type(x):
    # check if x is a number (plain Python or numpy scalar)
    if isinstance(x, (int, float, np.int64, np.float64)):
        return "int or float"

    # check if x is a string
    elif isinstance(x, str):
        # a string containing at least one digit counts as alphanumeric
        if hasNumbers(x):
            return "alphanumeric"
        # otherwise it's a plain string
        return "string"

    # anything else is left unclassified
    return None

Note the edge case here of an int or float being cast as a string: "5" is alphanumeric.
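
A quick check of the classifier on values from the question:

print(determine_type(4500))      # int or float
print(determine_type("Engg"))    # string
print(determine_type("F912-1"))  # alphanumeric
print(determine_type("5"))       # alphanumeric, the edge case above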

Let's build our dataframe and convert everything that can be parsed as a number to a number:

"""
  name    emp_id   design  salary
0 Rick    F912-1   Engg    4500
1 Monty   56Df1    Doc     6700
2 Chris   J45-12   Arch2   7800.0
3 Julie   Klgi     Engg-A  KL00
4 Mashy1  5670     Law_y   4100.0
"""
df1 = pd.read_clipboard().reset_index(drop=True)
df1 = df1.applymap(lambda x: pd.to_numeric(x, errors="ignore"))

print(df1)

Output:

     name  emp_id  design salary
0    Rick  F912-1    Engg   4500
1   Monty   56Df1     Doc   6700
2   Chris  J45-12   Arch2   7800
3   Julie    Klgi  Engg-A   KL00
4  Mashy1    5670   Law_y   4100

Now, let's melt this dataframe into the format you requested, and apply our custom function.

# melt the df
df1_melted = df1.reset_index().melt(id_vars="index").sort_values("index")

# apply our function
df1_melted["type"] = df1_melted["value"].apply(determine_type)
print(df1_melted.head(10))

Output:

    index variable   value          type
0       0     name    Rick        string
5       0   emp_id  F912-1  alphanumeric
10      0   design    Engg        string
15      0   salary    4500  int or float
1       1     name   Monty        string
6       1   emp_id   56Df1  alphanumeric
11      1   design     Doc        string
16      1   salary    6700  int or float
2       2     name   Chris        string
17      2   salary    7800  int or float
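
To produce the requested output, the inferred types still need to be compared with Df2. A minimal sketch, assuming df2 holds the Df2 table with its COL_NAME values normalized to match df1's column names, and using an assumed label_map between Df2's labels and the ones returned by determine_type:

# assumed mapping from Df2's labels to determine_type's labels
label_map = {"string": "string", "AlphaNum": "alphanumeric", "Floatorint": "int or float"}
expected = df2.set_index("COL_NAME")["DATATYPE"].map(label_map)

# keep only the rows whose inferred type disagrees with the expected one
bad = df1_melted[df1_melted["variable"].map(expected) != df1_melted["type"]]
bad = bad.rename(columns={"variable": "col_name", "index": "row_num"})
print(bad[["col_name", "value", "type", "row_num"]])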

The bulk of the work here was writing (hacking) a custom function to determine datatypes. I really recommend sticking with the built-in types, either pure Python or numpy. If something is a string, then you can do further analysis.


Here is a solution with regex. First, create the regex patterns and put the match methods of the precompiled regex objects into df2:

import re

rnum = r"^(?=\d*\.?\d*$)[\d.]+$"                    # digits with at most one decimal point
ralpha = r"(?i)^(?=.*[-_a-z])(?=.*\d)[-_a-z\d]+$"   # at least one letter (or -/_) and one digit
rstr = r"(?i)^[-_a-z]+$"                            # letters (plus - and _) only, no digits

df2["rx"] = df2.DATATYPE.replace({"string": re.compile(rstr).match,
                                  "AlphaNum": re.compile(ralpha).match,
                                  "Floatorint": re.compile(rnum).match})

Then create a dict from them and use it to transform df1. Each cell is passed to its column's match method, which returns None when the value does not fit the pattern, so .isna() marks exactly the invalid cells:

d = dict(df2[["COL_NAME", "rx"]].values)
r = df1[df1.transform(d).isna()]  # keep only the cells that failed their check

    Name Emp_Id Design Salary
0     NaN    NaN    NaN    NaN
1     NaN    NaN    NaN    NaN
2     NaN    NaN  Arch2    NaN
3     NaN   Klgi    NaN   KL00
4  Mashy1   5670    NaN    NaN

Formatting, and adding a "data_type" column:

r = r.stack().rename_axis(index=["row_num", "col_name"]).reset_index(name="value")
r["data_type"] = df2.set_index("COL_NAME").loc[r.col_name, "DATATYPE"].values
r = r.reindex(columns=["col_name", "value", "data_type", "row_num"])

  col_name   value   data_type  row_num
0   Design   Arch2      string        2
1   Emp_Id    Klgi    AlphaNum        3
2   Salary    KL00  Floatorint        3
3     Name  Mashy1      string        4
4   Emp_Id    5670    AlphaNum        4 

Note: You can use non-regex functions in df2.rx, too.
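
For instance, a plain-Python numeric check could stand in for the rnum pattern. A minimal sketch, where is_number is a hypothetical helper that returns None on failure so the .isna() mask still works:

def is_number(x):
    # True for values float() accepts; None otherwise
    # (None, like a failed regex match, becomes NaN in the mask)
    try:
        float(x)
        return True
    except (TypeError, ValueError):
        return None

d["Salary"] = is_number  # swap in the plain function for the Salary column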
