Compare two data frames and find number of nulls

Question

I have a problem. I heve data frame 1 named "df":

And I have the data frame 2 named "dfP1":

I want to compare the unique rows that exist in colum "Campo a Validar" from "dfP1" vs the columns in "df", if exist a coincidence that count the number of nulls in the colum that match. And then the number of nulls will insert in the data frame df into a new colum name "Numeros_de_nulos" but only in the row 0 (index 0).

Here is something that tried:

    #Validacion de Regla 1
if pd.isnull(df["Nº Línea Cliente"]).values.ravel().sum() > 0:
    nulos = pd.isnull(df["Nº Línea Cliente"]).values.ravel().sum()
    print("Hay {} valores nulos".format(nulos))
    dfP1['Numeros_de_Nulos'] = None

else:
    print ("No hay valores nulos")
dfP1.head()

please copy and paste your dataframe, and your expected output — ansev, Dec 06 '19 at 01:03
@ansev is right; it is a good idea to include your desired output in future posts because it helps us to provide you with what you need — David Golembiowski, Dec 06 '19 at 03:45

score 0 · Answer 1 · answered Dec 06 '19 at 01:47

I think I might have an answer.

# Count number of NULL values in column 'Nº Línea Cliente'
nulos = df['Nº Línea Cliente'].isnull().sum()

# If nulos is greater than zero
if nulos > 0:
    # Create a column of nulls
    dfP1['Numeros_de_Nulos'] = None
    # dfP1['Numeros_de_Nulos'] = 0
    # dfP1['Numeros_de_Nulos'] = np.NaN

    # Use DataFrame.loc[<index>, <column name>] to set a new value
    dfP1.loc[0, 'Numeros_de_Nulos'] = nulos

Output:

   ID_val  Tipo_Validacion     Campo_a_Validar Numeros_de_Nulos
0       1                1    Nº Línea Cliente                1
1       2                2    Nº Línea Cliente             None
2       3                3    Nº Línea Cliente             None
3       4                4    Nº Línea Cliente             None
4       5                1  TIPO DE GARANTIA 1             None

* More on pandas.DataFrame.loc() here.

@ansev `nulos` is an integer value, so I don't know how it would be able to apply the fillna() method? For the second comment, not using loc() generally raises a `SettingWithCopyWarning`. More on that here: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas — Mark Moretto, Dec 06 '19 at 12:06

David Golembiowski · Accepted Answer · 2019-12-06T03:59:55.773

It was a bit challenging to determine what you needed, but this might come close to your ideal solution.

import pandas as pd
from numpy import NaN

# Assuming that these dictionaries accurately reflect
# your DataFrames's contents, then the 
# following might work:

_df = {
    "c1":  [1.0, 3.0, 5.0, 7.0],
    "c2":  [1.0, 3.0, 5.0, 7.0],
    "c3":  [1.0, 3.0, 5.0, 7.0],
    "c4":  [1.0, 3.0, 5.0, 7.0],
    "Nº Línea Cliente": [
        "Hay algo",
        "Hay algo",
        "Hay algo",
        NaN],
    "c6":  [1.0, 3.0, 5.0, 7.0],
    "c7":  [1.0, 3.0, 5.0, 7.0],
    "c8":  [1.0, 3.0, 5.0, 7.0],
    "c9":  [1.0, 3.0, 5.0, 7.0],
    "c10": [1.0, 3.0, 5.0, 7.0],
}

Campo_a_Validar = [
        "Nº Línea Cliente"
        for campo in range(4)]
Campo_a_Validar.append("TIPO DE GARANTIA 1")

_dfP1 = {
    "ID_Val": [1,2,3,4,5],
    "Tipo_Validación": [1, 2, 3, 4, 1],
    "Campo_a_Validar": Campo_a_Validar,
}

# Initializing the DataFrames

df = pd.DataFrame(_df)
dfP1 = pd.DataFrame(_dfP1)

def analizar_para_nulos(_df_, _dfP1_):
    try:
        contar_nulos  = lambda DF, ColName: DF.groupby([ColName])[ColName].nunique()
        nulos_de_df   = contar_nulos(_df_, "Nº Línea Cliente")
        nulos_de_dfP1 = contar_nulos(_dfP1_, "Campo_a_Validar") 
        assert(
            nulos_de_df.values[0] == nulos_de_dfP1.values[0]
        )
        num_nulos = nulos_de_df
        return num_nulos.values[0]
    except AssertionError:
        return 0

# Check whether the number of unique rows is
# equal to the number of unique rows in
# the other table

is_coincidence = analizar_para_nulos(df, dfP1)

if is_coincidence:
    base = [is_coincidence]
    base.extend([""
        for position in range(len(df.c1) - 1)])
    num_columns = len(df.T)
    df.insert(
        loc=num_columns,
        column="Numeros_de_Nulos",
        value=base
    )
    print(df)
else:
    print(df)

Output:

    c1   c2   c3   c4 Nº Línea Cliente   c6   c7   c8   c9  c10 Numeros_de_Nulos
0  1.0  1.0  1.0  1.0         Hay algo  1.0  1.0  1.0  1.0  1.0                1
1  3.0  3.0  3.0  3.0         Hay algo  3.0  3.0  3.0  3.0  3.0                 
2  5.0  5.0  5.0  5.0         Hay algo  5.0  5.0  5.0  5.0  5.0                 
3  7.0  7.0  7.0  7.0              NaN  7.0  7.0  7.0  7.0  7.0

By the way, do the Spanish variables and functions help at all? — David Golembiowski, Dec 06 '19 at 03:42

Compare two data frames and find number of nulls

2 Answers2