
This is a clarification/restatement of a recent question/answer I posted. I'm wondering if my solution is the simplest or most efficient option.

Q: Separate column containing some missing values

I have a dataframe with three columns: df.location, with comma-separated latitude-longitude coordinates in string form; df.target, a target variable with integers between 1 and 5 currently formatted as floats; and df.null, a column that's mostly nan but also contains a mix of latitude-longitude coordinates and floats between 1 and 5.

Here's an example df:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
)

For every row where there's a non-missing value in df.null, the values in df.target and df.location are missing. (I have no idea how this came about, but I checked the raw JSON I read into the Pandas DataFrame, and sure enough this null key pops up frequently when location and target are missing.) Here's a screenshot of a Seaborn heatmap from my Jupyter Notebook to illustrate:

[screenshot: Seaborn heatmap of missing values]

Is it safe to assume some or all of the missing values in df.location and df.target are in df.null? If so, how do I move these values into the appropriate column based on whether they're the lat-lon strings or target floats?

A: Handling with fillna() and str.contains()

Here's my best answer so far — let me know what you think. Basically I just used fillna(value=df.null) to fill all the missing values in df.location and df.target:

df.target.fillna(
    value=df.null,
    inplace=True
)

df.location.fillna(
    value=df.null,
    inplace=True
)

Then I used regex to boolean filter through df.target and df.location and set all the inappropriate values to np.nan:

import numpy as np

# Converting columns to type str so string methods work
df = df.astype(str)

# Using regex to set values that don't belong in the column to NaN
regex = '[,]'
df.loc[df.target.str.contains(regex), 'target'] = np.nan

regex = r'^\d\.?0?$'
df.loc[df.location.str.contains(regex), 'location'] = np.nan

# Returning `df.target` to float datatype (str is the correct
# datatype for `df.location`)
df.target = df.target.astype(float)

Is there a better way to do this?

Edit: Changed fillna() cell code so that it works.


1 Answer

Is it safe to assume some or all of the missing values in df.location and df.target are in df.null?

It depends on the initial data: if there are too many rows to check by hand, you can't know for certain. You can check the dataframe after the conversion, but you still won't be sure.
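One quick check before any conversion is to look for rows where null and one of the other two columns are simultaneously non-missing; any such row would falsify the assumption (an empty result still proves nothing, as noted above). A minimal sketch on the example frame:

```python
import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'target': [nan] * 5 + [4.0, 5.0, 4.0, 4.0, 4.0],
    'location': [nan] * 5 + ['41.69230795,-72.26691314', '41.70631764,-70.2868794',
                             '41.70687995,-70.28684036', '41.70598417,-70.28671793',
                             '41.69220757,-70.26687248'],
    'null': ['41.70477575,-70.28844073', '2', '41.70637091,-70.28704334', '4', '3'] + [nan] * 5,
})

# Rows where `null` overlaps with a non-missing `target` or `location`;
# on the example frame this is empty
overlap = df[df.null.notna() & (df.target.notna() | df.location.notna())]
print(len(overlap))
```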

With the new use of fillna(value=) (thanks for this, I hadn't understood it well before), I found a much quicker way to write it:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
).assign(
    target=lambda x: x.target.fillna(value=pd.to_numeric(x.null, errors='coerce')),
    location=lambda x: x.location.fillna(
        value=x.loc[pd.to_numeric(x.null, errors='coerce').isnull(), 'null']
    )
).drop('null', axis='columns')

The previous code gives the following dataframe:

                   location  target
0  41.70477575,-70.28844073     NaN
1                       NaN     2.0
2  41.70637091,-70.28704334     NaN
3                       NaN     4.0
4                       NaN     3.0
5  41.69230795,-72.26691314     4.0
6   41.70631764,-70.2868794     5.0
7  41.70687995,-70.28684036     4.0
8  41.70598417,-70.28671793     4.0
9  41.69220757,-70.26687248     4.0

You can sanity-check that no values from null landed in the wrong column by checking:

  • for target values higher than 5 (if there is one, the hypothesis is false; if not, it is still not certain :-))
  • the number of commas in the location column.
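The two checks above could be sketched like this on the cleaned frame (column names as in the answer; the 1-5 range comes from the question's description of target):

```python
import pandas as pd
from numpy import nan

# cleaned frame as produced by the assign/drop pipeline above
df = pd.DataFrame({
    'location': ['41.70477575,-70.28844073', nan, '41.70637091,-70.28704334', nan, nan,
                 '41.69230795,-72.26691314', '41.70631764,-70.2868794',
                 '41.70687995,-70.28684036', '41.70598417,-70.28671793',
                 '41.69220757,-70.26687248'],
    'target': [nan, 2.0, nan, 4.0, 3.0, 4.0, 5.0, 4.0, 4.0, 4.0],
})

# 1. no target outside the expected 1-5 range
in_range = df.target.dropna().between(1, 5).all()

# 2. every non-missing location has exactly one comma
one_comma = df.location.dropna().str.count(',').eq(1).all()

print(in_range, one_comma)
```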

I leave the older version, which gives the same result, below.

Previous version

For the conversion, here is something without regex:

import pandas as pd
from numpy import nan

df = pd.DataFrame(
      {'target': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: 4.0, 6: 5.0, 7: 4.0, 8: 4.0, 9: 4.0},
       'location': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: '41.69230795,-72.26691314', 6: '41.70631764,-70.2868794', 7: '41.70687995,-70.28684036', 8: '41.70598417,-70.28671793', 9: '41.69220757,-70.26687248'},
       'null': {0: '41.70477575,-70.28844073', 1: '2', 2: '41.70637091,-70.28704334', 3: '4', 4: '3', 5: nan, 6: nan, 7: nan, 8: nan, 9: nan}
      }
).assign(
    # use the conversion to numeric of the null column in order to find values
    # going to target and to location
    new_target=lambda x: pd.to_numeric(x['null'], errors='coerce'),
    new_location=lambda x: x.loc[pd.to_numeric(x['null'], errors='coerce').isnull(), 'null'],
).assign(
    target_without_nan=lambda x: x.new_target.fillna(0.0),
    new_location=lambda x: x.new_location.fillna(''),
    target=lambda x: (x.target_without_nan + x.target.fillna(0.0)).loc[~(x.target.isnull() & x.new_target.isnull())],
    location=lambda x: x.location.fillna('').str.cat(x.new_location.astype(str)).replace('', nan)
).loc[:, ['location', 'target']]

I use the trick from this answer of summing and concatenating to replace nan values in the initial columns. I also preserve nan values that cannot be replaced, via the .loc in the last assign of target.
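As a small illustration of that trick on toy Series (not the question's data): filling nan with the additive identity (0.0 for floats, '' for strings) lets + and str.cat merge two partially-missing columns:

```python
import pandas as pd
from numpy import nan

a = pd.Series([1.0, nan, 3.0])
b = pd.Series([nan, 2.0, nan])

# numeric merge: nan -> 0.0, then element-wise sum
merged_num = a.fillna(0.0) + b.fillna(0.0)

s = pd.Series(['x', nan, 'z'])
t = pd.Series([nan, 'y', nan])

# string merge: nan -> '', concatenate, then restore nan for empty results
merged_str = s.fillna('').str.cat(t.fillna('')).replace('', nan)

print(merged_num.tolist())  # [1.0, 2.0, 3.0]
print(merged_str.tolist())  # ['x', 'y', 'z']
```

This only works when at most one of the two columns is non-missing per row; otherwise the sum or concatenation mixes both values, which is why the answer keeps the .loc guard on target.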

  • Thanks for this answer! Gives me a lot to play with; I haven't used .assign() before. I edited the fillna() cell in the OP so that it works. I've had issues in the past filling nan in dfs or in for loops and thought I had gotten around it when I tested this, but it slipped through. – semblable Sep 10 '20 at 14:22
  • I edited my answer because I found a shorter and nicer way to do the same job. I leave the code of the old answer. – ndclt Sep 10 '20 at 17:58