14

Right now I have a DF like this

 Word       Word2          Word3
 Hello      NaN            NaN
 My         My Name        NaN
 Yellow     Yellow Bee     Yellow Bee Hive
 Golden     Golden Gates   NaN
 Yellow     NaN            NaN

What I was hoping for was to remove all of the NaN cells from my data frame. So in the end, it would look like this, where 'Yellow Bee Hive' has moved to row 1 (similarly to what happens when you delete cells from a column in excel) :

   Word       Word2             Word3
1  Hello      My Name        Yellow Bee Hive
2  My         Yellow Bee       
3  Yellow     Golden Gates             
4  Golden       
5  Yellow    

Unfortunately, neither of these work because they delete the Entire ROW!

 df = df[pd.notnull(df['Word','Word2','Word3'])]

or

 df = df.dropna() 

Anyone have any suggestions? Should I reindex the table?

user3682157
  • 1,625
  • 8
  • 29
  • 55
  • 1
    Do you want the non-null values "rolled up" to the start? – DSM Sep 19 '14 at 20:49
  • if I understand you correctly, I would like the non-null values to "move up" after the NaN cells are removed. So 'Yellow Bee Hive' would move from row 3 to row 1 – user3682157 Sep 19 '14 at 20:50
  • 1
    Unfortunately it's partly my fault there isn't a good built-in way to do this, because we fell into discussions over what the call signature should be.. see [GH #6713](https://github.com/pydata/pandas/issues/6713) for some of the history. :-/ – DSM Sep 19 '14 at 21:04

3 Answers3

4

I think you can use this:

df = df.apply(lambda x: pd.Series(x.dropna().values))

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Word':['Hello', 'My', 'Yellow', 'Golden', 'Yellow'],
    'Word2':[np.nan, 'My Name', 'Yellow Bee', 'Golden Gates', np.nan],
    'Word3':[np.nan, np.nan, 'Yellow Bee Hive', np.nan, np.nan]
})

print(df)

Initial dataframe:

     Word         Word2            Word3
0   Hello           NaN              NaN
1      My       My Name              NaN
2  Yellow    Yellow Bee  Yellow Bee Hive
3  Golden  Golden Gates              NaN
4  Yellow           NaN              NaN

and applying this lambda function:

df = df.apply(lambda x: pd.Series(x.dropna().values))

print(df)

gives:

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee              NaN
2  Yellow  Golden Gates              NaN
3  Golden           NaN              NaN
4  Yellow           NaN              NaN

Then you can fill NaN values with empty strings:

df = df.fillna('')

print(df)

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee                 
2  Yellow  Golden Gates                 
3  Golden                               
4  Yellow    
Serhii
  • 149
  • 1
  • 3
2
import numpy as np
import pandas as pd
import functools

def drop_and_roll(col, na_position='last', fillvalue=np.nan):
    result = np.full(len(col), fillvalue, dtype=col.dtype)
    mask = col.notnull()
    N = mask.sum()
    if na_position == 'last':
        result[:N] = col.loc[mask]
    elif na_position == 'first':
        result[-N:] = col.loc[mask]
    else:
        raise ValueError('na_position {!r} unrecognized'.format(na_position))
    return result

df = pd.read_table('data', sep='\s{2,}')

print(df.apply(functools.partial(drop_and_roll, fillvalue='')))

yields

     Word         Word2            Word3
0   Hello       My Name  Yellow Bee Hive
1      My    Yellow Bee                 
2  Yellow  Golden Gates                 
3  Golden                               
4  Yellow     
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Minor: `notnull` is also a method of DataFrames. – DSM Sep 19 '14 at 20:44
  • 1
    Why not use `fillna`? – Shashank Agarwal Sep 19 '14 at 20:47
  • @ShashankAgarwal: That's a better solution. Would you like to write it as an answer? – unutbu Sep 19 '14 at 20:48
  • So I may have been somewhat unclear with my ask and have made edits regarding what I was hoping the final product would be -- I wanted to basically remove all the NaN and move up the cells, sort of like when you delete empty cells from a column in excel. – user3682157 Sep 19 '14 at 20:49
  • Hmm, this is still deleting some of my data -- namely the phrases in the 'Word' column that have no instances in Word2 or Word3 – user3682157 Sep 19 '14 at 21:23
  • @user3682157: Perhaps try the new solution I've posted. It uses a completely different approach. – unutbu Sep 19 '14 at 21:42
2

Since you want the values to move up, you'll have to create a new data frame

Started with -

     Word         Word2
0   Hello           NaN
1      My       My Name
2  Yellow    Yellow Bee
3  Golden  Golden Gates
4  Yellow           NaN

Used following method -

def get_column_array(df, column):
    expected_length = len(df)
    current_array = df[column].dropna().values
    if len(current_array) < expected_length:
        current_array = np.append(current_array, [''] * (expected_length - len(current_array)))
    return current_array

pd.DataFrame({column: get_column_array(df, column) for column in df.columns}

Gives -

     Word         Word2
0   Hello       My Name
1      My    Yellow Bee
2  Yellow  Golden Gates
3  Golden              
4  Yellow              

You can also edit the existing df with the same function -

for column in df.columns:
    df[column] = get_column_array(df, column)
Shashank Agarwal
  • 2,769
  • 1
  • 22
  • 24