
How can I ignore empty cells in a data frame when they are randomly spread across it? Here is an example of what I have:

 ColA    ColB    ColC    ColD    ColF    ColG    ColH
  A               b                                D
          W                      R                 D
  J                        H              T 
  Q               A                                O

There are always three entries in each row, but the empty cells are random. I looked here, but it was not helpful because of the randomness in my data.

The output I am expecting is:

 ColA_New    ColB_New    ColC_New
  A           b           D
  W           R           D
  J           H           T
  Q           A           O
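For reference, the example above can be reproduced with a small frame like this (a sketch: column names as in the question, blanks assumed to be NaN):

```python
import numpy as np
import pandas as pd

# Reproduces the frame from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    'ColA': ['A', np.nan, 'J', 'Q'],
    'ColB': [np.nan, 'W', np.nan, np.nan],
    'ColC': ['b', np.nan, np.nan, 'A'],
    'ColD': [np.nan, np.nan, 'H', np.nan],
    'ColF': [np.nan, 'R', np.nan, np.nan],
    'ColG': [np.nan, np.nan, 'T', np.nan],
    'ColH': ['D', 'D', np.nan, 'O'],
})

# Every row carries exactly three real values
print(df.notnull().sum(axis=1).tolist())  # [3, 3, 3, 3]
```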
owise
  • There are a lot of repetitions of this question, but each one has a different heading. – Bharath M Shetty Oct 20 '17 at 10:51
  • Are those blank spaces NaNs? – Bharath M Shetty Oct 20 '17 at 10:52
  • I replaced them with NaNs to see if I could solve it using the answer I referred to, but it did not work because of the randomness. DF.dropna is not helpful here either, because I would either drop a whole column or keep it whole; I cannot just extract the non-empty cells from each column. – owise Oct 20 '17 at 10:55
  • Is my solution what you're looking for? – Bharath M Shetty Oct 20 '17 at 10:56
  • @owise Is this what your CSV looks like? This step can possibly be solved when you load your data actually. See [here](https://stackoverflow.com/a/46847649/4909087). – cs95 Oct 20 '17 at 11:12

4 Answers


Similar to @cᴏʟᴅsᴘᴇᴇᴅ's solution:

In [185]: pd.DataFrame(df.stack().values.reshape(-1,3), 
                       columns='ColA_New ColB_New ColC_New'.split(), 
                       index=df.index)
Out[185]:
  ColA_New ColB_New ColC_New
0        A        b        D
1        W        R        D
2        J        H        T
3        Q        A        O
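A caveat worth noting (my addition, not part of the original answer): this relies on stack() dropping the NaNs, so each row contributes exactly its three real values in column order. If the blanks are empty strings, convert them to NaN first; an explicit .dropna() also guards against newer pandas versions where stack() keeps NaNs. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Small example with blanks stored as empty strings (assumed layout)
df = pd.DataFrame([['A', '', 'b', 'D'],
                   ['', 'W', 'R', 'D']],
                  columns=['ColA', 'ColB', 'ColC', 'ColD'])

# Turn blanks into NaN, stack row-wise, and drop the NaNs explicitly
vals = df.replace('', np.nan).stack().dropna().values
out = pd.DataFrame(vals.reshape(-1, 3),
                   columns=['ColA_New', 'ColB_New', 'ColC_New'])
print(out.values.tolist())  # [['A', 'b', 'D'], ['W', 'R', 'D']]
```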
MaxU - stand with Ukraine

Assuming your blank spaces are NaNs, use df.notnull together with NumPy boolean indexing and reshape. Note that the column-naming logic below works for any number of columns.

print(df)
  ColA ColB ColC ColD ColF ColG ColH
0    A  NaN    b  NaN  NaN  NaN    D
1  NaN    W  NaN  NaN    R  NaN    D
2    J  NaN  NaN    H  NaN    T  NaN
3    Q  NaN    A  NaN  NaN  NaN    O

m = df.notnull()
x = m.sum(1).max()

df = pd.DataFrame(df.values[m].reshape(-1, x))\
           .rename(columns=lambda x: 'Col_' + chr(ord('A') + x))

print(df)
  Col_A Col_B Col_C
0     A     b     D
1     W     R     D
2     J     H     T
3     Q     A     O

If, instead, your dataframe contains blanks, then use df.ne, and the rest is the same.

m = df.ne('')
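Put together, the blank-string variant might look like this (a sketch with made-up data; blanks assumed to be empty strings):

```python
import pandas as pd

# Blank cells stored as empty strings rather than NaN (assumed layout)
df = pd.DataFrame([['A', '', 'b', 'D'],
                   ['', 'W', 'R', 'D']],
                  columns=['ColA', 'ColB', 'ColC', 'ColD'])

m = df.ne('')          # True wherever a cell holds a real value
x = m.sum(1).max()     # the widest row decides the output width

# Boolean-select the real values row by row and reshape
out = pd.DataFrame(df.values[m.values].reshape(-1, x)) \
        .rename(columns=lambda i: 'Col_' + chr(ord('A') + i))
print(out)
```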

If this is how your CSV file looks, then you don't need any of this. Instead, the problem can be nipped in the bud while loading your CSV, using pd.read_csv:

df = pd.read_csv('file.csv', skiprows=1, delim_whitespace=True)

If needed, you can rename your columns using rename as described above.
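As a sketch of that loading step, using an inline string that mimics the question's layout (note: newer pandas versions spell delim_whitespace=True as sep=r'\s+'):

```python
import io
import pandas as pd

# Mimics the whitespace-aligned file from the question; because every row
# holds exactly three values, splitting on runs of whitespace yields three
# columns no matter where the blanks fall
raw = """\
ColA    ColB    ColC    ColD    ColF    ColG    ColH
A               b                                D
        W                      R                 D
J                        H              T
Q               A                                O
"""

df = pd.read_csv(io.StringIO(raw), skiprows=1, sep=r'\s+',
                 names=['ColA_New', 'ColB_New', 'ColC_New'])
print(df.values.tolist())
# [['A', 'b', 'D'], ['W', 'R', 'D'], ['J', 'H', 'T'], ['Q', 'A', 'O']]
```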

cs95

Another option: sort each row so the NaNs move to the end (sorted is stable, and key=pd.isnull orders non-null values first), then drop the all-NaN columns:

ndf = df.replace('', np.nan).apply(sorted, key=pd.isnull, axis=1).add_suffix('_NEW')
ndf = ndf.loc[:, ~ndf.isnull().all()]
 ColA_NEW ColB_NEW ColC_NEW
0        A        b        D
1        W        R        D
2        J        H        T
3        Q        A        O
Bharath M Shetty

Solution with dropna:

df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df1.columns = df.columns[:len(df1.columns)]
print (df1)
  ColA ColB ColC
0    A    b    D
1    W    R    D
2    J    H    T
3    Q    A    O
jezrael
    Also, `df1 = pd.DataFrame(df.apply(lambda x: x.dropna().tolist(), axis=1).tolist())` – Zero Oct 20 '17 at 13:10