
How can I ignore empty cells in a data frame when they are randomly spread across it? Here is an example of what I have:

 ColA    ColB    ColC    ColD    ColF    ColG    ColH
  A               b                                D
          W                      R                 D
  J                        H              T 
  Q               A                                O

There are always three entries in each row, but the empty cells are random. I looked here, but it was not helpful because of the randomness in my data.

The output I am expecting is:

 ColA_New    ColB_New    ColC_New
  A           b           D
  W           R           D
  J           H           T
  Q           A           O
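For reference, the example above can be reproduced with a small frame like this (a sketch: column names as in the question, blanks assumed to be NaN):

```python
import numpy as np
import pandas as pd

# Reproduces the frame from the question; blank cells are assumed to be NaN
df = pd.DataFrame({
    'ColA': ['A', np.nan, 'J', 'Q'],
    'ColB': [np.nan, 'W', np.nan, np.nan],
    'ColC': ['b', np.nan, np.nan, 'A'],
    'ColD': [np.nan, np.nan, 'H', np.nan],
    'ColF': [np.nan, 'R', np.nan, np.nan],
    'ColG': [np.nan, np.nan, 'T', np.nan],
    'ColH': ['D', 'D', np.nan, 'O'],
})

# Every row carries exactly three real values
print(df.notnull().sum(axis=1).tolist())  # [3, 3, 3, 3]
```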
owise
  • There are a lot of repetitions of this question, but each one has a different heading. – Bharath M Shetty Oct 20 '17 at 10:51
  • Are those blank spaces NaNs? – Bharath M Shetty Oct 20 '17 at 10:52
  • I replaced them with NaNs to see if I could solve it using the answer I referred to, but it did not work because of the randomness. DF.dropna is not helpful here either, because I would either drop a whole column or keep it whole; I cannot just extract the non-empty cells from each column. – owise Oct 20 '17 at 10:55
  • Is my solution what you're looking for? – Bharath M Shetty Oct 20 '17 at 10:56
  • @owise Is this what your CSV looks like? This step can possibly be solved when you load your data actually. See [here](https://stackoverflow.com/a/46847649/4909087). – cs95 Oct 20 '17 at 11:12

4 Answers


Similar to @cᴏʟᴅsᴘᴇᴇᴅ's solution:

In [185]: pd.DataFrame(df.stack().values.reshape(-1,3), 
                       columns='ColA_New ColB_New ColC_New'.split(), 
                       index=df.index)
Out[185]:
  ColA_New ColB_New ColC_New
0        A        b        D
1        W        R        D
2        J        H        T
3        Q        A        O
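A caveat worth noting (my addition, not part of the original answer): this relies on stack() dropping the NaNs, so each row contributes exactly its three real values in column order. If the blanks are empty strings, convert them to NaN first; an explicit .dropna() also guards against newer pandas versions where stack() keeps NaNs. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Small example with blanks stored as empty strings (assumed layout)
df = pd.DataFrame([['A', '', 'b', 'D'],
                   ['', 'W', 'R', 'D']],
                  columns=['ColA', 'ColB', 'ColC', 'ColD'])

# Turn blanks into NaN, stack row-wise, and drop the NaNs explicitly
vals = df.replace('', np.nan).stack().dropna().values
out = pd.DataFrame(vals.reshape(-1, 3),
                   columns=['ColA_New', 'ColB_New', 'ColC_New'])
print(out.values.tolist())  # [['A', 'b', 'D'], ['W', 'R', 'D']]
```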
MaxU - stand with Ukraine

Assuming your blank spaces are NaNs, use df.notnull together with NumPy boolean indexing and reshape. Note that the column-naming logic below works for any number of columns.

print(df)
  ColA ColB ColC ColD ColF ColG ColH
0    A  NaN    b  NaN  NaN  NaN    D
1  NaN    W  NaN  NaN    R  NaN    D
2    J  NaN  NaN    H  NaN    T  NaN
3    Q  NaN    A  NaN  NaN  NaN    O

m = df.notnull()
x = m.sum(1).max()

df = pd.DataFrame(df.values[m].reshape(-1, x))\
           .rename(columns=lambda x: 'Col_' + chr(ord('A') + x))

print(df)
  Col_A Col_B Col_C
0     A     b     D
1     W     R     D
2     J     H     T
3     Q     A     O

If, instead, your dataframe contains blanks, then use df.ne, and the rest is the same.

m = df.ne('')
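Put together, the blank-string variant might look like this (a sketch with made-up data; blanks assumed to be empty strings):

```python
import pandas as pd

# Blank cells stored as empty strings rather than NaN (assumed layout)
df = pd.DataFrame([['A', '', 'b', 'D'],
                   ['', 'W', 'R', 'D']],
                  columns=['ColA', 'ColB', 'ColC', 'ColD'])

m = df.ne('')          # True wherever a cell holds a real value
x = m.sum(1).max()     # the widest row decides the output width

# Boolean-select the real values row by row and reshape
out = pd.DataFrame(df.values[m.values].reshape(-1, x)) \
        .rename(columns=lambda i: 'Col_' + chr(ord('A') + i))
print(out)
```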

If this is how your CSV file looks, then you don't need any of this. Instead, the problem can be nipped in the bud while loading your CSV, using pd.read_csv:

df = pd.read_csv('file.csv', skiprows=1, delim_whitespace=True)

If needed, you can rename your columns using rename as described above.
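As a sketch of that loading step, using an inline string that mimics the question's layout (note: newer pandas versions spell delim_whitespace=True as sep=r'\s+'):

```python
import io
import pandas as pd

# Mimics the whitespace-aligned file from the question; because every row
# holds exactly three values, splitting on runs of whitespace yields three
# columns no matter where the blanks fall
raw = """\
ColA    ColB    ColC    ColD    ColF    ColG    ColH
A               b                                D
        W                      R                 D
J                        H              T
Q               A                                O
"""

df = pd.read_csv(io.StringIO(raw), skiprows=1, sep=r'\s+',
                 names=['ColA_New', 'ColB_New', 'ColC_New'])
print(df.values.tolist())
# [['A', 'b', 'D'], ['W', 'R', 'D'], ['J', 'H', 'T'], ['Q', 'A', 'O']]
```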

cs95

Another option: sort each row so the NaNs move to the end (sorted is stable, and key=pd.isnull orders non-null values first), then drop the all-NaN columns:

ndf = df.replace('', np.nan).apply(sorted, key=pd.isnull, axis=1).add_suffix('_NEW')
ndf = ndf.loc[:, ~ndf.isnull().all()]
 ColA_NEW ColB_NEW ColC_NEW
0        A        b        D
1        W        R        D
2        J        H        T
3        Q        A        O
Bharath M Shetty

Solution with dropna:

df1 = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
df1.columns = df.columns[:len(df1.columns)]
print (df1)
  ColA ColB ColC
0    A    b    D
1    W    R    D
2    J    H    T
3    Q    A    O
jezrael
    Also, `df1 = pd.DataFrame(df.apply(lambda x: x.dropna().tolist(), axis=1).tolist())` – Zero Oct 20 '17 at 13:10