2

I have a data frame df like this

A        B        C        D        E        F        G        H
a.1      b.1     
                  
                  c.1      d.1 
                  c.2      d.2           e.1      f.1 
                                                      

                                                     g.1       h.1
  


Create the sample DataFrame

from io import StringIO

s = """A,B,C,D,E,F,G,H
a.1,b.1,,,,,,
,,c.1,d.1,,,,
,,c.2,d.2,e.1,f.1,,
,,,,,,g.1,h.1"""

df = pd.read_csv(StringIO(s))

I want to remove these extra spaces and I want dataframe to start from the top row. Can anyone help.

my desired results would be

A        B        C        D        E        F        G        H
a.1      b.1      c.1      d.1      e.1      f.1      g.1       h.1
                  c.2      d.2                                                   
sdave
  • 531
  • 4
  • 18
  • Use the first solution from this post - https://stackoverflow.com/questions/43119503/how-to-remove-blanks-nas-from-dataframe-and-shift-the-values-up – Arkadiusz Jun 09 '21 at 19:44

1 Answers1

2

You can shift back each column by the number of preceding missing values which is found with first_valid_index:

df.apply(lambda s: s.shift(-s.first_valid_index()))

to get

     A    B    C    D    E    F    G    H
0  a.1  b.1  c.1  d.1  e.1  f.1  g.1  h.1
1  NaN  NaN  c.2  d.2  NaN  NaN  NaN  NaN
2  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN

To drop the rows full of NaNs and fill the rest with empty string:

out = (df.apply(lambda s: s.shift(-s.first_valid_index()))
         .dropna(how="all")
         .fillna(""))

to get

>>> out

     A    B    C    D    E    F    G    H
0  a.1  b.1  c.1  d.1  e.1  f.1  g.1  h.1
1            c.2  d.2

note: this assumes your index is 0..N-1; so if it's not, you can store it beforehand and then restore back:

index = df.index
df = df.reset_index(drop=True)
df = (df.apply(lambda s: s.shift(-s.first_valid_index()))
        .dropna(how="all")
        .fillna(""))
df.index = index[:len(df)]

To make the pulling up specific to some columns:

def pull_up(s):
    # this will be a column number; `s.name` is the column name
    col_index = df.columns.get_indexer([s.name])

   # for example: if `col_index` is either 7 or 8, pull by 4
   if col_index in (7, 8):
       return s.shift(-4)
   else:
       # otherwise, pull as much
       return s.shift(-s.first_valid_index())

# applying
df.apply(pull_up)
Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • umm, sorry I missed to mention that this data is example data and in real data index for different columns is different. Basically, in column 3,4 I need to shift 1 tab up...5,6 will need 2 tabs up....7,8 needs to be pulled 3 tabs up and column 9,10 will need to be shifted 4 tabs – sdave Jun 10 '21 at 06:46
  • @sdave Yes in that case you can use the last snippet with `reset_index`; can you please try that? – Mustafa Aydın Jun 10 '21 at 06:47
  • yes, I will do it now, my question is instead of df (the whole dataframe) i can simply use df['c'] as well right to work only on the specefic columns? – sdave Jun 10 '21 at 06:52
  • @sdave Sure, it should be working that way. – Mustafa Aydın Jun 10 '21 at 06:55
  • it works flawlessly where the first_valid_index is column 0 or where we have first non-NaN value. ```df.iloc[:, 6:8]=df.iloc[:, 6:8].apply(lambda s: s.shift(-s.first_valid_index()))``` but for column 7 and 8 , I have multiple NaN values untill row 50 but I want to pull it only 4 rows up. Basically the column should begin with NaN values but the first 4 rows should be removed and data should be pulled above. Is there any way I can define what (row column combination) should be the first valid index? – sdave Jun 10 '21 at 07:24
  • @sdave I edited the answer as far as I understood, hope it helps. – Mustafa Aydın Jun 10 '21 at 07:32
  • It worked like a charm, now lastly i have complely empty rows in dataframe , I want to delete them only when all the coumns in the row are empty. dropna is not helping – sdave Jun 10 '21 at 08:06
  • @sdave `dropna` has parameter `how` so if you say `df.dropna(how="all")`, it will drop a row only if *all* the columns for that row are empty. – Mustafa Aydın Jun 10 '21 at 08:14
  • can you have a look at this one please https://stackoverflow.com/questions/67917573/replace-nan-with-sign-only-in-specefic-condition-python-pandas – sdave Jun 10 '21 at 08:34
  • can you please have a look at https://stackoverflow.com/questions/68020714/how-to-remove-extra-while-reading-a-pandas-dataframe – sdave Jun 17 '21 at 13:48
  • @sdave I saw that but don't know the answer, I hope someone else does. – Mustafa Aydın Jun 17 '21 at 13:50
  • Thanks, no problem :) – sdave Jun 17 '21 at 13:53