0

I pulled my data from a very poorly formatted csv file and am trying to do some cleaning. Right now I have values in one column that incorrectly correspond to another and I need to shift the values in only one column so they correctly correspond.

My data looks roughly like this:

df = 
   ref  name  address
1  1.2  name1  
2  1.2        address1
3  1.3  name2
4  1.3  name2
5  1.3        address2
6  1.3        address2 
7  1.4  name3
8  1.4  name3
9  1.5  name4
10 1.5        address4

And continuing like so with sometimes two address or three or four for about another thousand rows.

Ideally, I'd like to make it so my data looks like this:

df = 
   ref  name  address
1  1.2  name1 address1
3  1.3  name2 address2
4  1.3  name2 address2
5  1.4  name3
6  1.4  name3
7  1.5  name4 address4 

with the null rows removed and the addresses shifted up to properly correspond. Please let me know if you have you have any ideas on how to do this, it's been giving me a lot of trouble. Thanks in advance.

scotton16
  • 13
  • 2

2 Answers2

0

We can do groupby + shift NaN : PS you can check the justify by Divakar to speed up the whole process

pd.concat([ y.apply(lambda t : sorted(t,key=pd.notnull)).dropna(thresh=2) for x , y in df.groupby('ref') ])
Out[121]: 
    ref   name   address
2   1.2  name1  address1
5   1.3  name2  address2
6   1.3  name2  address2
7   1.4  name3       NaN
8   1.4  name3       NaN
10  1.5  name4  address4
BENY
  • 317,841
  • 20
  • 164
  • 234
0

Given the exact format of your data you could run:

data.set_index('ref').applymap(lambda x: np.nan if x=='' else x).apply(lamdba x: x.dropna())

This first replaces all empty cells with NaN, then takes each column one at a time and drops all NaN rows. These are then consolidated as a dataframe by matching on ref (which is now the index).

If your data already has the missing values as NaN then it's just

data.set_index('ref').apply(lamdba x: x.dropna().reset_index())
Myccha
  • 961
  • 1
  • 11
  • 20
  • Thanks so much for your help. The missing values are NaN, I've tried the second one but I'm getting a syntax error: ```df.set_index('ref').apply(lamdba x: x.dropna().reset_index()) df.set_index('ref').apply(lamdba x: x.dropna().reset_index()) ^ SyntaxError: invalid syntax ``` What am I doing wrong? – scotton16 Nov 08 '19 at 16:32