The methods below keep a null value for a group when every value in that group is null. I assume the none in your data is the string 'none'.
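As a quick check of that assumption, the sketch below uses a small hypothetical frame (the values and the names `cleaned` and `all_null` are made up for illustration). It shows that pandas does not treat the string 'none' as missing until it is replaced with np.nan.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: each stage column holds its own name or the string 'none'
df = pd.DataFrame({
    'doggo': ['none', 'doggo', 'none'],
    'puppo': ['none', 'none', 'puppo'],
})

# The string 'none' is an ordinary value, so nothing counts as missing yet
has_missing_before = bool(df.isna().any().any())

# After the replacement, rows made up only of 'none' are entirely null
cleaned = df.replace('none', np.nan)
all_null = cleaned.isna().all(axis=1)
```

Here only the first row ends up entirely null, which is the case the methods below are meant to preserve.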
Method 1: first_valid_index
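import numpy as np
import pandas as pd
# Turn the 'none' strings into real missing values
df = df.replace('none', np.nan)
df = df.set_index(['tweet_id','image_url'])
# For each row, take the label of the first column that holds a non-null value
output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output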
###
tweet_id image_url breed
0 First row None
1 Second row doggo
2 third row floofer
3 fourth row puppa
4 fifth row puppo
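A self-contained sketch of the same idea, in case you want to try it without the original data. The sample frame and the names `wide`, `breed` and `result` are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the first row has no stage at all
df = pd.DataFrame({
    'tweet_id': [0, 1, 2],
    'image_url': ['a', 'b', 'c'],
    'doggo': ['none', 'doggo', 'none'],
    'floofer': ['none', 'none', 'floofer'],
})

# Replace the placeholder string, then move the id columns into the index
wide = df.replace('none', np.nan).set_index(['tweet_id', 'image_url'])

# first_valid_index returns the column label of the first non-null value
# in each row, or a null value when the whole row is null
breed = wide.apply(pd.Series.first_valid_index, axis=1)
result = breed.reset_index(name='breed')
```

Note that this returns the column label, not the cell value, so it relies on each column being named after the stage it marks.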
Method 2: stack and first
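# Make tweet_id an ordered categorical so that groupby keeps the original row order
cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df = df.set_index(['tweet_id','image_url'])
# Stack the stage columns into rows, then take the first non-null value per tweet
output = df.stack().replace('none', np.nan).groupby(level=[0,1]).first().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output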
###
tweet_id image_url breed
0 First row None
1 Second row doggo
2 third row floofer
3 fourth row puppa
4 fifth row puppo
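The stack-and-first approach can also be sketched on a small hypothetical frame. The data and the names `stacked` and `result` are assumptions. This version passes sort=False to groupby to keep the row order, instead of the categorical conversion used above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the first row has no stage at all
df = pd.DataFrame({
    'tweet_id': [0, 1, 2],
    'image_url': ['a', 'b', 'c'],
    'doggo': ['none', 'doggo', 'none'],
    'floofer': ['none', 'none', 'floofer'],
})

# Stack the stage columns into one long Series indexed by
# (tweet_id, image_url, column name), then mark 'none' as missing
stacked = df.set_index(['tweet_id', 'image_url']).stack()
stacked = stacked.replace('none', np.nan)

# first() skips nulls, so an all-null group yields a null result
result = (
    stacked.groupby(level=[0, 1], sort=False)
    .first()
    .reset_index(name='breed')
)
```

Stacking before the replacement matters here: the 'none' strings survive the stack as ordinary values, so every tweet still has a group even when it has no stage.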
Method 3: melt, groupby() and .apply()
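# Make tweet_id an ordered categorical so that groupby keeps the original row order
cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df.replace('none', np.nan, inplace=True)
# Reshape to long form: one row per (tweet, stage column), cell values in 'value'
df_melt = df.melt(id_vars=['tweet_id','image_url'], value_vars=['doggo','floofer','puppa','puppo'], var_name='breed')
# Per tweet: NaN if every value is null, otherwise the distinct non-null values
output = df_melt.groupby(['tweet_id','image_url']).apply(lambda x: np.nan if x['value'].isnull().all() else x['value'].dropna().unique())
# explode() gives each value its own row when a tweet has more than one
output = output.explode().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output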
###
tweet_id image_url breed
0 First row NaN
1 Second row doggo
2 third row floofer
3 fourth row puppa
4 fifth row puppo
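Unlike the first two methods, the melt approach can keep several stages for one tweet. The sketch below illustrates that on hypothetical data. It is a swapped-in variant, not the code above: it replaces groupby().apply() and explode() with dropna() and a left merge, which should give the same shape of result. The names `ids`, `long`, `found` and `result` are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: tweet 0 has no stage, tweet 2 has two
df = pd.DataFrame({
    'tweet_id': [0, 1, 2],
    'image_url': ['a', 'b', 'c'],
    'doggo': ['none', 'doggo', 'doggo'],
    'puppo': ['none', 'none', 'puppo'],
})
ids = ['tweet_id', 'image_url']

# Long form: one row per (tweet, stage column)
long = df.melt(id_vars=ids, var_name='stage', value_name='breed')
long['breed'] = long['breed'].replace('none', np.nan)

# Keep only the stages that are actually present
found = long.dropna(subset=['breed'])[ids + ['breed']]

# The left merge brings back tweets with no stage, with a null breed
result = df[ids].merge(found, on=ids, how='left')
```

The result has one row per stage found, plus one row with a null breed for each tweet that has none.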
Discussion
I processed the CSV file with Method 1 like this:
df_csv = pd.read_csv('twitter-archive-enhanced.csv')
df_csv = df_csv[['tweet_id', 'source', 'doggo', 'floofer', 'pupper', 'puppo']]
df_csv = df_csv.replace('None', np.nan)
df = df_csv.set_index(['tweet_id','source']).copy()
output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
# The second column actually holds 'source'; it is labelled image_url to match the outputs above
output.columns = ['tweet_id','image_url', 'breed']
output
###
