creating a new column from Merged columns matching a condition in a dataframe

Question

tweet_id	image_url	doggo	floofer	puppa	puppo
First	row	none	none	none	none
Second	row	doggo	none	none	none
third	row	none	floofer	none	none
fourth	row	none	none	puppa	none
fifth	row	none	none	none	puppo

I have a situation trying to melt the above df columns [doggo,floofer,puppa,puppo] headers which ought to be values. I want to be able to retain the rows count and also have a new_column as table below.

tweet_id	image_url	breed
First	row	none
Second	row	doggo
third	row	floofer
fourth	row	puppa
fifth	row	puppo

I already tried the pd.melt() but it doesn't return desired result. Thank you

See this: [melt() function duplicating dataset](https://stackoverflow.com/q/73521602/14627505) (it is the recent one) — Vladimir Fokow, Sep 01 '22 at 13:54
[and this](https://stackoverflow.com/q/61575865/14627505) (2 years ago) — Vladimir Fokow, Sep 01 '22 at 13:56

Baron Legendre · Answer 1 · 2022-09-02T10:58:18.773

For keeping the `NaN` value in a group if all null in a group

I assume your none is string.

Method 1

first_valid_index

df = df.replace('none', np.nan)
df = df.set_index(['tweet_id','image_url'])

output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row     None
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo

Method 2

stack and first

cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df = df.set_index(['tweet_id','image_url'])

output = df.stack().replace('none', np.nan).groupby(level=[0,1]).first().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row     None
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo

Method 3

melt, groupby().apply()

cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df.replace('none', np.nan, inplace=True)

df_melt = df.melt(id_vars=['tweet_id','image_url'], value_vars=['doggo','floofer','puppa','puppo'], var_name='breed')

output = df_melt.groupby(['tweet_id','image_url']).apply(lambda x: np.nan if x['value'].isnull().all() else x['value'].dropna().unique())
output = output.explode().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row      NaN
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo

Discussion

I used the csv file to process,

df_csv = pd.read_csv('twitter-archive-enhanced.csv')
df_csv = df_csv[['tweet_id', 'source', 'doggo', 'floofer', 'pupper', 'puppo']]
df_csv = df_csv.replace('None', np.nan)

df = df_csv.set_index(['tweet_id','source']).copy()
output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###

Thank you for the reply, I actually did try it and even tried troubleshooting in a way. I keep getting _**TypeError: NDFrame.first_valid_index() takes 1 positional argument but 2 were given**_. A possible thing I tried was adding () to the pd.Series.first_valid_index, but gets **_AttributeError: 'Cassie' is not a valid function for 'DataFrame' object_** which is one of the row data. — Yusuf Abdulganiyu, Sep 02 '22 at 09:09
As you can see, `pd.Series.first_valid_index` didn't take any argument here; How about you tryna using **Method 2**? — Baron Legendre, Sep 02 '22 at 09:29
get's this : _**"None of ['tweet_id'] are in the columns"**_ find attached for previewing. _[link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)_ — Yusuf Abdulganiyu, Sep 02 '22 at 10:41
According to the csv file, string **None** with capital **N**, you might need to modify as `df.replace('None', np.nan, inplace=True)` — Baron Legendre, Sep 02 '22 at 10:45
I used your CSV file to process the workflow, and it didn't stick anywhere. — Baron Legendre, Sep 02 '22 at 11:00

score 0 · Answer 2 · answered Sep 01 '22 at 16:50

If you are not familiar with pandas function, you can use a nested loop to get the same result too.

df = df.replace('none', np.nan)
df = df.set_index(['tweet_id','image_url'])

result = []
for row in df.to_numpy():
    if pd.isna(row).all():
        result.append(np.nan)
    for breed in row:
        if pd.notna(breed):
            result.append(breed)
            break
            
[nan, 'doggo', 'floofer', 'puppa', 'puppo']

df['breed'] = result

df.reset_index(inplace=True)

  tweet_id image_url  doggo  floofer  puppa  puppo    breed
0    First       row    NaN      NaN    NaN    NaN      NaN
1   Second       row  doggo      NaN    NaN    NaN    doggo
2    third       row    NaN  floofer    NaN    NaN  floofer
3   fourth       row    NaN      NaN  puppa    NaN    puppa
4    fifth       row    NaN      NaN    NaN  puppo    puppo

Thanks for the feedback, I tried it but got **_NameError: name 'nan' is not defined_**, as a result of [nan, 'doggo', 'floofer', 'puppa', 'puppo']. — Yusuf Abdulganiyu, Sep 02 '22 at 09:27
It sounds like you are assigning it as column name, I think you can update your question even though it has been closed. — , Sep 02 '22 at 10:03