0
tweet_id image_url doggo floofer puppa puppo
First row none none none none
Second row doggo none none none
third row none floofer none none
fourth row none none puppa none
fifth row none none none puppo

I have a situation trying to melt the above df columns [doggo,floofer,puppa,puppo] headers which ought to be values. I want to be able to retain the rows count and also have a new_column as table below.

tweet_id image_url breed
First row none
Second row doggo
third row floofer
fourth row puppa
fifth row puppo

I already tried the pd.melt() but it doesn't return desired result. Thank you

2 Answers2

1

For keeping the NaN value in a group if all null in a group

I assume your none is string.

Method 1

first_valid_index

df = df.replace('none', np.nan)
df = df.set_index(['tweet_id','image_url'])

output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row     None
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo




Method 2

stack and first

cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df = df.set_index(['tweet_id','image_url'])

output = df.stack().replace('none', np.nan).groupby(level=[0,1]).first().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row     None
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo




Method 3

melt, groupby().apply()

cat_type = pd.api.types.CategoricalDtype(categories=df['tweet_id'], ordered=True)
df['tweet_id'] = df['tweet_id'].astype(cat_type)
df.replace('none', np.nan, inplace=True)

df_melt = df.melt(id_vars=['tweet_id','image_url'], value_vars=['doggo','floofer','puppa','puppo'], var_name='breed')

output = df_melt.groupby(['tweet_id','image_url']).apply(lambda x: np.nan if x['value'].isnull().all() else x['value'].dropna().unique())
output = output.explode().reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###
  tweet_id image_url    breed
0    First       row      NaN
1   Second       row    doggo
2    third       row  floofer
3   fourth       row    puppa
4    fifth       row    puppo




Discussion

I used the csv file to process,

df_csv = pd.read_csv('twitter-archive-enhanced.csv')
df_csv = df_csv[['tweet_id', 'source', 'doggo', 'floofer', 'pupper', 'puppo']]
df_csv = df_csv.replace('None', np.nan)

df = df_csv.set_index(['tweet_id','source']).copy()
output = df.apply(pd.Series.first_valid_index, axis=1).reset_index()
output.columns = ['tweet_id','image_url', 'breed']
output
###

enter image description here

Baron Legendre
  • 2,053
  • 3
  • 5
  • 22
  • Thank you for the reply, I actually did try it and even tried troubleshooting in a way. I keep getting _**TypeError: NDFrame.first_valid_index() takes 1 positional argument but 2 were given**_. A possible thing I tried was adding () to the pd.Series.first_valid_index, but gets **_AttributeError: 'Cassie' is not a valid function for 'DataFrame' object_** which is one of the row data. – Yusuf Abdulganiyu Sep 02 '22 at 09:09
  • As you can see, `pd.Series.first_valid_index` didn't take any argument here; How about you tryna using **Method 2**? – Baron Legendre Sep 02 '22 at 09:29
  • 1
    get's this : _**"None of ['tweet_id'] are in the columns"**_ find attached for previewing. _[link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archive-enhanced/twitter-archive-enhanced.csv)_ – Yusuf Abdulganiyu Sep 02 '22 at 10:41
  • What is the data type of your `none`? – Baron Legendre Sep 02 '22 at 10:41
  • According to the csv file, string **None** with capital **N**, you might need to modify as `df.replace('None', np.nan, inplace=True)` – Baron Legendre Sep 02 '22 at 10:45
  • I used your CSV file to process the workflow, and it didn't stick anywhere. – Baron Legendre Sep 02 '22 at 11:00
  • 1
    Wow, this is great. Sincerely appreciate! Thanks – Yusuf Abdulganiyu Sep 02 '22 at 11:26
0

If you are not familiar with pandas function, you can use a nested loop to get the same result too.

df = df.replace('none', np.nan)
df = df.set_index(['tweet_id','image_url'])

result = []
for row in df.to_numpy():
    if pd.isna(row).all():
        result.append(np.nan)
    for breed in row:
        if pd.notna(breed):
            result.append(breed)
            break
            
[nan, 'doggo', 'floofer', 'puppa', 'puppo']

df['breed'] = result

df.reset_index(inplace=True)

  tweet_id image_url  doggo  floofer  puppa  puppo    breed
0    First       row    NaN      NaN    NaN    NaN      NaN
1   Second       row  doggo      NaN    NaN    NaN    doggo
2    third       row    NaN  floofer    NaN    NaN  floofer
3   fourth       row    NaN      NaN  puppa    NaN    puppa
4    fifth       row    NaN      NaN    NaN  puppo    puppo
  • Thanks for the feedback, I tried it but got **_NameError: name 'nan' is not defined_**, as a result of [nan, 'doggo', 'floofer', 'puppa', 'puppo']. – Yusuf Abdulganiyu Sep 02 '22 at 09:27
  • It sounds like you are assigning it as column name, I think you can update your question even though it has been closed. –  Sep 02 '22 at 10:03
  • And you need numpy.nan when it comes to "nan" –  Sep 02 '22 at 10:03