Why is only one dataframe formatted correctly?

Question

I'm working on a personal project and came across a result that I didn't understand. My aim was to split my list-type column into individual columns (each column holding one element of the list), and I was able to do that successfully. However, one way of implementing it doesn't give the result I want, despite the code being exactly(?) the same. I have two files, football_tweets.py and analyseFiles.py.

This is my code for football_tweets.py:

class TweetAnalyser():

    #data = []

    def createDataFrame(self, tweets):
        ## this function creates the dataframe

        df = pd.DataFrame(data=[tweet.full_text for tweet in tweets], columns=['tweets'])

        df['id'] = np.array([tweet.id for tweet in tweets])
        df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])
        df['likes'] = np.array([tweet.favorite_count for tweet in tweets])
        df['created_at'] = np.array([tweet.created_at for tweet in tweets])
        # call the helper methods via self rather than the module-level tweet_analyser instance
        df['emoji_code'] = np.array([self.check_emoji(tweet) for tweet in df['tweets']])
        df['tweet_sentiment'] = np.array([self.analyse_sentiment(tweet) for tweet in df['tweets']])
        return df
         
    def check_emoji(self, tweet):
        #function to convert emoji symbol/char into its unicode
        
        ##translate and check emoji here
        emoji_list = []
        data = regex.findall(r'\X', tweet)
        senti_df = pd.read_csv('new_sentiment_data.csv')
        for word in data:
            if any(char in emoji.UNICODE_EMOJI for char in word):
                # translate the emoji to its Unicode code point
                ## append the code point to the list
                try:
                    uni_code = f'U+{ord(word):X}' 
                    emoji_list.append(uni_code)

                except TypeError:
                    pass

        return emoji_list

(there are more functions, but these are the only ones necessary for the question)

I ran the code as follows:

if __name__ == '__main__': 
  
    twitter_client = TwitterClient('Arsenal')
    tweet_analyser = TweetAnalyser()

    api = twitter_client.get_twitter_client_api()

    tweets = twitter_client.get_user_tweets(1212442388981002240, 1236413003127566337)
    
    df = tweet_analyser.createDataFrame(tweets)

    df.to_csv('tweet_file.csv')
    new_df = pd.DataFrame(df.emoji_code.values.tolist()).add_prefix('emoji_')
    print(new_df)

and I received the EXPECTED result:

     emoji_0  emoji_1  emoji_2 emoji_3 emoji_4 emoji_5
0    U+1F60D     None     None    None    None    None
1    U+1F3B6  U+1F4A7     None    None    None    None
2    U+1F4AC  U+1F454  U+1F447    None    None    None
3    U+1F3C6     None     None    None    None    None
4    U+1F602  U+1F454     None    None    None    None
..       ...      ...      ...     ...     ...     ...
373   U+270A     None     None    None    None    None

I then tried this same solution in a separate file, analyseFiles.py as follows and received this result after printing:

def analyse_emoji():

    df = pd.read_csv('tweet_file.csv')
    senti_df = pd.read_csv('new_sentiment_data.csv')
    new_df = pd.DataFrame(df.emoji_code.values.tolist()).add_prefix('emoji_')
    print(new_df)

                               emoji_0
0                          ['U+1F60D']
1               ['U+1F3B6', 'U+1F4A7']
2    ['U+1F4AC', 'U+1F454', 'U+1F447']
3                          ['U+1F3C6']
4               ['U+1F602', 'U+1F454']
..                                 ...
373                         ['U+270A']

Why did the second implementation not give me the expected result despite the code being the same? Is there a concept that I need to learn/brush up on? tweet_file.csv is where I have stored the dataframe and I'm calling it in the second solution rather than the first, where I create it. Is that where the problem occurs?

***edit ***

print(df) from football_tweets.py:

tweets  ...  tweet_sentiment
0    The crucial moment.\n\n @LacazetteAlex\n\n#AR...  ...         0.000000
1     "...so fresh, so clean..."\n\n#ARSWHU https...  ...         0.333333
2     "I'm really happy with the result because bi...  ...         0.600000
3    Your man of the match today...\n\n @Bernd_Len...  ...         0.000000
4    Just another day on the touchline \n\n @m8ar...  ...         0.000000
..                                                 ...  ...              ...
369                Let's keep this going! ✊\n\n#ARSMUN  ...         0.000000

print(df) from analyseFiles.py:

0             0  ...        0.000000
1             1  ...        0.333333
2             2  ...        0.600000
3             3  ...        0.000000
4             4  ...        0.000000
..          ...  ...             ...
373         373  ...        0.000000

This may be where the problem occurs.

@SaiSreenivas emoji_code is a list-type column of the tweet_file dataframe, containing emoji's unicode codepoints. — bailslearnsstuff, Jul 21 '20 at 15:31
What are the elements in `new_df`? Are they actually lists of strings, or just strings? The `pandas` print doesn't distinguish. Look at the source `csv`. The csv format is inherently 2d, so list or array elements are stored as their print string and loaded without any parsing. — hpaulj, Jul 21 '20 at 15:58

score 1 · Answer 1 · answered Jul 21 '20 at 16:19

What I was trying to explore in my comments was how the 'lists' were loaded from the csv.

If I make a dataframe with list elements:

In [314]: df = pd.DataFrame([None,None], columns=['data'])                                           
In [315]: df['data']=[[1,2,3], [4,5]]                                                                
In [316]: df                                                                                         
Out[316]: 
        data
0  [1, 2, 3]
1     [4, 5]
In [317]: df['data'][1]                                                                              
Out[317]: [4, 5]
In [318]: type(_)                                                                                    
Out[318]: list
In [319]: df.to_csv?                                                                                 
In [320]: df.to_csv('test.csv', index=False)                                                         
In [321]: cat test.csv                                                                               
data
"[1, 2, 3]"
"[4, 5]"

The csv contains the quoted string representation of each list; csv is a 2d format and can't directly handle the implied third dimension of those lists.

When loaded, the dataframe contains strings, not lists. The strings look just like lists, but look more carefully:

In [322]: df1 = pd.read_csv('test.csv')                                                              
In [323]: df1                                                                                        
Out[323]: 
        data
0  [1, 2, 3]
1     [4, 5]
In [324]: df1['data'][1]                                                                             
Out[324]: '[4, 5]'
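
One possible way around this, as a minimal sketch: parse each string back into a list with `ast.literal_eval` before splitting it into columns. This assumes every cell in the column is a valid Python list literal, as `to_csv` writes it. The sample data and the in-memory buffer below are stand-ins for `tweet_file.csv`, not the asker's real file.

```python
import ast
import io

import pandas as pd

# Stand-in for the dataframe built in football_tweets.py (sample values only).
original = pd.DataFrame({'emoji_code': [['U+1F60D'], ['U+1F3B6', 'U+1F4A7']]})

# Round-trip through CSV text, using an in-memory buffer instead of a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After loading, each cell is a string that merely looks like a list.
type_before = type(loaded['emoji_code'][0])

# Parse the strings back into real Python lists.
loaded['emoji_code'] = loaded['emoji_code'].apply(ast.literal_eval)
type_after = type(loaded['emoji_code'][0])

# Now the split into one column per list element behaves as in the first script.
new_df = pd.DataFrame(loaded['emoji_code'].tolist()).add_prefix('emoji_')
print(type_before, type_after)
print(new_df)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.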

Oh I see what you mean! Yes, my second implementation prints strings that look like lists too. Is there a way around this? — bailslearnsstuff, Jul 21 '20 at 16:39
I found out why from this solution: https://stackoverflow.com/questions/23111990/pandas-dataframe-stored-list-as-string-how-to-convert-back-to-list Thank you so much for pointing me in the right direction! — bailslearnsstuff, Jul 22 '20 at 07:57
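
The conversion can also happen while the file is being read, which may suit analyseFiles.py better. A sketch, assuming the `emoji_code` column always holds valid list literals: the `converters` argument of `pd.read_csv` applies a function to every cell of the named column. The inline CSV text here is made-up sample data standing in for `tweet_file.csv`.

```python
import ast
import io

import pandas as pd

# Made-up sample of what tweet_file.csv might contain for this column.
csv_text = """emoji_code
"['U+1F60D']"
"['U+1F3B6', 'U+1F4A7']"
"""

# converters maps a column name to a function applied to each raw cell string.
df = pd.read_csv(io.StringIO(csv_text), converters={'emoji_code': ast.literal_eval})

# The cells are lists again, so the original one-liner works unchanged.
new_df = pd.DataFrame(df.emoji_code.values.tolist()).add_prefix('emoji_')
print(new_df)
```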
