Only a fraction of the dataframe is merged in pandas - python

Question

I have a pandas dataframe with 124957 different tweets (all related to one central topic). Each date has more than one tweet (around 300 per day).

My goal is to perform sentiment analysis on the tweets of each day. To do this, I am trying to combine all tweets of the same day into one string, so that each date maps to a single string.

To achieve this, I have tried the following:

indx=0
get_tweet=""
for i in range(0,len(cdata)-1):
    get_date=cdata.date.iloc[i]
    next_date=cdata.date.iloc[i+1]
    if(str(get_date)==str(next_date)):
        get_tweet=get_tweet+cdata.text.iloc[i]+" "
    if(str(get_date)!=str(next_date)):
        cdata.loc[indx,'date'] = get_date
        cdata.loc[indx,'text'] = get_tweet
        indx=indx+1
        get_tweet=" "

df.to_csv("/home/development-pc/Documents/BTC_Tweets_1Y.csv")

My problem is that only a small part of the data is actually converted to this merged format.

Image of the dataframe

I do not know whether it matters, but the dataframe consists of three separate datasets that I combined into one using "pd.concat". After that, I sorted the new dataframe by date (ascending order) and reset the index, because it was reversed (the last entry (2020-01-03) had index 0 and the first entry (2019-01-01) had index 124958).
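
For reference, the preparation described above might look roughly like the following sketch. The three input frames (part_a, part_b, part_c) and their contents are hypothetical stand-ins, not the actual datasets; only the column names date and text come from the code above.

```python
import pandas as pd

# Three hypothetical source frames standing in for the separate datasets
part_a = pd.DataFrame({"date": ["2019-01-02"], "text": ["second"]})
part_b = pd.DataFrame({"date": ["2020-01-03"], "text": ["last"]})
part_c = pd.DataFrame({"date": ["2019-01-01"], "text": ["first"]})

# Combine, sort by date ascending, and rebuild a clean 0..n-1 index
cdata = (
    pd.concat([part_a, part_b, part_c], ignore_index=True)
      .sort_values("date", kind="stable")
      .reset_index(drop=True)
)
print(cdata)
```

Passing drop=True to reset_index discards the old (reversed) index instead of keeping it as an extra column.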

Thanks in advance, Filippos

Please [provide a reproducible copy of the DataFrame with `to_clipboard`](https://stackoverflow.com/questions/52413246/how-to-provide-a-copy-of-your-dataframe-with-to-clipboard). [Stack Overflow Discourages Screenshots](https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors). It is likely the question will be downvoted. You are discouraging assistance because no one wants to retype your data or code, and screenshots are often illegible. — Trenton McKinney, May 11 '20 at 07:32

score 0 · Accepted Answer · answered May 11 '20 at 08:17

Without going into the loop you used (I think you are only concatenating the first two instances, but I'm not sure), you could use groupby and apply. Here is an example:

# create some random data for example
import pandas as pd
import random
import string
dates = random.choices(pd.date_range(pd.Timestamp(2020,1,1), pd.Timestamp(2020,1,6)),k=11)
letters = string.ascii_lowercase
texts = [' '.join(''.join(random.choices(letters, k=random.randrange(2,10)))
                  for _ in range(random.randrange(3,12)))
         for _ in range(11)]
df = pd.DataFrame({'date':dates, 'text':texts})

# group
pd.DataFrame(df.groupby('date').apply(lambda g: ' '.join(g['text'])))
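
A variation on the same idea, as a minimal sketch rather than a definitive solution. It assumes the columns are named date and text as in the question, and that date may carry a time component, so it is reduced to the calendar day before grouping. The sample rows and the names day and daily are made up for illustration.

```python
import pandas as pd

# Hypothetical sample standing in for the question's cdata
cdata = pd.DataFrame({
    "date": ["2019-01-01 08:00:00", "2019-01-01 17:30:00",
             "2019-01-02 09:15:00", "2019-01-02 12:00:00",
             "2019-01-03 23:59:00"],
    "text": ["btc up", "btc down", "hodl", "to the moon", "sold"],
})

# Reduce timestamps to the calendar day so all tweets of a day share one key
cdata["day"] = pd.to_datetime(cdata["date"]).dt.date

# One row per day; tweets are joined in their original row order
daily = (
    cdata.groupby("day", sort=True)["text"]
         .agg(" ".join)
         .reset_index()
)
print(daily)
```

The result is built as a new dataframe instead of being written back into the rows of the one being read, so every day ends up with exactly one row, including the last one.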
