
For a current project, I am planning to group a Pandas DataFrame by stock_symbol as the first criterion and quarter as the second.

From other threads, I have seen that a structure like group_data = df.groupby(['stock_symbol', 'quarter']) could be a possible solution. In my case, however, printing the result only gives the terminal output <pandas.core.groupby.generic.DataFrameGroupBy object at 0x11fdcbf10>.

Can anyone spot the error in my thinking with this line? The relevant code section looks like this:

import pandas as pd

# Convert the 'date' column to datetime
df['date'] = pd.to_datetime(df['date'])
# Add the 'quarter' column
df['quarter'] = df['date'].dt.to_period('Q')
# Group by both the stock symbol and the quarter column
group_data = df.groupby(['stock_symbol', 'quarter'])
print(group_data)
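
For context, the printed `DataFrameGroupBy` object is the grouping itself, not an error: pandas only produces results once the groups are iterated, aggregated, or passed to `apply`. A minimal sketch of inspecting the groups, assuming made-up toy data in place of the real review file:

```python
import pandas as pd

# Toy data (made up for illustration; not the real review file)
df = pd.DataFrame({
    "stock_symbol": ["AMG", "AMG", "MMM"],
    "date": ["2013-01-01", "2013-02-15", "2017-05-14"],
    "txt_pro": ["great culture", "smart people", "flexible hours"],
})
df["date"] = pd.to_datetime(df["date"])
df["quarter"] = df["date"].dt.to_period("Q")

group_data = df.groupby(["stock_symbol", "quarter"])

# Iterating yields ((symbol, quarter), sub-DataFrame) pairs
group_sizes = {}
for (symbol, quarter), group in group_data:
    group_sizes[(symbol, str(quarter))] = len(group)

print(group_sizes)
```

Each `group` here is an ordinary DataFrame holding the rows of one stock symbol and quarter, so any text analysis function can be run on it inside the loop.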

The function to be applied to each group is shown below:

from sklearn.feature_extraction.text import CountVectorizer

# Word frequency analysis: return the n most frequent bigrams in the corpus
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

And the corresponding DataFrame has the following structure:

[
{"gld_index": "1-0", "stock_symbol": "AMG", "gld_id": "7172", "date": "2013-01-01", "author_job_title": "Current Employee - Vice President", "author_location": "Prides Crossing, MA", "txt_main": "I have been working at Affiliated Managers Group full-time (More than 5 years)", "txt_pro": "AMG has built and continues to develop its position as a world-class asset management company. Working in this entrepreneurial culture enables smart, driven and focused individuals the opportunity to further the company’s mission to be a global leader in the asset management industry. If you are looking for an intellectually challenging workplace there are few competitive firms within the financial services industry that offer a comparable range of professional opportunities. As AMG grows and expands its footprint, there will continue to be new and exciting positions within the business for employees to advance their careers.", "txt_con": "Given the “campus” is located north of Boston, meeting up for drinks with friends and associates after work can be challenging.", "txt_adviceMgmt": null, "rating_recommend": 2, "rating_outlook": 2, "rating_ceo": 2, "scr_avg": 5.0, "scr_balance": 5.0, "scr_values": 5.0, "scr_opportunities": 5.0, "scr_benefits": 5.0, "scr_management": 5.0},
{"gld_index": "1-1", "stock_symbol": "AMG", "gld_id": "7172", "date": "2014-03-13", "author_job_title": "Former Contractor - Anonymous Contractor", "author_location": "Beverly, MA", "txt_main": "I worked at Affiliated Managers Group as a contractor (Less than a year)", "txt_pro": "No reason whatsoever to work at AMG as a temp from an employment agency especially if you are the only one from your agency - You will be depressed or go insane! They are temp/contractor unfriendly.\n\nIf you are from an audit firm, you probably don't have much choice but to go and represent your audit firm. In your case, the fact that you will most likely come as a group from your audit firm may help you maintain your sanity.\n\nThe good thing about being a part of an audit group @ AMG is that they will assign a room to your group with a large camera close-by to watch and listen to your conversations.\n\nAnother good thing is that they will provide a phone room (probably bugged with an eavesdropping device) to secretly record your personal conversations.\n\nFinally, IT just got smart by adding the word 'contractor' to the e-mail addresses of contractors or temps! This way, they are alerting everyone to filter you out from the several, daily secret e-mails!!", "txt_con": "Your employment agency will most likely paint the most beautiful picture of this gated 'castle' located somewhere in \"wealth land\" 600 Some Street in Beverly, whereby chefs cook magnificent free lunch for the employees, cozy gym in a mansion etc.....well, they probably don't know you are not allowed in the gym, you are not part of this free lunch program which is meant for the privileged full-time AMG staff. They will also forget to tell you about this secret culture @ AMG!!! Is it a cult?\n\nYou are confined to your desk or room (in case of a audit group). You are allowed to go to the bathroom assigned to your group, you can use the photocopy machine (Camera? Where?). 
If you are the type that likes to stretch your legs after sitting for long hours, be careful where you go and how often!! HR will probably just appear like a fairy in front of you from nowhere and ask if they can walk you around (knowing fully well that you have worked there for months and you know your way around your confinement!)\n\nWhat is all this secrecy about? It seems to be more than just protecting vital information (which is normal for any company and understandable considering the nature of their business). Is there more to it?", "txt_adviceMgmt": "Use some of your charitable contributions to provide food for contractors/ temps. You can begin charity from home! Food is way too cheap in America!!!....especially the type of nature-abundant leaves and flowers you serve for lunch.\n\nAlso, keep everyone healthy irrespective of employment status- Open the gym for contractors too!! They will work more efficiently (great ROI). Remember to put posters all over the gym to read \" shhhhh\". This way full-time employees will remember not to discuss those secrets!!!\n\nGive some incentives to those front desk girls....they'll stay longer!", "rating_recommend": 0, "rating_outlook": 1, "rating_ceo": 1, "scr_avg": 1.0, "scr_balance": 1.0, "scr_values": 1.0, "scr_opportunities": 1.0, "scr_benefits": 1.0, "scr_management": 1.0},
{"gld_index": "1-2", "stock_symbol": "AMG", "gld_id": "7172", "date": "2011-09-15", "author_job_title": "Former Employee - Anonymous Employee", "author_location": "Beverly, MA", "txt_main": "Smart, driven, risk-oriented people; intellectually challenging environment; innovator in its industry so there is always something new going on; long hours and stressful at times but very respectful of personal commitments- they strike the right balance; compensation is very good, benefits are phenomenal, and expectations about both are very clear; IT and HR departments are the best I've ever worked with.", "txt_pro": "Smart, driven, risk-oriented people; intellectually challenging environment; innovator in its industry so there is always something new going on; long hours and stressful at times but very respectful of personal commitments- they strike the right balance; compensation is very good, benefits are phenomenal, and expectations about both are very clear; IT and HR departments are the best I've ever worked with.", "txt_con": "The only downside to AMG is that because it is so successful, people don't leave very often, so there is very little upward mobility. It is also a relatively lean organization so there aren't many management levels (a good thing mostly). Thus, if you are a subject expert and are happy with a role that allows you to flourish in your subject area, then this is a great place to be. Similarly, if you want a job that you can leverage into a better opportunity down the road, this is a great stepping stone. However, if you are looking for a place to join and move around or \"climb the ladder,\" you will be frustrated.", "txt_adviceMgmt": null, "rating_recommend": 2, "rating_outlook": null, "rating_ceo": 2, "scr_avg": 4.0, "scr_balance": 5.0, "scr_values": null, "scr_opportunities": 4.0, "scr_benefits": 5.0, "scr_management": 4.5},
{"gld_index": "1-0", "stock_symbol": "MMM", "gld_id": "446", "date": "2017-05-14", "author_job_title": "Current Employee - Technical Aide", "author_location": "Maplewood, MN", "txt_main": "I have been working at 3M part-time (More than 3 years)", "txt_pro": "Respectful treatment, flexible hours, trainings and events, networking", "txt_con": "Not easy to move up, very competitive hiring process (150+ candidates for FT jobs)", "txt_adviceMgmt": null, "rating_recommend": 2, "rating_outlook": 1, "rating_ceo": 2, "scr_avg": 4.0, "scr_balance": 4.0, "scr_values": 5.0, "scr_opportunities": 3.0, "scr_benefits": 3.0, "scr_management": 4.0}
]
Rm4n
Malte Susen
  • You've created a `groupby` object named `group_data`. Now you need to do something with it to get the data you want, some sort of aggregating function like `sum` or `mean`. Or are you really just wanting to sort the dataframe? – cfort Jul 18 '20 at 14:35
  • Thanks for the input. The data is text which is being analysed through a subsequent function. It would hence not be possible to perform any mathematical operations with it – Malte Susen Jul 18 '20 at 14:36
  • Sounds like you don't want to `groupby`. You probably want `sort_values`. Something like `df.sort_values(['stock_symbol', 'quarter'])` – cfort Jul 18 '20 at 14:38
  • That could be an approach. However, I plan to run text analysis iterations based on the sliced data. This means the plan is to run a new iteration for each field of a "matrix" defined by `stock_symbol` and `quarter` – Malte Susen Jul 18 '20 at 14:41
  • There's no `quarter` column in the data you included with your question – cfort Jul 18 '20 at 14:45
  • That's correct - the `quarter` column is calculated through `df['quarter'] = df['date'].dt.to_period('Q')` – Malte Susen Jul 18 '20 at 14:46
  • Maybe then you want something like this: https://stackoverflow.com/questions/62092600/how-does-apply-work-on-a-pandas-dataframe. You'd group and then iterate over the groups and apply whatever you want. – ALollz Jul 18 '20 at 14:50
  • Good point, thanks. I had actually tried that before but received an error `TypeError: unhashable type: 'list'` – Malte Susen Jul 18 '20 at 15:04
  • What's the function you're using in your apply call? Can you add it to the OP? – Rm4n Jul 18 '20 at 15:19
  • Thanks, I have added the function to the question text. It is a function to detect the frequency of specific words within the DataFrame. – Malte Susen Jul 18 '20 at 15:21
  • The output is a list of frequencies (i.e. a list of numbers for the respective words) – Malte Susen Jul 18 '20 at 15:25
  • Which columns are going to be analyzed? txt_main + txt_pro + txt_con? – Rm4n Jul 18 '20 at 15:34
  • The analysis shall be done for txt_main, txt_pro, txt_con and txt_adviceMgmt – Malte Susen Jul 18 '20 at 15:37

1 Answer


Here's one way to achieve what you're after:

The custom function:

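from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_bigram(group):
    # group: the sub-DataFrame for one (stock_symbol, quarter) pair
    text_cols = ['txt_main', 'txt_pro', 'txt_con', 'txt_adviceMgmt']
    # Treat nulls as empty strings and join the columns with a space (one document per review)
    corpus = group[text_cols].fillna('').agg(' '.join, axis=1)
    n = 2  # the top n
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]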
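
To make clear what the function counts, here is a simplified standard-library sketch of bigram counting. It is not equivalent to `CountVectorizer`: it removes no stop words and uses a cruder tokenizer, and the name `top_n_bigrams` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

def top_n_bigrams(corpus, n=2):
    """Return the n most frequent adjacent word pairs across all documents."""
    counts = Counter()
    for doc in corpus:
        # Lowercase and split into alphanumeric tokens (simplified tokenizer)
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        # Pair each token with its successor; pairs never span two documents
        counts.update(zip(tokens, tokens[1:]))
    return [(" ".join(pair), freq) for pair, freq in counts.most_common(n)]

docs = ["asset management company", "global asset management industry"]
top = top_n_bigrams(docs, n=1)
print(top)
```

As in the function above, each review is treated as its own document, and the counts are summed over all documents in the group.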

Then group the DataFrame and apply the function to each group:

df['date'] = pd.to_datetime(df['date'])
df['quarter'] = df['date'].dt.to_period('Q')
newdf = df.groupby(['stock_symbol', 'quarter']).apply(get_top_n_bigram).to_frame(name = 'frequencies')

print(newdf)
                                                  frequencies
stock_symbol quarter                                             
AMG          2011Q3         [(smart driven, 2), (driven risk, 2)]
             2013Q1   [(asset management, 2), (smart working, 1)]
             2014Q1     [(audit firm, 3), (employment agency, 2)]
MMM          2017Q2               [(working 3m, 1), (3m time, 1)]
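
One caveat with this approach is that null text fields (such as `txt_adviceMgmt` in the sample data) may turn the concatenated text into NaN. A hedged sketch of one way to build a clean list of documents per group first, using made-up toy data and a reduced set of columns:

```python
import pandas as pd

# Toy data (made up for illustration); None mimics the null txt_adviceMgmt values
df = pd.DataFrame({
    "stock_symbol": ["AMG", "AMG", "MMM"],
    "quarter": ["2013Q1", "2013Q1", "2017Q2"],
    "txt_pro": ["great culture", "smart people", "flexible hours"],
    "txt_adviceMgmt": [None, "open the gym", None],
})
text_cols = ["txt_pro", "txt_adviceMgmt"]

# Replace nulls with empty strings, then join the columns with a space
df["text"] = df[text_cols].fillna("").agg(" ".join, axis=1).str.strip()

# One list of documents per (stock_symbol, quarter) group
docs_per_group = df.groupby(["stock_symbol", "quarter"])["text"].apply(list)
print(docs_per_group)
```

Each list in `docs_per_group` could then be passed as `corpus` to a function with the signature from the question, `get_top_n_bigram(corpus, n)`.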
Rm4n
  • Many thanks - currently still running the code but it looks like you nailed it – Malte Susen Jul 18 '20 at 15:57
  • It's taking some time as it's a large data file. I assume it will all run smoothly though, given that you obtained the above outputs. Again, many thanks for your help. – Malte Susen Jul 18 '20 at 16:05