0

I have the following pandas dataframe df:

  Book_Category |   Book_Title                       |  Revenue 
  Thriller        You don't know what I have done       200
  Romance         Last Summer I loved you               100

I am trying to find a way to create a new dataframe, by word in the Book Title (please note that lower and upper case should not matter)

This is the end goal df2:

Book_Title_word   | Revenue 
you                   300
I                     300
don't                 200
know                  200
what                  200
have                  200
done                  200
last                  100
summer                100
loved                 100

Because the words I and you were in both titles, the revenue was summed for them.

Is this feasible in python?

Thank you very much

UPDATE:

Because I am using larger numbers, when using the revenue provided by A-Za-z is in scientific notation fromat ('2.155051e-01').

Book_Category |   Book_Title                       |  Revenue  | Quantity
  A               ...what ...                          3459283      45757
  B               what ...                             4376899      35657
  C               .....what                            4567856      7689

df_new = pd.DataFrame(df['Book_Title'].str.split(' ').tolist(),  index=df['Revenue']).stack().reset_index()[[0, 'Revenue']]
df_new.columns = ['Book_Title_word', 'Revenue']
df_new.Book_Title_word = df_new.Book_Title_word.str.lower()

df_new.groupby('Book_Title_word').sum().sort_values(by = 'Revenue',ascending = False)

Book_Title_word   |   Revenue 
what                 2.160651e-01

This fixed the issue

pd.set_option('display.float_format', lambda x: '%.3f' % x) 

from this answer Format / Suppress Scientific Notation from Python Pandas Aggregation Results

Community
  • 1
  • 1
jeangelj
  • 4,338
  • 16
  • 54
  • 98

2 Answers2

1

I am not sure if this is computationally optimal (due to some quadratic researches and lambda function...), but at least it's short enough:

new_df = pd.DataFrame({'Book_Title_word': pd.unique(np.concatenate(df.Book_Title.str.lower().str.split()))})
new_df['Revenue'] = new_df.Book_Title_word.apply(lambda x: df.loc[df.Book_Title.str.lower().str.contains(x), 'Revenue'].sum())

The first line creates a new dataframe with only one column given by the concatenation of all the words in the titles, in lower case, and removing duplicates. The second line then goes through every one of these elements and checks which lines have a title that (after putting it to lower case) contains that word; if so gets the revenue, and then sums all the results together.

EDIT: As made me notice by the comment below, the above does not work if a title contains special characters for regex, such as *, +, parentheses, etc. Therefore regex should be explicitly excluded by replacing the second line with:

new_df['Revenue'] = new_df.Book_Title_word.apply(lambda x: df.loc[df.Book_Title.str.lower().str.contains(x, regex=False), 'Revenue'].sum())
Marco Spinaci
  • 1,750
  • 15
  • 22
  • thank you very much - I get the following error "error: nothing to repeat" for the second line – jeangelj Mar 29 '17 at 18:56
  • With the provided df I do not have such an error, maybe you are using different data? Looking at this thread: http://stackoverflow.com/questions/28606617/pandas-python-regex-error-nothing-to-repeat it looks like this might be caused by a regex, indeed I can replicate the error adding a * or a + to one title. Anyway, there is an easy fix, which I actually should have though of already: instead of `str.contains(x)`, try using `str.contains(x, regex=False)` to avoid using regex in general (in general they could mess up if the titles contain special characters such as *, +, parentheses, etc.). – Marco Spinaci Mar 30 '17 at 22:18
1

Here is one way. Split the book title column by word to create a new dataframe

df_new = pd.DataFrame(df['Book_Title'].str.split(' ').tolist(),  index=df['Revenue']).stack().reset_index()[[0, 'Revenue']]
df_new.columns = ['Book_Title_word', 'Revenue']
df_new.Book_Title_word = df_new.Book_Title_word.str.lower()

Now use groupby to find the total revenue per each word

df_new.groupby('Book_Title_word').sum().sort_values(by = 'Revenue',ascending = False)


                Revenue
Book_Title_word 
i               300
you             300
don't           200
done            200
have            200
know            200
what            200
last            100
loved           100
summer          100
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • thank you for this answer - it does work, but the numbers I get for revenue are in this format 2.155051e-01- is there a way to avoid that? Also, if I would like to add up two metrics - for example Revenue and Quantity_bought; can I just add Quantity_bought to the index? – jeangelj Mar 29 '17 at 18:26
  • Can you provide a snippet of the example you are using? It would be easier to understand the issue – Vaishali Mar 29 '17 at 18:28
  • thank you - I added an example of the data; I would like to both add up revenue and quantity and avoid the scientific notation format; can I use .format()? – jeangelj Mar 29 '17 at 18:43