I have a DataFrame of books from which I removed and reworked some information. However, some rows in the column "bookISBN" contain duplicate values, and I want to merge all those rows into one.
I plan to make a new DataFrame where I keep the first value of the ISBN, the title, and the genre, but sum the values of the column "genreVotes" to create the merge. How can I do this?
Original dataframe:
In [23]: network = data[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
network.head().to_dict("list")
Out [23]:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [103407, 80856, 59070, 73590, 26376]}
Duplicates:
In [24]: duplicates = network[network.duplicated(subset=["bookISBN"], keep=False)]
duplicates.loc[(duplicates["bookISBN"] == "9780439023481") | (duplicates["bookISBN"] == "9780375831003")].to_dict("list")
Out [24]:
{'bookTitle': ['The Hunger Games',
'The Book Thief',
'The Hunger Games',
'The Book Thief',
'The Book Thief'],
'bookISBN': ['9780439023481',
'9780375831003',
'9780439023481',
'9780375831003',
'9780375831003'],
'highestVotedGenre': ['Young Adult',
'Historical-Historical Fiction',
'Young Adult',
'Historical-Historical Fiction',
'Historical-Historical Fiction'],
'genreVotes': [103407, 59070, 103407, 59070, 59070]}
(In this example the duplicated votes happen to be identical, but in some cases the values differ.)
Expected output:
{'bookTitle': ['The Hunger Games',
'Twilight',
'The Book Thief',
'Animal Farm',
'The Chronicles of Narnia'],
'bookISBN': ['9780439023481',
'9780316015844',
'9780375831003',
'9780452284241',
'9780066238500'],
'highestVotedGenre': ['Young Adult',
'Young Adult',
'Historical-Historical Fiction',
'Classics',
'Fantasy'],
'genreVotes': [206814, 80856, 177210, 73590, 26376]}
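I was thinking something along the lines of a groupby with a different aggregation per column might work, for example (a rough, untested sketch; "merged" is just a placeholder name):

# Group on the ISBN, keep the first title/genre per group and sum the votes.
# sort=False keeps the groups in their original order of appearance.
merged = network.groupby("bookISBN", as_index=False, sort=False).agg(
    {"bookTitle": "first", "highestVotedGenre": "first", "genreVotes": "sum"}
)

# Restore the original column order and compare against the expected output above.
merged = merged[["bookTitle", "bookISBN", "highestVotedGenre", "genreVotes"]]
merged.to_dict("list")

But I am not sure this is the idiomatic way to do it, or whether it is the right way to combine "first" and "sum".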