
I am working on a project using Pandas, and am running into trouble when trying to condense similar objects.

I have a dataframe with the columns Product ID, Currency, Price, and Book.

There are identical products in the Product ID column, such as X11, X23, X25, and so on. Each product appears up to 3 times, once per currency (there are 3 currencies), and each row has its own price and one of 3 Books.

I want a dataframe where each product is a single row, listing whichever Books it appears in and the price for each currency it has.

This is an example of what the Dataframe looks like now:

import pandas as pd

df = pd.DataFrame({'Product ID' : ['X11', 'X11', 'X11', 'X23', 'X23', 'X23', 'X25', 'X25'],
                    'Currency' : ['USD', 'EUR', 'GBP', 'USD', 'EUR', 'GBP', 'EUR', 'GBP'],
                    'Price' : [100, 90, 90, 200, 180, 180, 90, 90],
                    'Book' : ['America', 'Canada', 'Mexico', 'America', 'Canada', 'Mexico', 'Canada', 'Mexico']})


df

      Book Currency  Price Product ID
0  America      USD    100        X11
1   Canada      EUR     90        X11
2   Mexico      GBP     90        X11
3  America      USD    200        X23
4   Canada      EUR    180        X23
5   Mexico      GBP    180        X23
6   Canada      EUR     90        X25
7   Mexico      GBP     90        X25

Ultimately it will be converted to a JSON file that has all of that data in one record per product; before that happens, though, I need to condense the identical products.
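Roughly, each product should end up as a single record along these lines (the original example image isn't available, so this is only an illustration and the exact field names may differ):

{'Product ID': 'X11',
 'Book': 'America, Canada, Mexico',
 'USD Price': 100, 'EUR Price': 90, 'GBP Price': 90}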

What would be the best way of achieving this? I don't fully understand groupby, which several similar questions suggest, and I haven't seen a question that covers this exact case. The conversion itself shouldn't be too hard once each identical product occurs in only one row but still carries all of its price and Book data.

Any help is much appreciated.

Warthog1
  • `df.to_json(orient='records')` – cs95 Jan 10 '18 at 01:06
  • @cᴏʟᴅsᴘᴇᴇᴅ my question is more aimed towards condensing the identical products while retaining all of the various prices and books, not the actual CSV to JSON conversion. From my understanding, doing that conversion right now would result in multiple objects for products with the same ID – Warthog1 Jan 10 '18 at 01:37
  • Right, my misunderstanding. Reopened. – cs95 Jan 10 '18 at 01:50
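
To illustrate the point in those comments: calling to_json directly on the original df emits one record per row, so each product appears several times (a rough sketch of the output, not the exact string):

df.to_json(orient='records')
# -> [{"Product ID":"X11","Currency":"USD","Price":100,"Book":"America"},
#     {"Product ID":"X11","Currency":"EUR","Price":90,"Book":"Canada"}, ...]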

2 Answers


You can solve this in two stages.

The first stage requires a pivot on 3 columns -

i = df.pivot(index='Product ID', columns='Currency', values='Price')\
      .add_suffix(' Price')\
      .reset_index()\
      .rename_axis(None, axis=1)

i
  Product ID  EUR Price  GBP Price  USD Price
0        X11       90.0       90.0      100.0
1        X23      180.0      180.0      200.0
2        X25       90.0       90.0        NaN

Next, perform a groupby on Product ID and aggregate the country names in Book -

j = df.groupby('Product ID').Book.agg(', '.join).reset_index()
j

  Product ID                     Book
0        X11  America, Canada, Mexico
1        X23  America, Canada, Mexico
2        X25           Canada, Mexico

Now, merge the two intermediate results -

df = i.merge(j)
df

  Product ID  EUR Price  GBP Price  USD Price                     Book
0        X11       90.0       90.0      100.0  America, Canada, Mexico
1        X23      180.0      180.0      200.0  America, Canada, Mexico
2        X25       90.0       90.0        NaN           Canada, Mexico

Now, convert df to JSON using .to_json in the records format -

df.to_json('file.json', orient='records')

file.json (output)

[
    {
        "GBP Price": 90.0,
        "Book": "America, Canada, Mexico",
        "Product ID": "X11",
        "EUR Price": 90.0,
        "USD Price": 100.0
    },
    {
        "GBP Price": 180.0,
        "Book": "America, Canada, Mexico",
        "Product ID": "X23",
        "EUR Price": 180.0,
        "USD Price": 200.0
    },
    {
        "GBP Price": 90.0,
        "Book": "Canada, Mexico",
        "Product ID": "X25",
        "EUR Price": 90.0,
        "USD Price": NaN
    }
]
cs95
  • @C0LDSPEED Thank you so much for answering. When I tried this, I got the error "Index contains duplicate entries, cannot reshape". – Warthog1 Jan 10 '18 at 02:00
  • @AlexBozanic Sorry, there were a couple of bugs which I've corrected. – cs95 Jan 10 '18 at 02:00
  • Thank you again for your help. Somewhat confused on this new error. Is it using Product ID as the index, and that is why its throwing that error? – Warthog1 Jan 10 '18 at 02:13
  • @AlexBozanic Are you still getting this error? If so, on what line? Can you provide some data to reproduce it? – cs95 Jan 10 '18 at 02:14
  • I am still getting it. Jupyter Notebook says that the error is on the first line, where the pivot occurs. Conceptually the entire process makes sense, and I think it is the solution that I was looking for, now I just gotta make it work. – Warthog1 Jan 10 '18 at 02:31
  • When I check to see what my columns are, I see only those 4 columns, so it must be indexing on one of those – Warthog1 Jan 10 '18 at 02:36
  • @AlexBozanic It's possible you have duplicates, so you might want to do this first: `df = df.drop_duplicates(['Product ID', 'Currency'])` and _then_ pivot. – cs95 Jan 10 '18 at 02:38
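
A minimal sketch of that last suggestion, assuming duplicate (Product ID, Currency) pairs should be resolved by keeping the first row (the pivot itself is unchanged):

# Sketch only: assumes df is the original four-column DataFrame from the question.
# Drop duplicate Product ID / Currency rows, keeping the first, then pivot as above.
deduped = df.drop_duplicates(['Product ID', 'Currency'])
i = deduped.pivot(index='Product ID', columns='Currency', values='Price')\
           .add_suffix(' Price')\
           .reset_index()\
           .rename_axis(None, axis=1)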

You can also use a for loop to create the desired output:

outlist = []                            # empty output list to be filled
uid = pd.unique(df['Product ID'])       # get unique ID values
for id in uid:
    subdf = df[df['Product ID'] == id]  # get sub-dataframe for one ID
    entry = {}                          # empty dictionary to be filled, one for each ID
    entry["Product Code"] = id          # add ID to dictionary
    entry["Book"] = ", ".join(subdf['Book'].tolist())   # add Book list to dictionary
    for row in subdf.values:            # get subdf as a list of lists
        entry[row[1]+" Price"] = row[2] # add prices to dictionary
    outlist.append(entry)               # add dictionary to outlist

print(outlist)

Output:

[{'Product Code': 'X11', 'GBP Price': 90, 'EUR Price': 90, 'USD Price': 100, 'Book': 'America, Canada, Mexico'}, 
 {'Product Code': 'X23', 'GBP Price': 180, 'EUR Price': 180, 'USD Price': 200, 'Book': 'America, Canada, Mexico'}, 
 {'Product Code': 'X25', 'GBP Price': 90, 'EUR Price': 90, 'Book': 'Canada, Mexico'}]
rnso
  • this seems like it would work, however I get the error "Unsupported operand type(s) for +: 'float' and 'str'". It is coming from the line entry[row[1]+" Price"] = row[2]. Can you provide any guidance as to how to get rid of that? – Warthog1 Jan 10 '18 at 16:22
  • Print one row and post here. For me row[1] was GBP etc and row[2] was price. – rnso Jan 10 '18 at 16:27
  • It is likely your row[1] is price column, which contains integer values, hence you are getting this error. You will have to then edit this line: `entry[row[1]+" Price"] = row[2]` with appropriate indices. – rnso Jan 10 '18 at 17:36
  • Okay, so just to be totally clear, which of those values should I update to the correct index? The correct index being Price – Warthog1 Jan 10 '18 at 18:06
  • `entry[row[index_of_GBP_etc]+" Price"] = row[index_of_Price]`. Basically, one is creating a key "GBP Price" in entry dictionary and allocating its value to be `row[index_of_price]`. – rnso Jan 10 '18 at 18:16
  • I see. That makes sense, which is why I'm confused why its not working. I called dtypes on my dataframe, and see that all of the columns are "object" data type, while Price is float64. Should I convert the float64 objects to something else? – Warthog1 Jan 10 '18 at 18:29
  • Is your code working now? If so, you should upvote/accept the answer. We cannot have multiple comments for extended discussion here. – rnso Jan 10 '18 at 18:31
  • It is not working but I will continue to try and make it work. I feel I am very close. It has been marked as solved and upvoted. Thank you so much for your help! – Warthog1 Jan 10 '18 at 18:42
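
For reference, here is a sketch of the same loop that looks fields up by column name instead of by position, which sidesteps the index mix-up discussed above (it assumes the df from the question):

# Sketch only: build each record by column name rather than positional index,
# so the result does not depend on the DataFrame's column order.
outlist = []
for pid, subdf in df.groupby('Product ID'):
    entry = {'Product Code': pid, 'Book': ', '.join(subdf['Book'])}
    for _, row in subdf.iterrows():
        entry[row['Currency'] + ' Price'] = row['Price']
    outlist.append(entry)
print(outlist)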