2

I'm new to pandas and would like to analyse some data arranged like this:

label                     aa                        bb
index
    0           [2, 5, 1, 4]          [x1, x2, y1, z1]
    1             [3, 3, 19]              [x3, x4, y2]
    2    [6, 4, 2, 8, 9, 10]  [y1, y2, z3, z4, x1,  w]

in which x1,x2,x3,x4 are of type M; y1,y2 are of type N; and z1,z2,z3,z4 are of type O. Note that data[2,'bb'] is w, which does not belong to any type. This relationship is defined in mongodb as follows

{'_id' : ObjectId(x1), type : 'M'}
{'_id' : ObjectId(y1), type : 'N'}
{'_id' : ObjectId(z1), type : 'O'}...
db.data.find({'_id' : ObjectId(w)}) is null

The desired output would be like this:

label     sum_M   sum_N   sum_O
index
    0         7       1       4
    1         6      19       0
    2         9      10      10

Does anyone know how to do this with pandas?

Gordon Bean
  • 4,272
  • 1
  • 32
  • 47
Philip
  • 165
  • 1
  • 12
  • Can you provide code that generates the example data? It's not clear what your data actually are, and therefore it is difficult to provide instructions for how to do what you need. – Gordon Bean Sep 28 '16 at 18:44
  • The example Dataframe is generated from a list of mongodb documents which have the following format `{_id:id, amount:[2,5,1,4], numbering:[x1,x2,y1,z1]}` by `pandas.DataFrame(list(collection.find({}))` – Philip Sep 28 '16 at 18:54

1 Answers1

2

Pandas works best when your data are in a table format and individual cells contain values, not collections. To use pandas effectively for your problem, you need to change the way you create your data table.

Ultimately, it looks like you want to generate a table with columns representing object "id", "amount", and "numbering".

    id     amount    numbering
0   abc    2         x1
1   abc    5         x2
2   abc    1         y1
3   abc    4         z1
4   def    3         x3
etc.

To create this table, you can probably use a list of dictionaries, each dictionary containing the data for a row in your table, e.g.:

{'id':'abc', 'amount': 2, 'numbering':'x1'}

You can construct a pandas DataFrame from this list: Convert list of dictionaries to Dataframe

Then you would add a column that represents the type associated with each "numbering" value:

data['dbtype'] = data.numbering.map(lambda num: {'x':'M','y':'N','z':'O'}.get(num[0], 'None'))

Then you would use groupby:

data.groupby('dbtype').sum()
Community
  • 1
  • 1
Gordon Bean
  • 4,272
  • 1
  • 32
  • 47
  • Thanks! I tried `expand = data.apply(lambda x: Series(x['amount']), axis=1)` and then `melt = pd.melt(expand)`. However, since the length of the list is different in this case, I don't think expand & melt is a good idea to solve this question. Just like you said, pre-treatment of data probably is the best solution here. – Philip Sep 28 '16 at 20:19