I am using pandas and uproot to read data from a .root file, and I get a table like the following one:

[Image: screenshot of the resulting DataFrame]

The table is produced by the following code:

import uproot

fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
df = ttree.pandas.df(branches, flatten=False)

For each row, I need to find the maximum value in LepPt and then retrieve the LepLepId entry at the same position as that maximum. Finding the maximum values is not a problem:

Pt_l1 = [max(i) for i in df.LepPt]

This gives me a list with the maximum of each row. However, I have to separate these values according to LepLepId: I need one array with the maximum LepPt where |LepLepId| = 11 and another with the maximum LepPt where |LepLepId| = 13.

If someone could give me any hint, advice and/or suggestion, I would be very grateful.

aleolomorfo
  • `groupby` or `idxmax` – G. Anderson Feb 06 '20 at 21:37
  • Please do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors – AMC Feb 07 '20 at 02:50
  • Have you tried anything, done any research? Which part are you struggling with? I'm also curious as to why you're storing list or arrays, instead of having each element in a column. – AMC Feb 07 '20 at 02:51

2 Answers

I made some mock data since you didn't provide yours in any easy format. I think this is what you are looking for.

import pandas as pd

# Mock data: one row per event, each cell holding a per-lepton list
df = pd.DataFrame.from_records(
    [   [[1,2,3], [4,5,6]],
        [[4,6,5], [7,8,9]]
    ],
    columns=['LepPt', 'LepLepId']
)

df['max_LepPt'] = [max(i) for i in df.LepPt]

def f(row):
    # position of the maximum within this row's LepPt;
    # list() makes .index available for numpy arrays as well as lists
    pos = list(row['LepPt']).index(row['max_LepPt'])
    return row['LepLepId'][pos]

df['same_index_LepLepId'] = df.apply(f, axis=1)

returns:

    LepPt       LepLepId    max_LepPt   same_index_LepLepId
0   [1, 2, 3]   [4, 5, 6]   3           6
1   [4, 6, 5]   [7, 8, 9]   6           8
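
To finish the split by lepton flavour that the question asks for, a boolean mask on the absolute ID is probably enough. This is a minimal sketch on made-up values: the numbers and the names lead_pos, max_LepLepId, pt_electrons and pt_muons are assumptions for illustration, and it uses np.argmax instead of max plus index to locate the leading lepton.

```python
import numpy as np
import pandas as pd

# Mock events (hypothetical values): per-event lepton pT and signed lepton IDs
df = pd.DataFrame({
    "LepPt": [np.array([40.0, 25.0, 10.0]), np.array([12.0, 55.0]), np.array([30.0, 18.0])],
    "LepLepId": [np.array([11, -11, 13]), np.array([13, -13]), np.array([-11, 11])],
})

# Position of the highest-pT lepton within each event
lead_pos = [int(np.argmax(pt)) for pt in df["LepPt"]]

# Pick the pT and the ID at that position, event by event
df["max_LepPt"] = [pt[i] for pt, i in zip(df["LepPt"], lead_pos)]
df["max_LepLepId"] = [ids[i] for ids, i in zip(df["LepLepId"], lead_pos)]

# Split the leading-lepton pT by |ID|: 11 and 13, as in the question
abs_id = df["max_LepLepId"].abs()
pt_electrons = df.loc[abs_id == 11, "max_LepPt"].to_numpy()
pt_muons = df.loc[abs_id == 13, "max_LepPt"].to_numpy()
```

Both results are ordinary NumPy arrays, so they can be passed straight to a histogramming call. This assumes every event has at least one lepton, since np.argmax raises on an empty array.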
ak_slick
  • Hi ak_slick! Yes, that's exactly what I need. I have only one problem: I get the AttributeError ("'numpy.ndarray' object has no attribute 'index'", u'occurred at index 0'). What could be the problem? Sorry, I used to code in C++ and I am new to Python, so I haven't mastered this language at all. – aleolomorfo Feb 07 '20 at 09:45
  • Updated answer with your fix. Good catch. – ak_slick Feb 07 '20 at 18:04

You could use the JaggedArray interface of awkward (one of the dependencies of uproot) for this, which allows you to have irregularly sized arrays.

For this you would need to slightly change the way you load the data, but it allows you to use the same methods you would use with a normal numpy array, namely argmax:

import uproot

fname = 'ZZ4lAnalysis_VBFH.root'
key = 'ZZTree/candTree'
ttree = uproot.open(fname)[key]
# branches = ['Z1Flav', 'Z2Flav', 'nCleanedJetsPt30', 'LepPt', 'LepLepId']
branches = ['LepPt', 'LepLepId']   # to save memory, only load what you need

# df = ttree.pandas.df(branches, flatten=False)
a = ttree.arrays(branches)    # use awkward array interface

# per-event index of the highest-pT lepton, then the ID at that index
max_pt_idx = a[b'LepPt'].argmax()
max_pt_lepton_id = a[b'LepLepId'][max_pt_idx].flatten()

This is then just a normal numpy array, which you can assign to a column of a pandas dataframe if you want to. It should have the right dimensionality and order. It should also be faster than using the built-in Python functions.

Note that the keys are bytestrings rather than normal strings. You will also have to take some extra steps if there are events with no leptons: flatten skips those empty events, so the result no longer lines up row by row with the other columns.
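
One way to keep the alignment when some events are empty is to handle them explicitly. The sketch below uses plain NumPy in a Python loop rather than the awkward API, so it trades speed for clarity. The mock arrays and the NO_LEPTON sentinel are assumptions for illustration, not part of the original data.

```python
import numpy as np

# Mock per-event arrays (hypothetical values); the second event has no leptons
lep_pt = [np.array([40.0, 25.0]), np.array([]), np.array([12.0, 55.0])]
lep_id = [np.array([11, -11]), np.array([], dtype=int), np.array([13, -13])]

NO_LEPTON = 0  # hypothetical sentinel; 0 is not a lepton ID used here

# One entry per event, so the result stays aligned with the input rows
leading_id = np.array([
    ids[np.argmax(pt)] if len(pt) > 0 else NO_LEPTON
    for pt, ids in zip(lep_pt, lep_id)
])

# Mask selecting the events that actually contain a lepton
has_lepton = leading_id != NO_LEPTON
```

Because leading_id has exactly one entry per event, it can be assigned to a DataFrame column directly, and has_lepton can be used to drop the placeholder rows later.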

Alternatively, you can also convert the columns afterwards:

import awkward

df = ttree.pandas.df(branches, flatten=False)

max_pt_idx = awkward.fromiter(df["LepPt"]).argmax()
lepton_id = awkward.fromiter(df["LepLepId"])
df["max_pt_lepton_id"] = lepton_id[max_pt_idx].flatten()

The first approach (loading with ttree.arrays) will be faster if you don't need the DataFrame columns again afterwards; otherwise the second (converting the columns) might be better.

Graipher