I have sample data that looks like this (the real dataset has more columns):

import pandas as pd

data = {'stringID':['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA'],'IDct':[1,3,4]}
data = pd.DataFrame(data)
data['Index1'] = [[3,6],[7,9],[5,6]]
data['Index2'] = [[4,8],[10,13],[8,9]]

What I want to achieve: slice the stringID column from the second element of Index1 to the second element of Index2 (both columns hold lists), but only if the IDct value is greater than 1; otherwise return NaN.

I tried the following, and it produces the Output1 column as expected, but there must be a better way to do it (by which I mean faster when applied to a large dataset). Please advise, thanks!

data['pos'] = data.Index1.map(lambda x: x[1])
data['pos1'] = data.Index2.map(lambda x: x[1])

def cal(m):
    # Slice only when IDct is greater than 1
    if m['IDct'] > 1:
        return m['stringID'][m['pos']:m['pos1']]
    else:
        return float('nan')  # a real NaN rather than the string 'NaN'

data['Output1'] = data.apply(cal,axis=1)

April
  • You say there "must be a better way to do it". In your case, what would define a "better" way? What is the problem you have with the current method? Memory efficiency, time efficiency, etc? – G. Anderson Sep 24 '20 at 19:39
  • I'm thinking of a clearer or faster way, if that makes sense, for example a shorter calculation time when applied to a very large data set. – April Sep 24 '20 at 19:40
  • Here is a [really, really good overview](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care) of some times when native pandas methods are best, when loops or apply are just as good, and when to drop back to regular old python – G. Anderson Sep 24 '20 at 21:20

1 Answer

I love pandas - but realistically speaking it's just one of many tools that belong in your tool belt.

pandas and numpy really shine for computation and analysis. It's okay to use pandas to visualize and analyze your data - but that doesn't mean it's the right tool for the job.

This kind of problem is better suited to regular Python. Assuming we can, let's move StringID and IDct out of the dict and back into lists, and assume the data is regular in shape (all lists are of equal length):

StringID = ['AB CD Efdadasfd','RFDS EDSfdsadf dsa','FDSADFDSADFFDSA']
IDct = [1,3,4]
Index1 = [[3,6],[7,9],[5,6]]
Index2 = [[4,8],[10,13],[8,9]]

result = []  # created once, before the loop, so it accumulates every row
for string_id, id_ct, idx1, idx2 in zip(StringID, IDct, Index1, Index2):
    if id_ct > 1:
        # slice between the second elements of Index1 and Index2
        result.append(string_id[idx1[1]:idx2[1]])
    else:
        result.append(None)

You can then blend the result data back in as you see fit.

import pandas as pd

data = {
    'StringID': StringID,
    'IDct': IDct,
    'Index1': Index1,
    'Index2': Index2,
    'Result': result
}

pd.DataFrame(data)
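
If the data already lives in a DataFrame, the same plain-Python idea can be written as a single list comprehension over the zipped columns, which avoids the per-row overhead of apply(axis=1). This is only a minimal sketch using the sample data from the question; the choice of np.nan as the fallback value is an assumption based on the question's wording, and any speed-up on a real dataset should be measured rather than taken for granted.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'stringID': ['AB CD Efdadasfd', 'RFDS EDSfdsadf dsa', 'FDSADFDSADFFDSA'],
    'IDct': [1, 3, 4],
    'Index1': [[3, 6], [7, 9], [5, 6]],
    'Index2': [[4, 8], [10, 13], [8, 9]],
})

# Walk the four columns in lockstep; slice only where IDct > 1,
# otherwise store NaN (assumed fallback value).
data['Output1'] = [
    s[i1[1]:i2[1]] if ct > 1 else np.nan
    for s, ct, i1, i2 in zip(
        data['stringID'], data['IDct'], data['Index1'], data['Index2']
    )
]

print(data[['stringID', 'IDct', 'Output1']])
```

Because the result contains both strings and NaN, the new column has object dtype.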
Yvan Aquino
  • Thank you! I do have a follow-up question about lists of varying length: for example, I want to pick out the second element of each list, but some lists contain only one value. I tried np.where(data['IDct']>1, data.Index1.map(lambda x: x[1]),0) and np.where(data['IDct']>1, [x[1] for x in data['Index1']],0), but both raise a "list index out of range" error... – April Sep 24 '20 at 21:22
  • Use regular Python logic - simple is better. If Index1 and Index2 are of variable length, then use their lengths to decide what to do, i.e. if len(Index1) < 1: None/NaN, elif len(Index1) == 1: Index1[0], else: Index1[1]. – Yvan Aquino Sep 24 '20 at 21:35
  • Thanks! I tried data.loc[data['IDct']>1]['Index1'].apply(lambda x:x[1]) and it worked as well! – April Sep 25 '20 at 15:12
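
The length-based logic from the comments above can be sketched as a small helper. The np.where attempts fail because both branch arguments are computed for every row before np.where selects between them, so x[1] is still evaluated on the short lists. The function name second_or_fallback and the sample lists below are hypothetical, chosen only for illustration.

```python
def second_or_fallback(values):
    """Return values[1] if it exists, else values[0], else None for an empty list."""
    if len(values) < 1:
        return None
    if len(values) == 1:
        return values[0]
    return values[1]


# Hypothetical variable-length lists: two elements, one, none, three.
index1 = [[3, 6], [7], [], [5, 6, 9]]
picked = [second_or_fallback(v) for v in index1]
print(picked)  # [6, 7, None, 6]
```

The same helper could be applied to a DataFrame column with data['Index1'].map(second_or_fallback).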