I'm having difficulty getting the following comprehension to work as expected. It combines a nested for loop with conditionals.

Let me first explain what I'm doing:

import pandas as pd

dict1 = {'stringA':['ABCDBAABDCBD','BBXB'], 'stringB':['ABDCXXXBDDDD', 'AAAB'], 'num':[42, 13]}

df = pd.DataFrame(dict1)
print(df)
        stringA       stringB  num
0  ABCDBAABDCBD  ABDCXXXBDDDD   42
1          BBXB          AAAB   13

This DataFrame has two columns stringA and stringB with strings containing characters A, B, C, D, X. By definition, these two strings have the same length.

Based on these two columns, I create a dictionary for each row that maps each index of stringA (starting at 0) to the corresponding index of stringB, which starts at num.

Here's the function I use:

def create_translation(x):
    x['translated_dictionary'] = {i: i +x['num'] for i, e in enumerate(x['stringA'])}
    return x

df2 = df.apply(create_translation, axis=1).groupby('stringA')['translated_dictionary']


df2.head()
0    {0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: ...
1                         {0: 13, 1: 14, 2: 15, 3: 16}
Name: translated_dictionary, dtype: object

print(df2.head()[0])
{0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: 48, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}

print(df2.head()[1])
{0: 13, 1: 14, 2: 15, 3: 16}

That's correct.

However, there are 'X' characters in these strings, which require special rules: If stringA has an X at position i, don't create a key-value pair for i. If stringB has an X at position i, the value should be -500 instead of i + x['num'].

I tried the following dictionary comprehension inside a for loop:

def try1(x):
    for count, element in enumerate(x['stringB']):
        x['translated_dictionary'] = {i: -500 if element == 'X' else  i + x['num'] for i, e in enumerate(x['stringA']) if e != 'X'}
    return x

That gives the wrong answer.

df3 = df.apply(try1, axis=1).groupby('stringA')['translated_dictionary']

print(df3.head()[0]) ## this is wrong!
{0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: 48, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}

print(df3.head()[1])   ## this is correct! There is no key for 2:15!
{0: 13, 1: 14, 3: 16}

There are no -500 values!

The correct answer is:

print(df3.head()[0])
{0: 42, 1: 43, 2: 44, 3: 45, 4: -500, 5: -500, 6: -500, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}

print(df3.head()[1])
{0: 13, 1: 14, 3: 16}
ShanZhengYang
  • Why does your last example have 13, 14, 16 instead of 13, 14, 15? – John Zwinck Oct 06 '18 at 23:07
  • @JohnZwinck That's based on the first rule. "If X is in stringA, don't create a key-value pair in the dictionary." In this case, `BBXB` has an X at 2:15. Does this make sense? – ShanZhengYang Oct 06 '18 at 23:24

2 Answers

Here's a simple way, without any comprehensions (because they aren't helping clarify the code):

def create_translation(x):
    out = {}
    num = x['num']
    for i, (a, b) in enumerate(zip(x['stringA'], x['stringB'])):
        if a == 'X':
            pass
        elif b == 'X':
            out[i] = -500
        else:
            out[i] = num
        num += 1
    x['translated_dictionary'] = out
    return x
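
If a comprehension is still preferred, the same logic fits in a single dict comprehension by pairing the two strings with zip. The attempt in the question appears to fail because its outer for loop rebuilds the whole dictionary once per character of stringB, so only the last character of stringB (which is never X in the sample data) decides every value. A minimal sketch follows; the helper name translate and its sentinel parameter are hypothetical, and it takes plain values rather than a DataFrame row:

```python
def translate(string_a, string_b, num, sentinel=-500):
    """Map each position i of string_a to i + num, applying the X rules."""
    return {
        i: sentinel if b == 'X' else i + num
        for i, (a, b) in enumerate(zip(string_a, string_b))
        if a != 'X'  # positions where stringA has X get no key at all
    }


row0 = translate('ABCDBAABDCBD', 'ABDCXXXBDDDD', 42)
row1 = translate('BBXB', 'AAAB', 13)
print(row0)
print(row1)
```

Inside the apply-based function this could be called as x['translated_dictionary'] = translate(x['stringA'], x['stringB'], x['num']).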
John Zwinck

Why not flatten your df? You can follow the approach in this post and then recreate the dict:

n=df.stringA.str.len()
newdf=pd.DataFrame({'num':df.num.repeat(n),'stringA':sum(list(map(list,df.stringA)),[]),'stringB':sum(list(map(list,df.stringB)),[])})


newdf['key']=newdf.groupby('num').cumcount() # position within each original string, numbered before filtering
newdf['value']=newdf['key']+newdf.num # default value is position + num
newdf=newdf.loc[newdf.stringA!='X'].copy() # remove rows where stringA is X
newdf.loc[newdf.stringB=='X','value']=-500 # assign -500 when stringB is X
[dict(zip(x['key'],x['value'])) for _,x in newdf.groupby('num')] # create one dict per num group
Out[390]: 
[{0: 13, 1: 14, 3: 16},
 {0: 42,
  1: 43,
  2: 44,
  3: 45,
  4: -500,
  5: -500,
  6: -500,
  7: 49,
  8: 50,
  9: 51,
  10: 52,
  11: 53}]
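
The order of operations matters in the flattening approach: if positions are numbered after the rows with X in stringA are dropped, the keys are renumbered and no longer match the original string positions. Below is a plain-Python sketch of the same idea (no pandas; the names rows, flat and result are illustrative only). It numbers positions first and groups by row position rather than by num, so rows that happen to share a num value would not be merged:

```python
rows = [
    {'stringA': 'ABCDBAABDCBD', 'stringB': 'ABDCXXXBDDDD', 'num': 42},
    {'stringA': 'BBXB', 'stringB': 'AAAB', 'num': 13},
]

# Flatten to one record per character, numbering positions BEFORE filtering.
flat = [
    (row_id, pos, a, b, row['num'])
    for row_id, row in enumerate(rows)
    for pos, (a, b) in enumerate(zip(row['stringA'], row['stringB']))
]

# Rebuild one dictionary per original row.
result = {row_id: {} for row_id in range(len(rows))}
for row_id, pos, a, b, num in flat:
    if a == 'X':
        continue  # X in stringA: no key for this position
    result[row_id][pos] = -500 if b == 'X' else pos + num

print(result[0])
print(result[1])
```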
BENY