Slice a dataframe based on conditions of column names and values

Question

I have a dataframe with languages as column names, and with 1x final column containing account names:

EN         DE         IT         Account
Milan      Mailand    Milano     Italy
Florence   Florenz    Firenze    Italy
London     London     Londra     UK
Belgrade   Belgrad    Belgrado   World

I would need to extract information from this database, creating all the possible lists based on a combination of column names (languages) and the values within the account column.

For example here the output would be:

EN_Italy = ['Milan', 'Florence']
DE_Italy = ['Mailand', 'Florenz']
IT_Italy = ['Milano', 'Firenze']
EN_UK = ['London']
DE_UK = ['London']
IT_UK = ['Londra']
EN_World = ['Belgrade']
DE_World = ['Belgrad']
IT_World = ['Belgrado']

Would it be possible to do so? Thank you!

What have you tried so far? What is the difference between your question and this one for exemple? https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas — OriolAbril, Apr 19 '18 at 13:54

score 3 · Answer 1 · answered Apr 19 '18 at 13:58

You can aggregate():

df = df.groupby("Account").aggregate(lambda k: list(k)).reset_index()

  Account            DE                 EN                 IT
0  Italy  [Mailand, Florenz]  [Milan, Florence]  [Milano, Firenze]
1     UK            [London]           [London]           [Londra]
2  World           [Belgrad]                [b]         [Belgrado]

To get your list, do a simple selection, e.g.

df[df.Account == "Italy"]["DE"]

0    [Mailand, Florenz]

jpp · Answer 2 · 2018-04-19T14:04:07.403

For a variable number of variables, a dictionary is often a good choice.

You can use collections.defaultdict for this:

from collections import defaultdict

d = defaultdict(list)

for row in df.itertuples():
    for i in row._fields[1:-1]:
        d[i+'_'+row.Account].append(getattr(row, i))

Result

defaultdict(list,
            {'DE_Italy': ['Mailand', 'Florenz'],
             'DE_UK': ['London'],
             'DE_World': ['Belgrad'],
             'EN_Italy': ['Milan', 'Florence'],
             'EN_UK': ['London'],
             'EN_World': ['Belgrade'],
             'IT_Italy': ['Milano', 'Firenze'],
             'IT_UK': ['Londra'],
             'IT_World': ['Belgrado']})

Explanation

Initialise a defaultdict of lists.
Iterate rows of your dataframe, and for each row iterate language column names.
For the inner loop, append dictionary items as required.

piRSquared · Answer 3 · 2018-04-19T15:46:34.530

Using stack

df.set_index('Account').unstack().groupby(level=[0, 1]).apply(list)

    Account
EN  Italy       [Milan, Florence]
    UK                   [London]
    World              [Belgrade]
DE  Italy      [Mailand, Florenz]
    UK                   [London]
    World               [Belgrad]
IT  Italy       [Milano, Firenze]
    UK                   [Londra]
    World              [Belgrado]
dtype: object


d = df.set_index('Account').ustack().groupby(level=[0, 1]).apply(list)
d.index = d.index.map('_'.join)

d

EN_Italy     [Milan, Florence]
EN_UK                 [London]
EN_World            [Belgrade]
DE_Italy    [Mailand, Florenz]
DE_UK                 [London]
DE_World             [Belgrad]
IT_Italy     [Milano, Firenze]
IT_UK                 [Londra]
IT_World            [Belgrado]
dtype: object

Or

d.to_dict()

{'DE_Italy': ['Mailand', 'Florenz'],
 'DE_UK': ['London'],
 'DE_World': ['Belgrad'],
 'EN_Italy': ['Milan', 'Florence'],
 'EN_UK': ['London'],
 'EN_World': ['Belgrade'],
 'IT_Italy': ['Milano', 'Firenze'],
 'IT_UK': ['Londra'],
 'IT_World': ['Belgrado']}

Is there a way to invert the columns of the object before koining with '_'? So it would become DE_Italy And not Italy_DE — Roberto Bertinetti, Apr 19 '18 at 15:23
Yes, several ways. Simplest way is to use `unstack` instead of `stack` — piRSquared, Apr 19 '18 at 15:45

score 1 · Answer 4 · answered Apr 19 '18 at 14:05

Just another approach with dict comprehension:

accts = df['Account']

langs = [col for col in df.columns if col != 'Account']

result = {'{}_{}'.format(lang, acct): df.loc[df['Account']==acct, lang].tolist() \
          for lang in langs for acct in accts}

Slice a dataframe based on conditions of column names and values

4 Answers4