3

I have a dataframe with languages as column names, and with 1x final column containing account names:

EN         DE         IT         Account
Milan      Mailand    Milano     Italy
Florence   Florenz    Firenze    Italy
London     London     Londra     UK
Belgrade   Belgrad    Belgrado   World

I would need to extract information from this database, creating all the possible lists based on a combination of column names (languages) and the values within the account column.

For example here the output would be:

EN_Italy = ['Milan', 'Florence']
DE_Italy = ['Mailand', 'Florenz']
IT_Italy = ['Milano', 'Firenze']
EN_UK = ['London']
DE_UK = ['London']
IT_UK = ['Londra']
EN_World = ['Belgrade']
DE_World = ['Belgrad']
IT_World = ['Belgrado']

Would it be possible to do so? Thank you!

rafaelc
  • 57,686
  • 15
  • 58
  • 82
Roberto Bertinetti
  • 555
  • 1
  • 4
  • 10
  • What have you tried so far? What is the difference between your question and this one for exemple? https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas – OriolAbril Apr 19 '18 at 13:54

4 Answers4

3

You can aggregate():

df = df.groupby("Account").aggregate(lambda k: list(k)).reset_index()

  Account            DE                 EN                 IT
0  Italy  [Mailand, Florenz]  [Milan, Florence]  [Milano, Firenze]
1     UK            [London]           [London]           [Londra]
2  World           [Belgrad]                [b]         [Belgrado]

To get your list, do a simple selection, e.g.

df[df.Account == "Italy"]["DE"]

0    [Mailand, Florenz]
rafaelc
  • 57,686
  • 15
  • 58
  • 82
3

For a variable number of variables, a dictionary is often a good choice.

You can use collections.defaultdict for this:

from collections import defaultdict

d = defaultdict(list)

for row in df.itertuples():
    for i in row._fields[1:-1]:
        d[i+'_'+row.Account].append(getattr(row, i))

Result

defaultdict(list,
            {'DE_Italy': ['Mailand', 'Florenz'],
             'DE_UK': ['London'],
             'DE_World': ['Belgrad'],
             'EN_Italy': ['Milan', 'Florence'],
             'EN_UK': ['London'],
             'EN_World': ['Belgrade'],
             'IT_Italy': ['Milano', 'Firenze'],
             'IT_UK': ['Londra'],
             'IT_World': ['Belgrado']})

Explanation

  • Initialise a defaultdict of lists.
  • Iterate rows of your dataframe, and for each row iterate language column names.
  • For the inner loop, append dictionary items as required.
jpp
  • 159,742
  • 34
  • 281
  • 339
3

Using stack

df.set_index('Account').unstack().groupby(level=[0, 1]).apply(list)

    Account
EN  Italy       [Milan, Florence]
    UK                   [London]
    World              [Belgrade]
DE  Italy      [Mailand, Florenz]
    UK                   [London]
    World               [Belgrad]
IT  Italy       [Milano, Firenze]
    UK                   [Londra]
    World              [Belgrado]
dtype: object


d = df.set_index('Account').ustack().groupby(level=[0, 1]).apply(list)
d.index = d.index.map('_'.join)

d

EN_Italy     [Milan, Florence]
EN_UK                 [London]
EN_World            [Belgrade]
DE_Italy    [Mailand, Florenz]
DE_UK                 [London]
DE_World             [Belgrad]
IT_Italy     [Milano, Firenze]
IT_UK                 [Londra]
IT_World            [Belgrado]
dtype: object

Or

d.to_dict()

{'DE_Italy': ['Mailand', 'Florenz'],
 'DE_UK': ['London'],
 'DE_World': ['Belgrad'],
 'EN_Italy': ['Milan', 'Florence'],
 'EN_UK': ['London'],
 'EN_World': ['Belgrade'],
 'IT_Italy': ['Milano', 'Firenze'],
 'IT_UK': ['Londra'],
 'IT_World': ['Belgrado']}
piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

Just another approach with dict comprehension:

accts = df['Account']

langs = [col for col in df.columns if col != 'Account']

result = {'{}_{}'.format(lang, acct): df.loc[df['Account']==acct, lang].tolist() \
          for lang in langs for acct in accts}
zipa
  • 27,316
  • 6
  • 40
  • 58