1

I have a pandas DataFrame that contains some array columns. What is the recommended way to index some of these columns by different position indices? For example, from the array column named l I need the second elements, from the array column named a I need the first elements. The results should be a new DataFrame. Array column can either contain a Python list or a Numpy array, but this probably does not matter.

I have three solutions, but I don't really like any of them.

df= pd.DataFrame({'l': [[1, 2, 4], [3, 2, 0, 10]], \
                  'a':[np.array(["foo", "bar", "baz"]), np.array(["qux", "quux"])], \
                  'dontcare': [10, 20]})
               l                a  dontcare
0      [1, 2, 4]  [foo, bar, baz]        10
1  [3, 2, 0, 10]      [qux, quux]        20

Solution 1, with str and join

df['l'].str[1].to_frame('l').join(df['a'].str[0])
   l    a
0  2  foo
1  2  qux

Solution 2, with the function apply and creating Series

df.apply(lambda row: pd.Series([row['l'][1], row['a'][0]], index=['l', 'a']), axis=1)

Solution 3, with apply and broadcast.

df[['l', 'a']].apply(lambda row: [row['l'][1], row['a'][0]], axis=1, result_type='broadcast')

We can assume that the output column names match the input column names and we dont need multiple elements of any array column.

Ferenc Bodon
  • 310
  • 4
  • 12

1 Answers1

1

I think it depends.

First solution is most general, working always if indices not exist - then returned NaNs. But it is also reason for slowiest solution if large DataFrame.

print (df['l'].str[3].to_frame('l').join(df['a'].str[2]))
      l    a
0   NaN  baz
1  10.0  NaN

Another solution with apply should be faster, but fail, if value not exist.

print (df.apply(lambda row: pd.Series([row['l'][3], row['a'][2]], index=['l', 'a']), axis=1))

IndexError: ('list index out of range', 'occurred at index 0')


Another idea if always exist values in lists is use list comprehension (but fail similar like apply, if not exist) with *c for tail:

df= pd.DataFrame({'l': [[1, 2, 4], [3, 2, 0, 10]], \
                  'a':[np.array(["foo", "bar", "baz"]), np.array(["qux", "quux"])], \
                  'dontcare': [10, 20],
                   's': [10, 20], 
                   'b': [10, 20]})
print (df)
               l                a  dontcare   s   b
0      [1, 2, 4]  [foo, bar, baz]        10  10  10
1  [3, 2, 0, 10]      [qux, quux]        20  20  20

df = pd.DataFrame([(a[1], b[0]) for a,b, *c in df.values], columns=['l', 'a'])
print (df)
   l    a
0  2  foo
1  2  qux

Or select column by list for processing:

df = pd.DataFrame([(a[1], b[0]) for a,b in df[['l','a']].values], columns=['l', 'a'])
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Thanks for your comment on indexing out-of-bound. Please change the code of your last idea, to avoid modifying the original DataFrame. `pd.DataFrame([(a[1], b[0]) for a,b,c in df.values], columns=['l', 'a'])`. Also, do you need to list all columns? My DataFrame may contain 100 columns. – Ferenc Bodon Nov 14 '19 at 08:37
  • 1
    @FerencBodon - If use `python 3`, then is possible use `*c` for tail - [link](https://stackoverflow.com/questions/10532473/head-and-tail-in-one-line), edited answer. – jezrael Nov 14 '19 at 08:40
  • Your solution is definitely worth listing here - it has lessons to learn. Nevertheless, we need to acknowledge that (1) you either repeat the column names three times or you introduce other names, `a`, `b`, `c`, (2) the `*c` based solution is sensitive to the column order, i.e. if tomorrow somebody changes the column order then the code will silently return false values. – Ferenc Bodon Nov 14 '19 at 09:27
  • 1
    @FerencBodon - hmm, also is not possible use `df = pd.DataFrame([(a[1], b[0]) for a,b in df[['l','a']].values], columns=['l', 'a'])` ? Because generally what is logic for selecting? Get all columns with lists and select e.g. first value of lists? – jezrael Nov 14 '19 at 09:29
  • you are right. I edited my post to make it more specific. Theoretically, we have a list of column name, index pairs, .e.g from column `l`, I am interested in the second element, for column `a` i need the first elements. – Ferenc Bodon Nov 14 '19 at 09:36
  • @FerencBodon - So input is `L = [('l','a'),('l1','a1'), ...]` ? And output in DataFrame with logic `[(a[1], b[0]) for a,b in df[['l','a']].values]` ? – jezrael Nov 14 '19 at 09:41