This is a follow-up to this question: Extract non-empty values from the regex array output in python

I have a DataFrame with columns "col" and "col1", both holding values of type 'numpy.ndarray'. It looks like this:

       col                         col1
   [[5, , , ,]]             [qwe,ret,der,po]
   [[, 4, , ,][, , 5, ]]       [fgk,hfrt]
        []                           []
   [[, , , 9]]                  [test]  

I want my output as:

      col  col1
       5  qwe,ret,der,po
       5  fgk,hfrt
       0  NOT FOUND 
       9  test

Please note that in column "col", the output for the second row is the maximum of its two entries (4 and 5). I tried the solution from the linked question, but it raises ValueError: "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

Thanks

Edit: Dictionary form of my DF with column "col":

  {'col': {0: array([['5', '', '', '', '', '']],
  dtype='|S1'), 1: array([], dtype=float64), 2: array([], dtype=float64), 3: array([], dtype=float64), 4: array([], dtype=float64), 5: array([['8', '', '', '', '', '']],
  dtype='|S1'), 6: array([], dtype=float64), 7: array([], dtype=float64), 8: array([], dtype=float64), 9: array([], dtype=float64), 10: array([], dtype=float64), 11: array([['', '8', '', '', '', '']],
  dtype='|S1'), 12: array([], dtype=float64), 13: array([], dtype=float64), 14: array([], dtype=float64), 15: array([['7', '', '', '', '', '']],
  dtype='|S1'), 16: array([], dtype=float64)}}
user4349490

1 Answer

Try the following:

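import pandas as pd


def parse_nested_max(xss):
    # Requires Python 3.4+: the `default` keyword of max() does not exist in Python 2.
    return max(
        # Per subarray: skip '' entries, convert the rest to int, take the max (0 if none).
        (max((int(x) for x in xs if x), default=0) for xs in xss),
        default=0  # an empty outer array yields 0
    )


# `df` is the DataFrame from the question.
df['col'] = df.col.apply(parse_nested_max)
# ','.join([]) is '', which is falsy, so empty arrays become 'NOT FOUND'.
df['col1'] = df.col1.apply(lambda s: ','.join(s) or 'NOT FOUND')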

This assumes that the first column holds two-dimensional arrays of strings and the second holds one-dimensional arrays of strings.

For the first column, do the following:

  1. For each subarray, drop the '' elements and convert the rest to int
  2. For each subarray, compute the max, with the convention that max([]) == 0
  3. This gives one integer per subarray, so simply take the max of those; use default=0 to handle an empty outer array, as in the third row of your df.
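To make the three steps concrete, here is a minimal sketch that runs them on plain nested lists. The lists are stand-ins for the arrays in "col" (an assumption made so the example runs without pandas), and it needs Python 3.4 or later because of default=0.

```python
def parse_nested_max(xss):
    # Steps 1 and 2: per subarray, drop '' entries, convert the rest to int,
    # and take the max (0 if nothing is left).
    row_maxima = [max((int(x) for x in xs if x), default=0) for xs in xss]
    # Step 3: max over the per-subarray maxima; default=0 covers an empty outer array.
    return max(row_maxima, default=0)


# Plain nested lists standing in for the values of "col" (illustrative data only).
rows = [
    [['5', '', '', '']],
    [['', '4', '', ''], ['', '', '5', '']],
    [],
    [['', '', '', '9']],
]
results = [parse_nested_max(r) for r in rows]
print(results)  # [5, 5, 0, 9]
```

The intermediate list row_maxima is only there to make step 3 visible; the one-expression version above behaves the same way.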

For the second column, exploit the fact that bool(','.join([])) == False.
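As a small illustration of that trick, the sketch below wraps it in a helper. The name join_or_default is hypothetical and only used here; the inputs are plain lists of str rather than arrays.

```python
def join_or_default(items, default='NOT FOUND'):
    # ','.join([]) is the empty string, which is falsy,
    # so `or` falls through to the default value.
    return ','.join(items) or default


print(join_or_default(['qwe', 'ret', 'der', 'po']))  # qwe,ret,der,po
print(join_or_default([]))                           # NOT FOUND
```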

Finally, a tip: you will get better feedback if your dataframe is easy to recreate. Try using df.to_dict() and embedding the output in your code where you define df.

hilberts_drinking_problem
  • I am getting an error while defining "parse_nested_max" function: "SyntaxError: Generator expression must be parenthesized if not sole argument". – user4349490 May 09 '16 at 14:41
  • I am getting an error when applying the parse_nested_max function to the df: "TypeError: max() got an unexpected keyword argument". I have edited the question to include the dictionary form of the df. Also, is it necessary to assume a 2-dim array of type string? Thanks – user4349490 May 09 '16 at 15:02
  • It appears that `max` does not have a default argument in Python 2; I assume that is what you are using? Also, creating the dataframe from the `dict` above results in 16 rows instead of 4. – hilberts_drinking_problem May 09 '16 at 15:08
  • Yes, I am using Python 2. This was my actual DF. Is there a way to get around this in Python 2? – user4349490 May 09 '16 at 16:59
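
One possible way around the missing default argument, sketched here rather than taken from the answer, is to flatten the entries first and handle the empty case explicitly. The name parse_nested_max_py2 is hypothetical. For non-negative values it should give the same result as the nested version, and it uses no syntax specific to Python 3.

```python
def parse_nested_max_py2(xss):
    # Collect every non-empty entry as an int, across all subarrays.
    values = [int(x) for xs in xss for x in xs if x]
    # max() of an empty list raises ValueError, so fall back to 0 explicitly.
    # `values` is a plain list, so testing its truth value is unambiguous.
    return max(values) if values else 0


# Plain nested lists standing in for the arrays in "col" (illustrative data only).
print(parse_nested_max_py2([['', '4', '', ''], ['', '', '5', '']]))  # 5
print(parse_nested_max_py2([]))                                      # 0
```

It would be applied in the same way as before, with df.col.apply(parse_nested_max_py2).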