
I am trying to convert this code -

    for item in data['item'].unique():
        response = process_item(item)  # returns List[Dict[Text, Optional[int]]]
        response = pd.DataFrame(response)
        response['item'] = item
        final_response = final_response.append(response)

to something like -

    data = data[['item']].drop_duplicates().reset_index(drop=True)
    final_response = data[['item']].apply(lambda x: process_item(x))
    final_response['item'] = data['item']

The idea is to later use dask to parallel process the apply on the dataframe.

I tried returning a `pd.DataFrame` from `process_item`, but I get `ValueError: If using all scalar values, you must pass an index`.

response looks something like this -

       A        B      C
    0  456  foo bar  123.0

How do I resolve the `ValueError`, and is my assumption correct that `apply` will append the output DataFrame from `process_item` to `final_response`?
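The `ValueError` above can be reproduced in isolation: it is raised whenever `pd.DataFrame` receives a dict of bare scalars, because pandas cannot infer how many rows to build. A minimal sketch (the sample values mirror the row shown above):

```python
import pandas as pd

# A dict of scalars has no length, so pandas cannot infer the number
# of rows -- this raises the ValueError from the question.
try:
    pd.DataFrame({'A': 456, 'B': 'foo bar', 'C': 123.0})
except ValueError as e:
    print(e)  # If using all scalar values, you must pass an index

# Fix 1: pass an index explicitly
df1 = pd.DataFrame({'A': 456, 'B': 'foo bar', 'C': 123.0}, index=[0])

# Fix 2: wrap each scalar in a one-element list
df2 = pd.DataFrame({'A': [456], 'B': ['foo bar'], 'C': [123.0]})
```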

EDIT: Added sample data

Wrapping the output from `process_item` in a `pd.Series` -

    # output from process_item
    {'A': [456, 789], 'B': ['foo bar', 'dog bar'], 'C': [123.0, 160.0]}

    # printing the output in pd.Series
    A            [456, 789]
    B    [foo bar, dog bar]
    C        [123.0, 160.0]

    # Adding a new 'item' column
                A                   B               C item
    0  [456, 789]  [foo bar, dog bar]  [123.0, 160.0]  bar

The below is from the first code snippet -

    # output from process_item
    {'A': [456, 789], 'B': ['foo bar', 'dog bar'], 'C': [123.0, 160.0]}

    # output from process_item in pd.DataFrame
         A        B      C
    0  456  foo bar  123.0
    1  789  dog bar  160.0

    # Adding a new 'item' column
         A        B      C item
    0  456  foo bar  123.0  bar
    1  789  dog bar  160.0  bar

I need the item added as per the second example.

EDIT (solved): I was finally able to get this to work with two changes in the `split_dataframe_rows` function shared by @yugandhar:

1. Calculating `max_split` - it was taking the length of the newly added `'item'` column, which held the string `'bar'`, so it evaluated to 3 while the list columns held only two elements; I added a type check.
2. `split_rows[column_selector].pop(0)` was throwing an error for the `'item'` column saying a `str` object has no `pop` attribute, so I added a check to only do this for lists and otherwise just assign.

Tested with your updated solution as well and it works fine. Not sure why these issues did not come up on the Colab; maybe a difference in Python versions or something. I tried `explode`, but it does not work for me either - I guess I am not on pandas 0.25. I will continue to look for better ways to do the split.
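The two fixes above amount to a splitter that only indexes into cells that actually hold lists and repeats scalar cells (like `'item'`) on every emitted row. A rough reconstruction of that idea, not @yugandhar's exact `split_dataframe_rows`:

```python
import pandas as pd

def split_list_columns(df, list_cols):
    """Expand rows whose cells in `list_cols` hold equal-length lists;
    scalar columns (e.g. 'item') are repeated on every emitted row."""
    rows = []
    for _, row in df.iterrows():
        # take the row count from a known list column, not from a
        # scalar like 'bar' (whose len() would be 3)
        n = len(row[list_cols[0]])
        for i in range(n):
            # only index into list cells; assign scalars as-is
            rows.append({col: row[col][i] if col in list_cols else row[col]
                         for col in df.columns})
    return pd.DataFrame(rows, columns=df.columns)

wide = pd.DataFrame({'A': [[456, 789]],
                     'B': [['foo bar', 'dog bar']],
                     'C': [[123.0, 160.0]],
                     'item': ['bar']})
tall = split_list_columns(wide, ['A', 'B', 'C'])
```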

rxtechsbay
  • [Never call DataFrame.append or pd.concat inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Oct 08 '19 at 22:57

2 Answers


If I understand it correctly, then you need to make the following changes:

- return a `pd.Series` instead of a `pd.DataFrame`;
- use `data['item']` to get the column's values (this is what you need for `apply`);
- use `data[['item']]` to get a DataFrame with the index and the `item` column.

Working Solution
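A minimal sketch of the approach (the linked notebook has the full version); `process_item` here is a stand-in that returns the sample data from the question:

```python
import pandas as pd

def process_item(item):
    # stand-in for the real process_item from the question
    return pd.Series({'A': [456, 789],
                      'B': ['foo bar', 'dog bar'],
                      'C': [123.0, 160.0]})

data = pd.DataFrame({'item': ['bar', 'bar', 'foo']})
data = data[['item']].drop_duplicates().reset_index(drop=True)

# apply on the Series data['item'] passes each value to process_item;
# because process_item returns a pd.Series, apply assembles the
# results into a DataFrame with columns A, B, C
final_response = data['item'].apply(process_item)
final_response['item'] = data['item']
```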

Yugandhar
  • Thanks @yugandhar, this works perfectly for a scalar list, but the output I am getting from process_item is a List of Dicts something like this - ```List[Dict[Text, Optional[int]]]``` Sorry I should have been more clear about the data. Updated the question. – rxtechsbay Oct 08 '19 at 21:41
  • @rxtechsbay Check this [Updated Solution](https://colab.research.google.com/gist/Yugandhartripathi/843bd828eebe969f2e3f0089b03249dc/so-solution.ipynb) picked up the splitting function from [here](https://gist.github.com/jlln/338b4b0b55bd6984f883) because Pandas df.explode function was not working and idk why. Does this solve your problem? – Yugandhar Oct 08 '19 at 22:40
  • I am getting this error ```TypeError: _split_list_to_rows() got an unexpected keyword argument 'axis'``` on this line - ```df.apply(_split_list_to_rows,axis=1,args = (new_rows,column_selectors))``` Not sure why. Thoughts? – rxtechsbay Oct 09 '19 at 18:33
  • @rxtechsbay you are probably passing a pd.Series as df argument in split_dataframe_rows and Series.apply() unlike DataFrame.apply() does not support an axis argument – Yugandhar Oct 09 '19 at 18:50
  • The last three lines in my code are the same as in your updated solution, only that I am calling a lambda function in apply that returns the response wrapped in a pd.Series. ```process_item(item) {return pd.Series({'A': [456, 789], 'B': ['foo bar', 'dog bar'], 'C': [123.0, 160.0]})}``` Looks like in your case, final_response seems to be seen by python as a df, whereas in my case it is being seen as a series. – rxtechsbay Oct 09 '19 at 19:13
  • @rxtechsbay can you share a colab link or something of the code I don't know why it's not working for you I even tried my code calling the function process_item you can check it out in Updated solution link and it's working just like before – Yugandhar Oct 09 '19 at 19:31
  • Also, if I wrap the final_response in a pd.DataFrame, I get this error - ```AttributeError: ("'str' object has no attribute 'pop'", u'occurred at index 0')``` – rxtechsbay Oct 09 '19 at 19:35
  • was able to finally get it working, added details with edit on the question. Thanks! – rxtechsbay Oct 09 '19 at 21:34
  • Nice, I did not account for variable sizes. – Yugandhar Oct 09 '19 at 21:37

Consider a list comprehension to build a list of data frames to be concatenated at the end:

dfs = [pd.DataFrame(process_item(i)).assign(item=i)
       for i in data['item'].unique()]

final_df = pd.concat(dfs, ignore_index=True)
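A runnable demo of the pattern, with a stub `process_item` that returns the question's `List[Dict[Text, Optional[int]]]` sample shape (the stub and the sample `data` are illustrative, not the OP's real code):

```python
import pandas as pd

def process_item(item):
    # stub: the real process_item returns List[Dict[Text, Optional[int]]]
    return [{'A': 456, 'B': 'foo bar', 'C': 123.0},
            {'A': 789, 'B': 'dog bar', 'C': 160.0}]

data = pd.DataFrame({'item': ['bar', 'foo', 'bar']})

# one small DataFrame per unique item, tagged with .assign(item=i),
# then a single concat at the end (no append inside the loop)
dfs = [pd.DataFrame(process_item(i)).assign(item=i)
       for i in data['item'].unique()]
final_df = pd.concat(dfs, ignore_index=True)
```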
Parfait