I am trying to convert this code -
for item in data['item'].unique():
    response = process_item(item)  # returns List[Dict[Text, Optional[int]]]
    response = pd.DataFrame(response)
    response['item'] = item
    final_response = final_response.append(response)
to something like -
data = data[['item']].drop_duplicates().reset_index(drop=True)
final_response = data[['item']].apply(lambda x: process_item(x))
final_response['item'] = data['item']
The idea is to later use dask to parallelize the apply on the dataframe.
I tried returning a pd.DataFrame from process_item, but I get: ValueError: If using all scalar values, you must pass an index
response looks something like this -
     A        B      C
0  456  foo bar  123.0
How do I resolve the ValueError, and is my assumption correct that apply will append the output df from process_item to final_response?
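For reference, the error message itself can be reproduced in isolation. This is only a minimal sketch, assuming the object handed to pd.DataFrame is a single dict whose values are all scalars (the values are copied from the sample row above; whether process_item actually returns this shape inside apply is an assumption):

```python
import pandas as pd

# Assumption: one record holding only scalar values, mirroring the sample row.
record = {'A': 456, 'B': 'foo bar', 'C': 123.0}

try:
    pd.DataFrame(record)  # all scalars and no index, so pandas cannot infer the row count
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Two ways to build a one-row frame from the same record:
from_list = pd.DataFrame([record])            # a list of dicts gives one row per dict
with_index = pd.DataFrame(record, index=[0])  # or pass an explicit index
```

Both constructions give the same single row; a list of dicts is also the shape the loop version already passes to pd.DataFrame.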
EDIT: Added sample data
Wrapping output from process_item in pd.Series -
# output from process_item
{'A': [456, 789], 'B': ['foo bar', 'dog bar'], 'C': [123.0, 160.0]}
# printing output in pd.Series
A            [456, 789]
B    [foo bar, dog bar]
C        [123.0, 160.0]
# adding a new 'item' column
            A                   B               C item
0  [456, 789]  [foo bar, dog bar]  [123.0, 160.0]  bar
The below is from the first code snippet -
# output from process_item
{'A': [456, 789], 'B': ['foo bar', 'dog bar'], 'C': [123.0, 160.0]}
# output from process_item in pd.DataFrame
     A        B      C
0  456  foo bar  123.0
1  789  dog bar  160.0
# adding a new 'item' column
     A        B      C item
0  456  foo bar  123.0  bar
1  789  dog bar  160.0  bar
I need the item added as in the second example, with one row per list element.
EDIT(solved):
I was finally able to get this to work with some changes in the split_dataframe_rows function shared by @yugandhar.
1. Calculating max_split: it was taking the length of the value in the newly added 'item' column, which contained 'bar', so it evaluated to 3, whereas the other lists contained only two elements. I added a type check.
2. split_rows[column_selector].pop(0) was throwing an error for the 'item' column, saying the str object did not have a pop attribute. So I added a check to do this only if the value is a list, and otherwise just assign it. I tested with your updated solution as well, and it works fine. I am not sure why these issues did not come up on the Colab; maybe it is a difference in Python versions or something.
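The two checks can be sketched roughly as below. This is not @yugandhar's split_dataframe_rows; split_list_rows is a hypothetical stand-in written only to illustrate the type checks, and it indexes into each list instead of calling pop(0) so the input frame is left unmodified.

```python
import pandas as pd

def split_list_rows(df):
    """Hypothetical helper: expand list-valued cells into one row per element."""
    out_rows = []
    for _, row in df.iterrows():
        cells = row.to_dict()
        # Check 1: only list cells decide how many rows to emit. Without the
        # isinstance test, a string such as 'bar' would contribute len('bar') == 3.
        list_lengths = [len(v) for v in cells.values() if isinstance(v, list)]
        max_split = max(list_lengths, default=1)
        for i in range(max_split):
            new_row = {}
            for col, value in cells.items():
                if isinstance(value, list):
                    # Check 2: take elements only from lists; pad short lists with None.
                    new_row[col] = value[i] if i < len(value) else None
                else:
                    new_row[col] = value  # scalars such as 'item' are assigned as-is
            out_rows.append(new_row)
    return pd.DataFrame(out_rows, columns=df.columns)

# Sample data from the question: one row whose A, B, C cells are lists.
wide = pd.DataFrame({
    'A': [[456, 789]],
    'B': [['foo bar', 'dog bar']],
    'C': [[123.0, 160.0]],
    'item': ['bar'],
})
tall = split_list_rows(wide)
```

With the sample data this gives two rows, each carrying 'bar' in the item column.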
I tried explode, but it does not work for me either; I guess I am not using pandas 0.25. I will continue to look for better ways to do the split.
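For anyone on a newer pandas, the explode route might look like the sketch below. It assumes pandas 1.3 or later, since passing several columns to DataFrame.explode is not supported in 0.25, and it assumes the lists in each row all have the same length.

```python
import pandas as pd

# Same sample row as above: A, B and C hold equal-length lists.
wide = pd.DataFrame({
    'A': [[456, 789]],
    'B': [['foo bar', 'dog bar']],
    'C': [[123.0, 160.0]],
    'item': ['bar'],
})

# Explode the three list columns together; 'item' is repeated for each new row.
tall = wide.explode(['A', 'B', 'C'], ignore_index=True)
```

The exploded columns come back with object dtype, so a follow-up astype or infer_objects call may be needed if numeric dtypes matter.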