
Where related data is split across different dfs (held within a list), is it possible to either:

a) avoid/pre-empt this when using array_split, OR
b) iterate over the dfs afterwards, looking for stragglers, and put them together in one of the split/created dfs?

In an ideal solution, the number of dataframes/numberOfChunks must remain the same and some form of redistribution would occur. Alternatively, the splitting process would never split at a row where the partnerID occurs more than once in the original table; that would solve the problem before it even occurs, but it has been beyond me thus far. Uneven distribution between the df chunks is then expected/acceptable, but the number of chunks must remain consistent.

Background/explanation notes: the average df size is tiny, never more than 2k rows. The array_split/numberOfChunks value changes every time, in the range 1-15 (and obviously dictates the list length). I've avoided using .sample() here for the same reason: I must keep like partnerIDs together. If there is another way to achieve this, then any method is on the table; there are no restrictions on additional library use, helper columns, etc. Any partnerID, single or paired, can be in any of the dfs, provided that any matching partnerIDs end up in the same df. A higher ratio of unique partnerIDs is the norm; rows with non-unique partnerIDs are generally < 40%.

Example

(Example changed based on feedback from tevemadar, Feb 25 at 8:05.)

import numpy as np
import pandas as pd

# Create some sample data: each random partnerID is repeated three times
dfSampleUnOrdered = pd.DataFrame(np.random.randint(0, 50, size=(10, 2)), columns=['A', 'partnerID'])
dfSampleUnOrdered = dfSampleUnOrdered.loc[dfSampleUnOrdered.index.repeat(3)].reset_index(drop=True)

# Add some example rows with unique partnerIDs, so we have a few in the sample data.
# These are portable between any resultant df; it doesn't matter where they end up,
# as they are all unique, with no partners.
newSinglesRowsList = [{'A': 51, 'partnerID': 100}, {'A': 52, 'partnerID': 101}, {'A': 53, 'partnerID': 103},
                      {'A': 0, 'partnerID': 1}, {'A': 1, 'partnerID': 2}, {'A': 2, 'partnerID': 3}, {'A': 3, 'partnerID': 9}]
# DataFrame.append is deprecated (removed in pandas 2.0), so use pd.concat instead
dfSampleUnOrdered = pd.concat([dfSampleUnOrdered, pd.DataFrame(newSinglesRowsList)], ignore_index=True)


numberOfChunks = 3  # Example; in reality this changes.
dfSampleOrdered = dfSampleUnOrdered.sort_values(by=['partnerID'])

# The resultant dfs are stored in a list (my live data is set up in this manner)
dfSamplesList = np.array_split(dfSampleOrdered, numberOfChunks)
for dfChunk in dfSamplesList:
    print("{}\n".format(dfChunk))
    
    

# At this point, any like-partnerID rows need to be brought back together. But how...

In the example the sampling is random, but matching partnerIDs often end up split over the dfs. How can, for example, the rows with partnerID 9 be brought back together?

Here, partnerID 9 would need to be moved into the 2nd df.
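
Because dfSampleOrdered is sorted by partnerID before splitting, a group can only be broken across neighbouring chunks, so the straddles can be spotted by comparing chunk edges. A minimal check (my own sketch, continuing from the example above):

# Compare each chunk's last partnerID with the next chunk's first one;
# with sorted input, these edge matches are the only possible straddles.
for i in range(len(dfSamplesList) - 1):
    lastID = dfSamplesList[i]['partnerID'].iloc[-1]
    firstID = dfSamplesList[i + 1]['partnerID'].iloc[0]
    if lastID == firstID:
        print("partnerID {} straddles chunks {} and {}".format(lastID, i, i + 1))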

Considered solutions:

  1. Given that the number of chunks/splits is known, find that number of candidate split points. Get the partnerIDs at each of those index points and check that, on a count, each occurs only once in the df (therefore no multiples/matches), then split there. If any count is greater than one, find another set of split points within the original df.
  2. Find the approximate len/shape[0] of the intended chunked dfs. Slice the incoming complete df at that point, check the last row to see whether there are any remaining matching partnerIDs, and if so bring them together. Start creating the next df, and repeat until all of them minus one are in place; whatever is left becomes the final one.
  3. Add a helper column via a groupby cumcount/size on partnerID, and somehow use it to define allowable split points in the original df, i.e. if the count is > 1, don't split there, as there is a matching row somewhere else. Similar to this type of suggestion; a sketch of this idea is shown below the list.
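
As a sketch of idea 3's intent (my own illustration on the sample data above, with dfSamplesListSafe as a made-up name): split the list of unique partnerIDs rather than the rows themselves. A boundary can then never land inside a group; chunk sizes become uneven, but the number of chunks stays at numberOfChunks, as required.

# Split the sorted unique IDs, not the rows; a chunk boundary can then
# never fall inside a partnerID group.
uniqueIDs = dfSampleOrdered['partnerID'].unique()  # already in sorted order
idChunks = np.array_split(uniqueIDs, numberOfChunks)
dfSamplesListSafe = [dfSampleOrdered[dfSampleOrdered['partnerID'].isin(ids)]
                     for ids in idChunks]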
  • That example is a bit engineered. If I make it blue,blue,fuchsia,fuchsia,red,red,white,white instead, it suddenly becomes impossible. It would be important for you to decide the ratio of such keep-together rows compared to the size of the entire dataset. If it's high, you may want to simply split along these boundaries and accept that the initial example would get split into 5 (blue-blue, fuchsia-fuchsia, orange, red-red, white). – tevemadar Feb 25 '22 at 08:05
  • Thanks @tevemadar. I've added an alternative example. It can now be run to give varying scenarios; sometimes no changes are needed, other times the like partnerIDs will span more than one df. – John Mar 02 '22 at 15:00

1 Answer


It's clunky, but in the time since posting I've come up with the following as a possible solution: walk the chunks in order and, because the data was sorted before splitting, pull any stragglers sitting at the front of the next chunk back into the current one. Perhaps someone can suggest some optimisation/improvements for this currently unpythonic, brute-force solution.

dfSamplesListFixed = [dfChunk.copy() for dfChunk in dfSamplesList]

# Walk adjacent pairs of chunks. Because the frame was sorted by partnerID
# before splitting, stragglers for the current chunk can only sit at the
# front of the next one (this assumes a partnerID group never spans more
# than two chunks, i.e. chunks are much larger than groups, as in the live data).
for i in range(len(dfSamplesListFixed) - 1):
    dfChunk = dfSamplesListFixed[i]
    nextChunk = dfSamplesListFixed[i + 1]
    if dfChunk.empty or nextChunk.empty:  # array_split can produce empty chunks
        continue

    # Does the last row in the current df have any partners stranded in the next df?
    lastPartnerID = dfChunk['partnerID'].iloc[-1]
    nextChunkStrandedPartners = nextChunk[nextChunk['partnerID'] == lastPartnerID]

    if nextChunkStrandedPartners.shape[0] > 0:  # if we found some
        print("Found some: {}".format(nextChunkStrandedPartners.shape[0]))
        # Pull the stragglers back into the current df and drop them from the next
        dfSamplesListFixed[i] = pd.concat([dfChunk, nextChunkStrandedPartners], ignore_index=True)
        dfSamplesListFixed[i + 1] = nextChunk[nextChunk['partnerID'] != lastPartnerID]

for dfChunkFixed in dfSamplesListFixed:
    print("{}\n".format(dfChunkFixed))
    print(dfChunkFixed.shape[0])
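
A quick sanity check (my addition, not part of the answer as posted) to confirm that no partnerID is spread over more than one chunk afterwards:

# Count how many chunks each partnerID appears in; all counts should be 1.
from collections import Counter

chunksPerID = Counter()
for dfChunkFixed in dfSamplesListFixed:
    for pid in dfChunkFixed['partnerID'].unique():
        chunksPerID[pid] += 1
assert all(n == 1 for n in chunksPerID.values()), "some partnerIDs are still split"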
  • Surprised thus far that no one has been able to suggest anything beyond the above. I would rather not mark this as the accepted solution, as I'm struggling to believe that this is the best approach. – John Mar 05 '22 at 12:22