My dataframe has a column called filename that indicates which file each row belongs to. I have also created another column that counts the total number of rows in each file.

I have extracted the ymin and ymax values and concatenated them into a list, which gives a 2D list:

y = [[4, 43], [9, 47], [76, 122], [30, 74], [10, 47], [81, 125], [84, 124], [47, 90], [1, 38],[2, 40], [2, 44], [4, 48], [5, 48], [6, 44], [8, 45], [75, 116], [73, 123], [28, 73], [39, 84], [84, 121], [2, 39],...]

However, this puts all the coordinates into one list, without recording which pairs belong to the first file and which belong to the second.

My approach is to build a 3D list like this:

y = [[[4, 43], [9, 47], [76, 122], [30, 74], [10, 47], [81, 125], [84, 124], [47, 90], [1, 38]],[[2, 40], [2, 44], [4, 48], [5, 48], [6, 44], [8, 45], [75, 116], [73, 123], [28, 73], [39, 84], [84, 121], [2, 39]],...]

Here the pairs from [4, 43] to [1, 38] all belong to one file, and the pairs from [2, 40] to [2, 39] all belong to the next file.
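To make the target shape concrete, here is a minimal sketch of the flat-to-nested split. It assumes the number of rows per file is already known as one entry per file; the names `flat_y`, `rows_per_file` and `split_by_sizes` are hypothetical and only serve the illustration.

```python
from itertools import islice

# Flat list of [ymin, ymax] pairs, copied from the example above.
flat_y = [[4, 43], [9, 47], [76, 122], [30, 74], [10, 47], [81, 125],
          [84, 124], [47, 90], [1, 38], [2, 40], [2, 44], [4, 48],
          [5, 48], [6, 44], [8, 45], [75, 116], [73, 123], [28, 73],
          [39, 84], [84, 121], [2, 39]]

# Hypothetical: number of rows in each file, one entry per file.
rows_per_file = [9, 12]


def split_by_sizes(pairs, sizes):
    """Split `pairs` into consecutive groups whose lengths are given by `sizes`."""
    it = iter(pairs)
    # Each islice call consumes the next `size` items from the shared iterator,
    # so every group starts where the previous one ended.
    return [list(islice(it, size)) for size in sizes]


nested_y = split_by_sizes(flat_y, rows_per_file)
```

Because the groups are consumed from a single iterator, the starting position of each group is the running total of the earlier sizes rather than the size of the previous group alone.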

Here is my current attempt:

def get_y_coordinate(y, count):
    """
    Create a 3D list of y coordinates in which each inner list holds the [ymin, ymax] pairs of one file.
    :param y: flat list of coordinates, e.g. [4, 43, 9, 47, ...]
    :param count: the list taken from the "count" column of the dataframe
    """
    c = 2  # Size of each chunk (one [ymin, ymax] pair)
    fi_y = lambda y, c: [y[i:i+c] for i in range(0, len(y), c)]  # Split y into chunks of 2, e.g. [4, 43, 9, 47] -> [[4, 43], [9, 47]]
    y = fi_y(y, c)  # Now y is [[4, 43], [9, 47], ...]

    # This is my current approach, I create a new list called bigy. 

    bigy = []   
    current = 0 
    for i in count:
        if current != i:
            bigy.append([y[j] for j in range(current, i+current)])
        current = i
        
    return bigy
>> bigy = [[[4, 43], [9, 47], [76, 122], [30, 74], [10, 47], [81, 125], [84, 124], [47, 90], [1, 38]],[[2, 40], [2, 44], [4, 48], [5, 48], [6, 44], [8, 45], [75, 116], [73, 123], [28, 73], [39, 84], [84, 121], [2, 39]],...]

This gives the expected result for the first few hundred files, but from around file 700 onward it no longer works. I need another approach to this problem, if anyone is patient enough to read this far and help me out. Thank you very much!

1 Answer

I think my first inclination would be to just iterate over the dataframe and collect results in a defaultdict. Maybe something like:

import collections
import pandas

mock_data = pandas.DataFrame([
    {"Name": "product_name", "ymin": 4, "ymax": 43},
    {"Name": "product_name", "ymin": 9, "ymax": 47},
    {"Name": "product_total_money", "ymin": 76, "ymax": 122},
    {"Name": "vat", "ymin": 30, "ymax": 74},
    {"Name": "product_name", "ymin": 10, "ymax": 47}
])

y_results = collections.defaultdict(list)
for _, row in mock_data.iterrows():
    y_results[row["Name"]].append((row["ymin"], row["ymax"]))

print(y_results)
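The same idea can be keyed on the file instead of the label, which seems closer to the nested list the question asks for. This is only a sketch: the `filename` column name is taken from the question's description, and the sample rows and file names are made up.

```python
import collections
import pandas

# Made-up rows; only the column names follow the question.
rows = pandas.DataFrame([
    {"filename": "a.jpg", "ymin": 4, "ymax": 43},
    {"filename": "a.jpg", "ymin": 9, "ymax": 47},
    {"filename": "b.jpg", "ymin": 2, "ymax": 40},
    {"filename": "b.jpg", "ymin": 2, "ymax": 44},
    {"filename": "b.jpg", "ymin": 4, "ymax": 48},
])

y_by_file = collections.defaultdict(list)
for filename, ymin, ymax in zip(rows["filename"], rows["ymin"], rows["ymax"]):
    # int() turns the NumPy integers into plain Python ints.
    y_by_file[filename].append([int(ymin), int(ymax)])

# Dicts keep insertion order, so files appear in order of first occurrence.
nested = list(y_by_file.values())
print(nested)
```

Grouping on the key itself avoids any index arithmetic, so it does not matter how many rows each file has or whether two neighbouring files happen to have the same number of rows.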

Alternatively, you might also try:

mock_data.groupby('Name').agg(list)
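If the goal is the nested list from the question rather than a dataframe of lists, a groupby over the file column might look like the sketch below. Again, the `filename` column name comes from the question's description and the sample rows are invented; `sort=False` is used so that groups come out in order of first appearance instead of sorted by key.

```python
import pandas

# Invented rows; "b.jpg" comes first on purpose to show that order is kept.
rows = pandas.DataFrame([
    {"filename": "b.jpg", "ymin": 4, "ymax": 43},
    {"filename": "b.jpg", "ymin": 9, "ymax": 47},
    {"filename": "a.jpg", "ymin": 2, "ymax": 40},
    {"filename": "a.jpg", "ymin": 2, "ymax": 44},
    {"filename": "a.jpg", "ymin": 4, "ymax": 48},
])

nested = [
    # One list of [ymin, ymax] pairs per file, as plain Python ints.
    group[["ymin", "ymax"]].to_numpy().tolist()
    for _, group in rows.groupby("filename", sort=False)
]
print(nested)
```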
JonSG