0

I have some python pandas dataframe like this one. It lists the position of labels in a printed table extracted with ocr. So each label position have a little offset.

     left  top       text
4     66   23  6/22/2021
6     66   82  6/23/2021
8     65  142  6/24/2021
10    65  202  6/25/2021
12    64  262  6/26/2021
16   345   25      14:00
18   354   85       7:30
20   344  145      13:00
22   344  206      11:00
24   343  265      10:00
26   343  325      15:00
30   859   23      20:30
32   860   84      14:00
34   858  144      20:23
36   859  204      18:00
38   858  264      13:00
40   858  324      18:15
44  1091   23       6:30
46  1091   84       6:30
48  1090  144       7:23
50  1090  204       7:00
52  1089  264       3:00
54  1089  324       3:15
56  1088  383       0:00
58  1087  443       0:00
60  1087  503       0:00
62  1047  563   33:38:00

I need to sort the data by the "left" column value, then set each group of values to a specific value.

In this case, the first five value [66,66,65,65,64] can be grouped together because they are in a narrow range (for instance [60...70]). The first five values will then be set to the min value of the range ([60...70], but it can be the range of the values [64...66]).

And so on, for each group of value, grouped by the fact that their values are in a narrow range.

The size of each group if random. In this case the last row have a "left" value of [1047]. It fits in a single value group.

The values are also random. I can't use this solution as far as I understand it : how to group by list ranges of value in python pandas

I will then do the same work for the second column "top".

What is the trick to do this ?

I know there is a mathematical way to do this. I can use it in some daw to "sharpen" a sound. But maybe there is a python pandas way to do this.

I'm not native english speaker. So I hope you understand me.

Thank you for your time

Edit:

What I want (but it can be the min value of each group, or here the first value of the group):

     left  top       text
4     66   23  6/22/2021
6     66   82  6/23/2021
8     66  142  6/24/2021
10    66  202  6/25/2021
12    66  262  6/26/2021
16   345   25      14:00
18   345   85       7:30
20   345  145      13:00
22   345  206      11:00
24   345  265      10:00
26   345  325      15:00
30   859   23      20:30
32   859   84      14:00
34   859  144      20:23
36   859  204      18:00
38   859  264      13:00
40   859  324      18:15
44  1091   23       6:30
46  1091   84       6:30
48  1091  144       7:23
50  1091  204       7:00
52  1091  264       3:00
54  1091  324       3:15
56  1091  383       0:00
58  1091  443       0:00
60  1091  503       0:00
62  1047  563   33:38:00
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
T H
  • 3
  • 2

1 Answers1

1

One way using pandas.Series.diff with cumsum trick to groupby.

Then use pandas.DataFrame.groupby.transform with "min":

df = df.sort_values("left")
ind1 = df["left"].diff().fillna(0).gt(10).cumsum()
df["left_min"] = df.groupby(ind1)["left"].transform("min")

df = df.sort_values("top")
ind2 = df["top"].diff().fillna(0).gt(10).cumsum()
df["top_min"] = df.groupby(ind2)["top"].transform("min")

print(df.sort_index())

Output:

    left  top       text  left_min  top_min
4     66   23  6/22/2021        64       23
6     66   82  6/23/2021        64       82
8     65  142  6/24/2021        64      142
10    65  202  6/25/2021        64      202
12    64  262  6/26/2021        64      262
16   345   25      14:00       343       23
18   354   85       7:30       343       82
20   344  145      13:00       343      142
22   344  206      11:00       343      202
24   343  265      10:00       343      262
26   343  325      15:00       343      324
30   859   23      20:30       858       23
32   860   84      14:00       858       82
34   858  144      20:23       858      142
36   859  204      18:00       858      202
38   858  264      13:00       858      262
40   858  324      18:15       858      324
44  1091   23       6:30      1087       23
46  1091   84       6:30      1087       82
48  1090  144       7:23      1087      142
50  1090  204       7:00      1087      202
52  1089  264       3:00      1087      262
54  1089  324       3:15      1087      324
56  1088  383       0:00      1087      383
58  1087  443       0:00      1087      443
60  1087  503       0:00      1087      503
62  1047  563   33:38:00      1047      563
Chris
  • 29,127
  • 3
  • 28
  • 51
  • Ok it seem to work well. But then I want for each group to set their value to the min one. – T H Dec 14 '21 at 00:54
  • @TH well the first group is of 64, 65, and 66, and the min one is 64, which is why `transform("min")`. But based on your edit, i suppose you want **max**? – Chris Dec 14 '21 at 00:56
  • Sorry. You are right. And you are a genius. Thank you a lot. – T H Dec 14 '21 at 01:01
  • @TH Great! Please fill free to mark the answer _accepted_ if it solved your problem. As a new comer, I suggest you take a look at [when someone answers](https://stackoverflow.com/help/someone-answers). – Chris Dec 14 '21 at 01:03