I would like to find and extract all neighbouring text boxes from a pandas DataFrame: they must be on the same line, and the horizontal gap between consecutive boxes must be less than 10 (x of the next box minus x2 of the previous box < 10). x, y, x2, y2 are the coordinates of the bounding box that contains the text. The text can be different each time (string, float, int, ...).
In my example, I want to extract 'Amount' and 'VAT' at index 70 and 71: they are on the same line (16), and the distance 'VAT'[x] - 'Amount'[x2] = 2031 - 2022 = 9 is less than 10.
    line    text     x     y    x2    y2
29    11  Amount  2184  1140  2311  1166
51    14  Amount  1532  1450  1660  1476
66    15  Amount  1893  1500  2021  1527
70    16  Amount  1893  1551  2022  1578
71    16     VAT  2031  1550  2121  1578
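For reproducibility, the sample above can be rebuilt like this. find_labels is the name of my DataFrame; the first, unnamed column of the printout is the index.

```python
import pandas as pd

# Sample data copied from the table above; the unnamed first column is the index.
find_labels = pd.DataFrame(
    {
        "line": [11, 14, 15, 16, 16],
        "text": ["Amount", "Amount", "Amount", "Amount", "VAT"],
        "x": [2184, 1532, 1893, 1893, 2031],
        "y": [1140, 1450, 1500, 1551, 1550],
        "x2": [2311, 1660, 2021, 2022, 2121],
        "y2": [1166, 1476, 1527, 1578, 1578],
    },
    index=[29, 51, 66, 70, 71],
)

print(find_labels)
```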
Final result must be:
    line    text     x     y    x2    y2
70    16  Amount  1893  1551  2022  1578
71    16     VAT  2031  1550  2121  1578
The extraction should work for two or more texts on the same line, as long as each gap (x of the next box minus x2 of the previous box) is less than 10. Another expected result, with 3 values:
    line    text     x     y    x2    y2
5     16   Total  1755  1551  1884  1578
8     16  Amount  1893  1551  2022  1578
20    16     VAT  2031  1550  2121  1578
I found a way to flag the rows that share a line with at least one other row:
same_line = find_labels['line'].map(find_labels['line'].value_counts() > 1)
I then tried to find the neighbouring values where x - x2 < 10, but I don't know how to do this. I tried writing a loop and using .cov(), but neither worked.
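To make the rule concrete, this is roughly the logic I am after. It is only a rough sketch: it assumes the boxes on a line should be compared in left-to-right order (sorted by x), and MAX_GAP, prev_x2 and next_x are names I made up for illustration. The data is the three-value example plus two unrelated rows from the first table.

```python
import pandas as pd

# Rows 5, 8, 20 are the three-value example; rows 29 and 51 come from the first table.
find_labels = pd.DataFrame(
    {
        "line": [16, 16, 16, 11, 14],
        "text": ["Total", "Amount", "VAT", "Amount", "Amount"],
        "x": [1755, 1893, 2031, 2184, 1532],
        "y": [1551, 1551, 1550, 1140, 1450],
        "x2": [1884, 2022, 2121, 2311, 1660],
        "y2": [1578, 1578, 1578, 1166, 1476],
    },
    index=[5, 8, 20, 29, 51],
)

MAX_GAP = 10  # hypothetical name for the distance threshold

# Assumption: neighbours are the boxes next to each other when read left to right.
ordered = find_labels.sort_values(["line", "x"])

# Right edge of the previous box and left edge of the next box on the same line
# (NaN for the first / last box of each line).
prev_x2 = ordered.groupby("line")["x2"].shift(1)
next_x = ordered.groupby("line")["x"].shift(-1)

# Comparisons against NaN evaluate to False, so isolated boxes are dropped.
close_to_prev = (ordered["x"] - prev_x2) < MAX_GAP
close_to_next = (next_x - ordered["x2"]) < MAX_GAP

result = ordered[close_to_prev | close_to_next]
print(result)
```

I am not sure this is the idiomatic way, or that it covers every case: for example, overlapping boxes give a negative gap, which also passes the < 10 test.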
Can someone help me?
Thanks for your help