I have a data frame with multiple columns. I get it from pytesseract.image_to_data(img_pl, lang="eng", output_type='data.frame', config='--psm 11')
(psm 11 and 12 give the same result) and keep only the important columns. Let's look at the following columns:
import pandas as pd

# This is the data I get from the above command.
# I added it in this form so you can copy it and test with it.
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
output_test_min = pd.DataFrame(data)
# Output:
+----+---+-----+------+-------+
|left|top|width|height|   text|
+----+---+-----+------+-------+
| 154|  0|  576|    89|  text1|
| 154|  3|  168|    10|  text2|
| 200|  3|  162|    10|  text3|
| 154|  7|  168|    10|  text4|
| 201|  8|  155|    10|  text5|
| 199| 12|  157|    10|  text6|
+----+---+-----+------+-------+
Notice that some of the coordinates are off by a few pixels (from what I saw, at most 3-5 pixels). That is why the width can also be taken into account: for example, the left of "abc" and "abcdef" will be different, but with the width we can see that they end at the same position.
The expected result is as below:
+-----+-------+-------+
|index| col 01| col 02|
+-----+-------+-------+
|    0|  text1|       |
|    1|  text2|  text3|
|    2|  text4|  text5|
|    3|       |  text6|
+-----+-------+-------+
The best result I get is from this:
output_test_min_agg = output_test_min.sort_values('top', ascending=True)
output_test_min_agg = output_test_min_agg.groupby(['top', 'left'], sort=False)['text'].sum().unstack('left')
output_test_min_agg.reindex(sorted(output_test_min_agg.columns), axis=1).dropna(how='all')
But it's still not good, because if the top or left values differ by even 1 pixel, it creates a whole new row and column for them.
How can I accomplish such a task?
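
For illustration, here is a minimal sketch of one possible direction, not a definitive solution: instead of grouping on the exact top and left values, sort each coordinate and start a new group whenever the gap to the previous value exceeds a tolerance. The names cluster_ids, to_grid, row_tol and col_tol are hypothetical, and the tolerance of 2 pixels is an assumption picked to fit the sample data above (its rows are only 3-4 pixels apart, so a tolerance of 3-5 would merge them). Real OCR output would likely need a tolerance derived from the text height.

```python
import pandas as pd

# Sample data copied from the question.
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
sample = pd.DataFrame(data)


def cluster_ids(values: pd.Series, tol: int) -> pd.Series:
    """Hypothetical helper: give nearby values the same group id.

    Values are sorted; a new group starts whenever the gap to the
    previous sorted value is larger than `tol` pixels.
    """
    ordered = values.sort_values(kind="stable")
    ids = ordered.diff().gt(tol).cumsum()
    # Put the ids back in the original row order.
    return ids.reindex(values.index)


def to_grid(df: pd.DataFrame, row_tol: int = 2, col_tol: int = 2) -> pd.DataFrame:
    """Hypothetical helper: arrange the texts in a row/column grid."""
    tagged = df.assign(
        row=cluster_ids(df["top"], row_tol),    # assumed row tolerance
        col=cluster_ids(df["left"], col_tol),   # assumed column tolerance
    )
    # Join texts that fall into the same cell, then spread columns out.
    cells = tagged.groupby(["row", "col"])["text"].agg(lambda s: " ".join(s))
    return cells.unstack("col", fill_value="")


grid = to_grid(sample)
print(grid)
```

On the sample data this groups the tops into four rows and the lefts into two columns. The same gap-based grouping could also be applied to left + width if the right edge turns out to be the more stable coordinate.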