Create a dataframe - order based on text coordinates

Question

I have a data frame with multiple columns (I get it from pytesseract.image_to_data(img_pl,lang="eng", output_type='data.frame', config='--psm 11') [used psm 11 or 12, same result] and taking only the important columns from it), lets look on the following columns:

# This is the data I get from the above command,
# I added it like that so you will be able to copy and test it
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
output_test_min_agg = pd.DataFrame(data)
# Output:
+----+---+-----+------+-------+
|left|top|width|height|   text|
+----+---+-----+------+-------+
| 154|  0|  576|    89|  text1|
| 154|  3|  168|    10|  text2|
| 200|  3|  162|    10|  text3|
| 154|  7|  168|    10|  text4|
| 201|  8|  155|    10|  text5|
| 199| 12|  157|    10|  text6|
+----+---+-----+------+-------+

Notice that some of the coordinates are off by few pixels (from what I saw its maximum 3-5 pixels off) that is why the width can also be taken to account (for example the left of "abc" and "abcdef" will be different but with the width we can see that it reaches to the same size

Excepted result will be as below:

+-----+-------+-------+
|index| col 01| col 02|
+-----+-------+-------+
|    0|  text1|       |
|    1|  text2|  text3|
|    2|  text4|  text5|
|    3|       |  text6|
+-----+-------+-------+

The best result I get is from this:

output_test_min_agg=output_test_min.sort_values('top', ascending=True)
output_test_min_agg = output_test_min_agg.groupby(['top', 'left'], sort=False)['text'].sum().unstack('left')
output_test_min_agg.reindex(sorted(output_test_min_agg.columns), axis=1).dropna(how='all')

But it's still not good because if the top or left have even 1 pixel difference it will create a whole new column and row for them

How can I accomplish such a task?

are you going to explain how you got from the first table to the second or should we try to guess it? — Dr. Prof. Patrick, Jan 26 '23 at 12:29
That is my desired result, Im trying to get to it but 0 luck so far, I want to find a way to create the second table from the first, I've been on this for hours but I still have no luck in figuring it out — Dolev Mitz, Jan 26 '23 at 12:30
you should try to help us help you, i dont see any connection whatsoever between the two tables. why is the second one is the expected output — Dr. Prof. Patrick, Jan 26 '23 at 12:32
I added what I tried so far, and it is the expected output for me I need the table like that, and Im trying to figure out how to get to this result based on the data I have — Dolev Mitz, Jan 26 '23 at 12:34
While you trail code is beneficial, you should also explain the logic of the translation in words so we understand. — itprorh66, Jan 26 '23 at 14:27
@itprorh66 the first table is what I get from pytessrect like I wrote and according to the top-left position and width I can find out which cell is located where on the table, and I want to populate another dataframe (the second one) according to that so the cells and text will be in the same order as in the table in the image — Dolev Mitz, Jan 26 '23 at 16:04
Are you trying to achieve or use [this](https://i.stack.imgur.com/Qg5oV.png), which I found this [answer](https://stackoverflow.com/a/64532821/10452700). I read this [workaround](https://stackoverflow.com/a/54059166/10452700) to understand _top_ & _left_ parameters. it seems you are dealing with _Tesseract OCR: Text localization and detection_ but still, I don't get How you reflect the localization of detected text using table 1 to the expected table as output? — Mario, Jan 26 '23 at 18:11
@Mario if you'll look I wrote that my table is a result of a method of pytesseract which is the python library of tesseract OCR, and I know the meaning of each and every one of them (left top etc), what I want to do is convert between the two tables, I'm trying to convert the first table to the second (i.e building the new tables that holds only the text in the same order that the image holds the texts based on top left and width (I think this 2 are enough) — Dolev Mitz, Jan 26 '23 at 22:32
Does the index names or the column names important? or has anything to do with the actually `DataFrame` values? — Lidor Eliyahu Shelef, Jan 29 '23 at 08:38
Hi, no they are not important right now, generally the first row of the dataframe will be the header row — Dolev Mitz, Jan 29 '23 at 13:59
I changed the column names and added a little more details on the method I used (`image_to_data`) — Dolev Mitz, Jan 31 '23 at 10:50
I finally understand what's going on after looking up `pytesseract`. An image was scanned for text. We're provided each text segment, the position in pixels relative to the top and left of the image, as well as the width and height of the text in pixels. It appears the text is in the shape of a the "expected result" table. Text 1,2, and 4 are aligned to the left. Text 3, 5, and 6 are aligned to the right. Text 1 is on its own row, 2 and 3 are on the next row, 4 and 5 on the following row, and text 6 on the last row. Pixels are off by about 1 for rows, up to 5 for columns. — amance, Jan 31 '23 at 21:55

score 5 · Accepted Answer · answered Feb 01 '23 at 09:10

I accomplished it by doing the following:

I made 3 functions for each purpose

1) Using your dummy data:

import pandas as pd
import numpy as np
# Create a dictionary of data for the DataFrame
data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}
# Create the DataFrame
df = pd.DataFrame(data)

2) Creating a function, using the code you supply + adding to it handling of the `NaN` values

def optimizeDf(df: pd.DataFrame) -> pd.DataFrame:
    df['left+width'] = df['left'] + df['width']
    df = df.sort_values(by=['top'], ascending=True)
    df = df.groupby(['top', 'left+width'], sort=False)['text'].sum().unstack('left+width')
    df = df.reindex(sorted(df.columns), axis=1).dropna(how='all').dropna(axis='columns', how='all')
    df = df.fillna('')
    return df
df = optimize_df(df)

3) Creating a function to merge the columns based on the name threshold similarity:

def mergeDfColumns(old_df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    new_columns = {}
    old_columns = old_df.columns
    i = 0
    while i < len(old_columns) - 1:
        if any(old_columns[i+1] == old_columns[i] + x for x in range(1, threshold)):
            new_col = old_df[old_columns[i]] + old_df[old_columns[i+1]]
            new_columns[old_columns[i+1]] = new_col
            i += 1
        else:
            new_columns[old_columns[i]] = old_df[old_columns[i]]
        i += 1
    new_columns[old_columns[i]] = old_df[old_columns[i]]
    return pd.DataFrame.from_dict(new_columns).replace('', np.nan).dropna(axis='columns', how='all').fillna('')
df = mergeDfColumns(df)

4) Creating a function to merge the rows based on the name threshold similarity:

def mergeDfRows(old_df: pd.DataFrame, threshold: int = 2) -> pd.DataFrame:
    new_df = old_df.iloc[:1]
    for i in range(1, len(old_df)):
        if abs(old_df.index[i] - old_df.index[i - 1]) < threshold:
            new_df.iloc[-1] = new_df.iloc[-1] + old_df.iloc[i]
        else:
            new_df = new_df.append(old_df.iloc[i])
    return new_df.reset_index(drop=True)
df = mergeDfRows(df)

The end result will be as follows:

+-+-----+-----+-----+
| |  322|  362|  730|
+-+-----+-----+-----+
|0|     |     |text1|
|1|text2|text3|     |
|2|text4|text5|     |
|3|     |text6|     |
+-+-----+-----+-----+

That is the best result I got from your dummy data, but please notice how text1 gets it's own row and column, it's because of the data if you'll look you'll see it's width and height are huge compare to the others, what I think is that your table in the image have some sort of a title that is really close to it and pytesseract recognized it as part of the table, my suggestions to you is to try some other config options or use some deep learning in order to classify your table better.

Laurent B. · Answer 2 · 2023-02-04T12:42:46.127

0

import pandas as pd

data = {'left': [154, 154, 200, 154, 201, 199],
        'top': [0, 3, 3, 7, 8, 12],
        'width': [576, 168, 162, 168, 155, 157],
        'height': [89, 10, 10, 10, 10, 10],
        'text': ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']}

df = pd.DataFrame(data)

df1 = pd.DataFrame({'col 01': df['text'][[0,1,3]]}).reset_index(drop=True)
df2 = pd.DataFrame({'col 02': df['text'][[2,4,5]]}).reset_index(drop=True)
df2.index+=1

print(df1.join(df2, how='outer').fillna(''))

#   col 01 col 02
# 0  text1       
# 1  text2  text3
# 2  text4  text5
# 3         text6

edited Feb 04 '23 at 12:42

answered Feb 02 '23 at 20:33

Laurent B.

1,653
1
7
16

Hey, thank you but your answer is for this very specific example, I wanted something more flexible – Dolev Mitz Feb 06 '23 at 04:41