
I would like to find and extract all adjacent texts from a pandas DataFrame: texts that are on the same line and whose horizontal gap is less than 10 (the next text's x minus the previous text's x2 < 10). x, y, x2, y2 are the coordinates of the bounding box that contains the text. The texts can be different each time (string, float, int, ...).

In my example, I want to extract 'Amount' and 'VAT' (index 70 and 71): they are on the same line, and the distance 'VAT'[x] - 'Amount'[x2] is less than 10 (2031 - 2022 = 9).

    line    text    x       y       x2      y2
29  11      Amount  2184    1140    2311    1166
51  14      Amount  1532    1450    1660    1476
66  15      Amount  1893    1500    2021    1527
70  16      Amount  1893    1551    2022    1578
71  16      VAT     2031    1550    2121    1578

The final result must be:

    line    text    x       y       x2      y2
70  16      Amount  1893    1551    2022    1578
71  16      VAT     2031    1550    2121    1578

The extraction should also work for two or more texts on the same line (each text's x minus the previous text's x2 < 10). Another expected result, with 3 values:

    line    text    x       y       x2      y2
5   16      Total   1755    1551    1884    1578
8   16      Amount  1893    1551    2022    1578
20  16      VAT     2031    1550    2121    1578

I found a way to flag rows that share a line:

same_line = find_labels['line'].map(find_labels['line'].value_counts() > 1)

and I tried to find neighbouring values with a gap below 10, but I don't know how to do this. I tried a loop and .cov(), but neither works.

Can someone help me?

Thanks for your help

Manu64
  • Possible duplicate of [How to iterate over rows in a DataFrame in Pandas?](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas) – lopezdp Dec 29 '18 at 04:20
  • Maybe you might want to reformat the question to include an approach you have already tried or are currently debugging. Check out the help docs on how to ask the best question possible to try to get as much help as you can without expecting others to do the work for you. – lopezdp Dec 29 '18 at 04:21
  • are the lines Amount and VAT always offset by 1? You're probably better off trying to line them up on one row. – MrE Dec 29 '18 at 06:07
  • is it correct that `line` is the same for the Amount and VAT? in which case I would split the table into Amount on one side, and VAT on the other, then merge the 2 DataFrames on `line` and you'll have everything in one row – MrE Dec 29 '18 at 06:08
  • Yes, line is the same. The goal is to find the full VAT label, for example, and not only part of it. 'Total Amount VAT' is the label I am searching for (based on a dictionary). Sometimes it is 'Amount VAT', sometimes 'Total Amount VAT', and I want to find the label with the maximum number of words from my dictionary. – Manu64 Jan 09 '19 at 22:20
  • Any help to finish ? Thanks ! – Manu64 Jan 10 '19 at 15:13

1 Answer

Assuming VAT and Amount are both indexed by the same line value, I would do this:

import pandas as pd

# index by line so the join below pairs rows from the same line
df = df.set_index('line')

# split up the table into the 2 parts to work on
amount_df = df[df['text'] == 'Amount']
vat_df = df[df['text'] == 'VAT']

# join the 2 tables to get everything on one row
# (lsuffix tags the left/Amount columns, rsuffix the right/VAT columns)
df2 = amount_df.join(vat_df, how='inner', lsuffix='amount', rsuffix='vat')

# do the math: keep pairs where VAT starts less than 10 after Amount ends
condition = df2['xvat'] - df2['x2amount'] < 10
df2 = df2[condition].copy()

# build a 'Total' row by extrapolating to the left of Amount
df2['text'] = 'Total'
for col in ['x', 'y', 'x2', 'y2']:
    df2[col] = df2[col + 'amount'] - (df2[col + 'vat'] - df2[col + 'amount'])

# DataFrame.append was removed in pandas 2.0, so use pd.concat
result = pd.concat([df, df2[['text', 'x', 'y', 'x2', 'y2']]])

I get a result (posted as an image in the original answer) that is not exactly what you asked for, but you get the idea. I am not sure what the right math is to produce the results you show.
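If the goal is simply to keep every run of two or more neighbouring texts, whatever the words are, a more general approach might look like the sketch below. It is only an illustration under stated assumptions: the function name `find_adjacent_texts` and the parameter `max_gap` are hypothetical, the frame is assumed to have the `line`, `x` and `x2` columns from the question, and "adjacent" is taken to mean that a box's `x` minus the previous box's `x2` on the same line is below `max_gap`.

```python
import pandas as pd

def find_adjacent_texts(df, max_gap=10):
    """Return rows that belong to a run of 2+ texts on the same line
    whose gap (x minus the previous box's x2) is below max_gap."""
    # order the boxes left to right within each line
    out = df.sort_values(['line', 'x']).copy()
    # gap between this box's left edge and the previous box's right edge;
    # NaN for the first box of each line
    gap = out['x'] - out.groupby('line')['x2'].shift()
    # a new run starts at the first box of a line or after a wide gap
    new_run = gap.isna() | (gap >= max_gap)
    out['run'] = new_run.cumsum()
    # keep only the runs that contain at least two boxes
    size = out.groupby('run')['run'].transform('size')
    return out[size >= 2].drop(columns='run')

# sample data copied from the question
find_labels = pd.DataFrame(
    {
        'line': [11, 14, 15, 16, 16],
        'text': ['Amount', 'Amount', 'Amount', 'Amount', 'VAT'],
        'x': [2184, 1532, 1893, 1893, 2031],
        'y': [1140, 1450, 1500, 1551, 1550],
        'x2': [2311, 1660, 2021, 2022, 2121],
        'y2': [1166, 1476, 1527, 1578, 1578],
    },
    index=[29, 51, 66, 70, 71],
)
result = find_adjacent_texts(find_labels)
print(result)
```

On the sample data this keeps only rows 70 and 71. Because runs are built with a cumulative sum over "new run" flags, the same code should also handle three or more texts in a row, such as the 'Total', 'Amount', 'VAT' case, without pairing each word explicitly.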

MrE