0

I've used python and PYtesseract to run OCR on an image. Here's my code:

test = pytesseract.image_to_string(img)

and then I converted that to a data frame:

data = io.StringIO(result)
df = pd.read_csv(data, index_col=False, sep=",")

however this stores all the data from the image into a single column. formatted like this:

TimeLine (column header)<break>
schedule<break>
log_in<break>
log_out
Advisor (should be the second column header)
James
Mathew
Kent

I want to split the column horizontally into separate data frames. So that it will be formatted like this:

Timeline(header)   Advisor(header)

Schedule         James

Log_in           Mathew

Log_out          Kent

The issue is that the values are not all the same, so I can't use a group by function. I also can't use the df.iloc[0:3] option either because the values will not consistently be on the same rows every time I do this. I've tried using new_df = df.loc[:'Advisor'] to try and define a new data frame, but all that does is return the entire data frame without an error.

Is there a way to tell it to split horizontally into a separate data Frame based on a unique cell value? So like, split df where column value = 'Advisor'.

The split function is easy to use if I want to split things vertically. But I can't see an easy way to split a column horizontally based on a unique value within that column.

I'm super frustrated because this has to be something that happens all the time but I've been looking around for hours and can't find any solution.

piterbarg
  • 8,089
  • 2
  • 6
  • 22
BrOtis
  • 1
  • 2
  • Please post the dataframe, or a representative example, you want transformed, as a construction code: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – piterbarg Feb 04 '22 at 07:55
  • current_df = ({'Timeline': ['Schedule', 'Log_in', 'Log_out', 'Advisor', 'James', 'Mathew', 'Kent']}) desired_df = ({'Timeline': ['Schedule', 'Log_in', 'Log_out'], 'Advisor': ['James', 'Mathew', 'Kent']}) – BrOtis Feb 04 '22 at 18:02
  • You asked me to present the DF in the form that was outlined in that link you sent me. So that's what I did. Im not sure why its so confusion. I have a full data frame that is in 1 column. I need to split the column into multiple data frames. I have no idea how else to describe it??? the code above is what the data frames would look like if I were to have created them by hand. its the exact same data as what I put in the post. – BrOtis Feb 04 '22 at 18:37
  • ah sorry when I looked at it it had nothing to do with your question. must have been a glitch or an intermediate state. makes more sense now – piterbarg Feb 04 '22 at 18:50

1 Answers1

0

This needs a few steps

We start with the df:

current_df = pd.DataFrame({'Timeline': ['Schedule', 'Log_in', 'Log_out', 'Advisor', 'James', 'Mathew', 'Kent']}) 

For the current_df, we need to set the row and column indices by breaking the column of curent_df at the Advisor value. Slightly complicated by the column labels vs values, but we can do it like so:

df = current_df.T.reset_index().T
df['colnum'] = (df[0]=='Advisor').cumsum()
df['rownum'] = df.groupby('colnum', group_keys = False).cumcount()

at this point df looks like this. Note 'colnum' and 'rownum' that will be used later

       0           colnum    rownum
-----  --------  --------  --------
index  Timeline         0         0
0      Schedule         0         1
1      Log_in           0         2
2      Log_out          0         3
3      Advisor          1         0
4      James            1         1
5      Mathew           1         2
6      Kent             1         3

Once these values are set it becomes a simple matter of unstack with a few cleanup operations

df = (df.set_index(['rownum', 'colnum'])
        .unstack()
        .droplevel(level=0,axis=1)
        .T.set_index(0).T
)

output:

  rownum  Timeline    Advisor
--------  ----------  ---------
       1  Schedule    James
       2  Log_in      Mathew
       3  Log_out     Kent
piterbarg
  • 8,089
  • 2
  • 6
  • 22
  • Wow! that is exactly the solution I was hoping for. Man, I've been asking around and looking every where for days. I was on the verge of giving up. Thank you! – BrOtis Feb 06 '22 at 20:46
  • Question for you, if you were going to repeat this same sort of thing but for more column headers that are in that DF, how would you do that? So, lets say instead of it just being Timeline and Advisor, that after the advisor data there is "State" and then state data, and after state there is, "Start time" and start time data. how would you separate those out as well? I thought I could figure out a way to take what you did and then just repeat that, but it doesn't seem to work. Still trying though. thank you for your help! – BrOtis Feb 06 '22 at 21:07