I've used Python and pytesseract to run OCR on an image. Here's my code:
result = pytesseract.image_to_string(img)
and then I converted that to a data frame:
data = io.StringIO(result)
df = pd.read_csv(data, index_col=False, sep=",")
However, this stores all the data from the image in a single column, formatted like this:
TimeLine (column header)
schedule
log_in
log_out
Advisor (should be the second column header)
James
Mathew
Kent
I want to split the column horizontally into separate data frames, so that it will be formatted like this:
Timeline(header) Advisor(header)
Schedule James
Log_in Mathew
Log_out Kent
The issue is that the values are not all the same, so I can't use a group-by function. I can't use df.iloc[0:3] either, because the values will not be on the same rows every time I do this. I've tried new_df = df.loc[:'Advisor'] to define a new data frame, but all that does is return the entire data frame without an error.
Is there a way to split the column horizontally into separate data frames based on a unique cell value? Something like: split df at the row where the column value == 'Advisor'.
The split function is easy to use if I want to split things vertically. But I can't see an easy way to split a column horizontally based on a unique value within that column.
I'm super frustrated because this has to be something that happens all the time, but I've been looking around for hours and can't find a solution.
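A minimal sketch of one possible approach, not a definitive answer. It assumes the frame has exactly one column, that the marker value ('Advisor') appears in it at least once, and that the OCR text may carry stray whitespace. The function name split_on_marker and the sample data are hypothetical, made up to mirror the layout above. The idea is to find the row position of the marker, slice above and below it with iloc, and join the two pieces side by side with pd.concat.

```python
import numpy as np
import pandas as pd


def split_on_marker(df, marker):
    """Split a one-column frame at the first row equal to `marker`
    and return the two pieces side by side as two columns."""
    col = df.columns[0]
    # Boolean mask of rows whose (whitespace-stripped) text equals the marker.
    mask = df[col].astype(str).str.strip().eq(marker).to_numpy()
    hits = np.flatnonzero(mask)
    if hits.size == 0:
        raise ValueError(f"marker {marker!r} not found in column {col!r}")
    pos = int(hits[0])  # positional index of the first match

    # Rows above the marker keep the original header.
    left = df.iloc[:pos].reset_index(drop=True)
    # Rows below the marker get the marker text as their header.
    right = df.iloc[pos + 1:].reset_index(drop=True).rename(columns={col: marker})

    # Align on the fresh 0..n-1 index and place the pieces next to each other.
    return pd.concat([left, right], axis=1)


# Hypothetical stand-in for the frame produced by pd.read_csv above.
df = pd.DataFrame(
    {"TimeLine": ["schedule", "log_in", "log_out", "Advisor", "James", "Mathew", "Kent"]}
)
result = split_on_marker(df, "Advisor")
print(result)
```

If the two pieces have different lengths, pd.concat pads the shorter one with NaN rather than raising, so it may be worth checking the lengths before relying on the row pairing.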