I had a go myself. Pretty much my first go at using pytesseract
so take this for what it's worth :o)...
Looking at the source of the page, the image of the table is available from the url:
https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date=2021/08/23&type=1
The latter part of the url (not quite at the end) has the date: date=2021/08/23
. If this is altered the data's time period changes and you get a new image. Hence the following code seems to work purely as a check of trying to access three images:
import urllib.request
from time import sleep
main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='
tail_url = '&type=1'
dates = ['2021/08/21', '2021/08/22', '2021/08/23']
for date in dates:
r = urllib.request.urlopen(main_url+date+tail_url)
print(r.getcode())
sleep(1)
>> 200 #'200' is 'OK'
>> 200
>> 200
Anyway, for one url the following code gets close:
# https://pypi.org/project/opencv-python/
import cv2
import urllib.request #I think (?) cv2 wants/prefers urllib instead of requests
import numpy as np
# sudo apt-get update
# sudo apt install tesseract-ocr
# sudo apt install libtesseract-dev
# ...or find for Mac/Windows. Then...
# https://pypi.org/project/pytesseract/
import pytesseract
import pandas as pd
from io import StringIO
# first part of the url
main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='
# end part of the url
tail_url = '&type=1'
# date part - the code could be expanded to loop different dates
# and get different frames and concat them together. For demo let's just get just one
date = '2021/08/23'
#construction a valid url. If looping and the 'date' above was a list of dates this method would make more sense
url = main_url + date + tail_url
# this probably wants a try except around it and proceed only on an OK/200 response
url_response = urllib.request.urlopen(url)
# download the image
img = cv2.imdecode(np.array(bytearray(url_response.read()), dtype=np.uint8), -1)
# get the image into text
text = pytesseract.image_to_string(img)
# create a rough frame and clean up
df = pd.read_csv(StringIO(text), sep=';', header=0)
df = df[:-2]
# df['a'].str.rsplit(n=7, expand=True)
df['Stunden'] = df['a'].str.extract(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s')
df = df.set_index(df['Stunden'])
df['a'] = df['a'].str.replace(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s', '', regex=True)
df = df['a'].str.rsplit(n=7, expand=True)
df = df.reset_index()
# create a dict for column renaming
datesplit = [int(x) for x in date.split('/')]
weekdates = {}
count = 0
for i in range(7, -1, -1):
d = datesplit[2] - i
weekdates[count] = (str(datesplit[0]) + '-' + str(datesplit[1]) + '-' + str(d))
count += 1
# rename columns
df = df.rename(columns=weekdates)
# output frame
display(df)

It doesn't look too bad. However, if you look closely it's not perfect...

If the frame has only a few glitches in the image processing you could correct them manually with lines like:
df.loc[12, '2021-8-20'] = '9,911'
df.loc[17, 'Stunden'] = '17:00 - 18:00'
However, if there are more issues (and there probably are) it would be better replacing the dots and commas completely and/or trying a broader clean up strategy. Or looking to improve the processing of the image (that would be to post another question if you get stuck and an answer from someone who knows more about the modules pytesseract
and/or cv2
).
Btw: You can tell the problem is in the image processing because a check of the text
variable shows what was originally extracted before putting it into a dataframe...
