0

I'm new to Web Scraping. I would like to extract the values from the table into Pandas. The Table is an image, i.e one cannot right click and copy and paste the data by selecting it. How do i go about it? I'm unable to "identify" the table image and a simple ctrl+f of the console output seems to not return any of the values in the tableenter image description here.

from bs4 import BeautifulSoup as bs
import pandas as pd
pd.set_option('display.max_colwidth', 500)
import time
import requests
import pytesseract as pytes
import random
myURL =  "https://www.netztransparenz.de/EEG/Marktpraemie/EPEX-SPOT-Stundenkontrakte"
page = requests.get(myURL)

print (page)

soup = bs(page.content,"html.parser")

print(soup) <--- ctrl + F the result to locate values seen in the table
OldNick
  • 43
  • 6
  • 1
    It's quite tricky to convert images to frames. Is the data available anywhere else? – MDR Aug 24 '21 at 02:33
  • @MDR how yes but it has to paid for. Do you have any tips on locating the image within the HTML code? – OldNick Aug 24 '21 at 07:23

2 Answers2

1

I figured it out using pytesseract. Useful suggestions can be found here. Use pytesseract OCR to recognize text from an image

OldNick
  • 43
  • 6
  • I'd be interested to know if the processing was 100% accurate? It would be good if you can share the code and results. – MDR Aug 24 '21 at 13:21
1

I had a go myself. Pretty much my first go at using pytesseract so take this for what it's worth :o)...

Looking at the source of the page, the image of the table is available from the url:

https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date=2021/08/23&type=1

The latter part of the url (not quite at the end) has the date: date=2021/08/23. If this is altered the data's time period changes and you get a new image. Hence the following code seems to work purely as a check of trying to access three images:

import urllib.request
from time import sleep

main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='

tail_url = '&type=1'

dates = ['2021/08/21', '2021/08/22', '2021/08/23']

for date in dates:
   r = urllib.request.urlopen(main_url+date+tail_url)
   print(r.getcode())
   sleep(1)
   
>> 200 #'200' is 'OK'
>> 200
>> 200

Anyway, for one url the following code gets close:

# https://pypi.org/project/opencv-python/
import cv2
import urllib.request  #I think (?) cv2 wants/prefers urllib instead of requests
import numpy as np

# sudo apt-get update
# sudo apt install tesseract-ocr
# sudo apt install libtesseract-dev
# ...or find for Mac/Windows.  Then...
# https://pypi.org/project/pytesseract/
import pytesseract
import pandas as pd
from io import StringIO

# first part of the url
main_url = 'https://www.netztransparenz.de/DesktopModules/LotesNetztransparenz/ImageCharts/EpexChartImageHandler.ashx?date='

# end part of the url
tail_url = '&type=1'

# date part - the code could be expanded to loop different dates
# and get different frames and concat them together.  For demo let's just get just one
date = '2021/08/23'

#construction a valid url.  If looping and the 'date' above was a list of dates this method would make more sense
url = main_url + date + tail_url

# this probably wants a try except around it and proceed only on an OK/200 response
url_response = urllib.request.urlopen(url)

# download the image
img = cv2.imdecode(np.array(bytearray(url_response.read()), dtype=np.uint8), -1)

# get the image into text
text = pytesseract.image_to_string(img)

# create a rough frame and clean up
df = pd.read_csv(StringIO(text), sep=';', header=0)
df = df[:-2]
# df['a'].str.rsplit(n=7, expand=True)
df['Stunden'] = df['a'].str.extract(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s')
df = df.set_index(df['Stunden'])
df['a'] = df['a'].str.replace(r'((?:\d{2}:?\d{2}) - (?:\d{2}:?\d{2}))\s', '', regex=True)
df = df['a'].str.rsplit(n=7, expand=True)
df = df.reset_index()

# create a dict for column renaming
datesplit = [int(x) for x in date.split('/')]
weekdates = {}
count = 0
for i in range(7, -1, -1):
    d = datesplit[2] - i
    weekdates[count] = (str(datesplit[0]) + '-' + str(datesplit[1]) + '-' + str(d))
    count += 1

# rename columns
df = df.rename(columns=weekdates)

# output frame
display(df)

enter image description here

It doesn't look too bad. However, if you look closely it's not perfect...

enter image description here

If the frame has only a few glitches in the image processing you could correct them manually with lines like:

df.loc[12, '2021-8-20'] = '9,911'
df.loc[17, 'Stunden'] = '17:00 - 18:00'

However, if there are more issues (and there probably are) it would be better replacing the dots and commas completely and/or trying a broader clean up strategy. Or looking to improve the processing of the image (that would be to post another question if you get stuck and an answer from someone who knows more about the modules pytesseract and/or cv2).

Btw: You can tell the problem is in the image processing because a check of the text variable shows what was originally extracted before putting it into a dataframe...

enter image description here

MDR
  • 2,610
  • 1
  • 8
  • 18