
I simply want to put the data from this web page (http://smartpro-data.rwth-aachen.de) into a data frame that looks exactly the same.

Any suggestions how to do it?

Is it also possible to read only a certain number of rows? The page refreshes itself every 5 seconds or so, and I want to feed the data into a streaming dashboard. It would therefore be helpful to read all the rows once and then update the dashboard every couple of seconds by reading just the first few rows.

Thanks in advance for your help, Kerwyn

Kerwyn
  • To begin with, you could copy the content into a `.txt` file and then extract it row by row, splitting each row into columns with the `re` library. Start from [here](https://stackoverflow.com/questions/3277503/how-do-i-read-a-file-line-by-line-into-a-list) and [here](https://stackoverflow.com/questions/4998629/python-split-string-with-multiple-delimiters) – E.Z Oct 14 '17 at 17:34
  • Hey, thanks for your answer :) The "copy the content into a `.txt` file" part is what I am struggling with – Kerwyn Oct 14 '17 at 17:45
  • It is pretty simple. The function you are looking for is `urlopen` (from the `urllib` package). [This](https://stackoverflow.com/questions/33566843/how-to-extract-text-from-html-page/33566923#comment54911928_33566843) will suffice. – E.Z Oct 14 '17 at 17:49
  • Thanks. The only problem is that I cannot see the value of Bpm; each line only gives me this: Bpm:
    Sun Jan 10 11:29:47 2016 : Fix: G3 Coord: 5046.0588 606.1884 #Sat:9
    – Kerwyn Oct 14 '17 at 17:58
  • Okay, hold on for a sec I will post a solution (hopefully). – E.Z Oct 14 '17 at 18:00
  • Thank you. And it only reads till April 26 11:22:39 instead of reading every row till today... – Kerwyn Oct 14 '17 at 18:18

1 Answer


Read HTML Data Directly From Website

First load the page and split its text into a list of lines:

import requests
import re

url = 'http://smartpro-data.rwth-aachen.de/'
html = requests.get(url)
text = html.text.splitlines()  # read the response body and split it on newlines

Okay, now it needs some tinkering. If you look at text[0], you will see that it contains two strings separated by <br>. We want to strip every HTML tag from each string. We do so by defining a function:

def cleanhtml(raw_html):
   cleanr = re.compile('<.*?>')
   cleantext = re.sub(cleanr, '', raw_html)
   return cleantext

data = []
for line in text:
   data.append(cleanhtml(line))  # strip the HTML tags from each line
del html, text  # drop the references that are no longer needed

That would give you in the end something like this:

['Sat Oct 14 20:20:37 2017 : Fix: G3 Coord: 50.9355 6.9443 #Sat:8 Bpm: 156',
'Sat Oct 14 20:20:23 2017 : Fix: G3 Coord: 50.9353 6.9443 #Sat:7 Bpm: 164',
...]

If you wish, you can place data into a pandas.DataFrame or a numpy.array afterwards. Copying the contents into a .txt file first is then unnecessary.
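As a rough sketch of that last step, the cleaned strings could be split into named columns with a regular expression before building the frame. The pattern below is derived only from the two sample lines shown above, so it is an assumption about the page's format. `RECORD`, `parse_records`, `to_dataframe` and the column names are hypothetical names chosen for illustration.

```python
import re
from datetime import datetime

# One record, as it appears in the sample output above (assumed format).
RECORD = re.compile(
    r"(?P<time>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) : "
    r"Fix: (?P<fix>\S+) "
    r"Coord: (?P<lat>[-\d.]+) (?P<lon>[-\d.]+) "
    r"#Sat:(?P<sat>\d+) "
    r"Bpm:\s*(?P<bpm>\d*)"
)


def parse_records(lines):
    """Turn cleaned text lines into a list of dicts, one per record."""
    rows = []
    for line in lines:
        # finditer also copes with several records glued into one line.
        for m in RECORD.finditer(line):
            rows.append({
                # %a and %b assume English day/month names (default C locale).
                "time": datetime.strptime(m["time"], "%a %b %d %H:%M:%S %Y"),
                "fix": m["fix"],
                "lat": float(m["lat"]),
                "lon": float(m["lon"]),
                "sat": int(m["sat"]),
                # Bpm may be missing on some lines; keep None in that case.
                "bpm": int(m["bpm"]) if m["bpm"] else None,
            })
    return rows


def to_dataframe(rows):
    """Build a DataFrame with one column per parsed field."""
    import pandas as pd  # imported here so the parsing works without pandas
    return pd.DataFrame(rows)


sample = [
    "Sat Oct 14 20:20:37 2017 : Fix: G3 Coord: 50.9355 6.9443 #Sat:8 Bpm: 156",
    "Sat Oct 14 20:20:23 2017 : Fix: G3 Coord: 50.9353 6.9443 #Sat:7 Bpm: 164",
]
rows = parse_records(sample)
```

Calling `to_dataframe(parse_records(data))` on the real list would then give one row per record. Lines that do not match the assumed pattern are silently skipped, which may or may not be what you want.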

By the way, the data posted on the website is itself skewed.
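Regarding the second part of the question (reading only the first few rows on each refresh), one possible approach is to stream the response and stop after a fixed number of lines. This is only a sketch: it assumes the newest entries come first on the page, as the sample output suggests, and whether it actually saves bandwidth depends on how the server delivers the page. `first_rows` and `fetch_first_rows` are hypothetical helper names.

```python
from itertools import islice


def first_rows(lines, n):
    """Return at most n non-empty lines from any iterable of str or bytes."""
    decoded = (
        line.decode("utf-8", "replace") if isinstance(line, bytes) else line
        for line in lines
    )
    non_empty = (line for line in decoded if line.strip())
    return list(islice(non_empty, n))


def fetch_first_rows(url, n, timeout=10):
    """Stream the page and stop reading once n lines have arrived."""
    import requests  # imported here so first_rows works without requests

    # stream=True defers downloading the body until it is iterated over.
    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        return first_rows(resp.iter_lines(decode_unicode=True), n)
```

A dashboard could call `fetch_first_rows(url, 10)` every few seconds and pass the result through `cleanhtml` as before, while the initial full load still uses the plain `requests.get(url)` shown earlier.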

E.Z