
I'm trying to store incoming real-time prices in a table and then export them for further analysis, but I've failed so far. If someone could help me, that would be great.

So far I've managed to write code that scrapes the price from the website, but I don't know how to store the incoming prices in a table so I can export them for further analysis. I was thinking about using pandas, but then I saw a topic on Stack Overflow where they say HDF5 would be a better way to do it, and I failed to implement it.

The topic: How to handle incoming real time data with python pandas

import requests
from bs4 import BeautifulSoup
import time
import pandas as pd

def real_price():
    # fetch the Yahoo Finance quote page for Facebook
    r = requests.get('https://fr.finance.yahoo.com/quote/fb?ltr=1')

    # parse as HTML (the 'xml' parser treats the page as XML and can silently drop content)
    soup = BeautifulSoup(r.text, 'html.parser')

    # the quote header div holds the current price in its first span
    price = soup.find_all('div', {
        'class' : 'My(6px) Pos(r) smartphone_Mt(6px)'
    })[0].find('span').text

    return price

starttime = time.time()

while True:
    print(real_price())
    # sleep until the next 10-second boundary, compensating for drift
    time.sleep(10.0 - ((time.time() - starttime) % 10.0))

This code works fine: it returns the price every 10 seconds.

  • hdf5 is a data storage format and pandas is a data manipulation tool. It's not clear to me what you're trying to do – roganjosh Jun 14 '19 at 12:00
  • If you can get it to work nicely in pandas, there is a [.to_hdf](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_hdf.html) feature to keep it backed up. You can also load it back into pandas from hdf5. – cardamom Jun 14 '19 at 12:01
  • I'm sorry, I'm new to coding; I'll try to explain my issue as clearly as I can – jr ewing Jun 14 '19 at 12:31
  • I get a price every 10 seconds – jr ewing Jun 14 '19 at 12:31
  • I'm trying to store that price in a table, and I intend to export this table to run analysis on the prices. I don't know how to write it correctly, i.e. how to capture the incoming price into a pandas DataFrame – jr ewing Jun 14 '19 at 12:32
  • @jrewing what do you mean by a table? A dictionary? An array? In that case, simply put `data = []` at the top of your code (below the imports, though) and do `data.append(real_price)`. To get the prices back out, do `print(str(data))`. – Geza Kerecsenyi Jun 14 '19 at 12:53
  • Thank you, I will try it. – jr ewing Jun 14 '19 at 13:05
  • I get this result: `[<function real_price at 0x...>]`, then `[<function real_price at 0x...>, <function real_price at 0x...>]`, and so on – jr ewing Jun 14 '19 at 13:18
  • How could I just get the prices? – jr ewing Jun 14 '19 at 13:18
  • `data.append(real_price())` should give you the prices. Without parentheses you store a reference to your function, not the output of that function... – chuni0r Jun 14 '19 at 14:30
  • Thank you. So I get a list that grows every 10 seconds, but how could I keep this list growing and at the same time periodically export its contents in CSV format? (See the sketch after these comments.) – jr ewing Jun 14 '19 at 14:53
  • @chuni0r thanks, that was a mistake (I was in a rush while writing that comment). @jrewing use the built-in `csv` module - see https://stackoverflow.com/questions/21465447/writing-array-to-csv-python-one-column?rq=1 – Geza Kerecsenyi Jun 14 '19 at 15:08
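Pulling the comment suggestions together, here is a minimal sketch: it reuses real_price() from the question and appends a (timestamp, price) pair on each iteration; the file name prices.csv and the once-a-minute flush interval are arbitrary choices, and the commented-out line shows the .to_hdf alternative mentioned above.

import csv
import time
from datetime import datetime

data = []  # grows by one (timestamp, price) pair every 10 seconds

starttime = time.time()
while True:
    data.append((datetime.now().isoformat(), real_price()))

    # rewrite the CSV from scratch every 6 readings (roughly once a minute)
    if len(data) % 6 == 0:
        with open('prices.csv', 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['time', 'price'])
            writer.writerows(data)
        # HDF5 alternative (requires pandas plus the 'tables' package):
        # pd.DataFrame(data, columns=['time', 'price']).to_hdf('prices.h5', key='prices', mode='w')

    time.sleep(10.0 - ((time.time() - starttime) % 10.0))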

1 Answer


I suppose you want to store prices of a financial instrument together with a timestamp, so you can then sort the time series and work with it. I tried your code (it's one year old, I know!) but it does not work correctly. There is a basic problem: if you want to extract one specific value, e.g. the last price of a stock, using bs4 as the scraping tool, you not only have to use the "find_all" method, but also a "find" inside the records that "find_all" returns, to get at that one specific value.

Let's say the HTML page contains various divs that share the same class, call it 'magic-class', and only one of those divs contains the value you need, the last price. Then you need to find all the divs with that class and, e.g. with a for loop, inspect the value contained in each of them, as in the sketch below.
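As a minimal, self-contained illustration of that pattern (the snippet of HTML and the class name 'magic-class' are invented for the example):

from bs4 import BeautifulSoup

html = """
<div class="magic-class"><span>header</span></div>
<div class="magic-class"><a>184.19</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find_all collects every div carrying the class;
# find then drills into each one for the tag that holds the value
for div in soup.find_all('div', class_='magic-class'):
    link = div.find('a')
    if link is not None:
        print(link.get_text())  # -> 184.19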

So, apart from this problem, which is inherent to the specific structure of the page you intend to scrape, in case you want to store the found values inside a pandas DataFrame, here is an example you could use as a starting point:

import urllib.request
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup
import pandas as pd
import random
from datetime import datetime
import time
from http.cookiejar import CookieJar

price_all = pd.DataFrame()

def checkprice():
    url = "https://www.yourlink.com"
    # Request
    user_agents = [ 
        'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.142 Safari/535.19',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:11.0) Gecko/20100101 Firefox/11.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.151 Safari/535.19',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.0.3705)',
        'Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Win 9x 4.90)',
    ]

    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # a single assignment: assigning addheaders twice would overwrite the first header list
    opener.addheaders = [
        ('User-agent', random.choice(user_agents)),
        ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ]
    # opener.addheaders = [('Accept-Charset', 'ISO-8859-1,utf-8;q=0.7,*;q=0.3') ]
    # opener.addheaders = [('Accept-Language', 'it-IT,it;q=0.8,en-US;q=0.6,en;q=0.4') ]
    # opener.addheaders = [('Accept-Encoding', 'gzip, deflate, sdch') ]
    response = opener.open(url, timeout= 5)

    #choose one webpage parser
    soup = BeautifulSoup(response,'html.parser')
    # soup = BeautifulSoup(response,'html5lib')
    # soup = BeautifulSoup(response,'lxml')


    found_values = soup.find_all('div', class_='magic-class')

    if not found_values:
        print('No value found at ' + url)
        return None


    list_values = []
    list_timestamps = []

    for value in found_values:
        # getting the value: the text of the link inside each matching div
        list_values.append(value.find('a').get_text())

        # optional: record when this value was scraped
        list_timestamps.append(datetime.fromtimestamp(time.time()))

    # build the DataFrame once, after the loop
    df_show_info = pd.DataFrame({
        'Value': list_values,
        'Time': list_timestamps,
    })

    return df_show_info


while True:
    new_prices = checkprice()
    if new_prices is not None:
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        price_all = pd.concat([price_all, new_prices])
    time.sleep(5)

This builds up a general DataFrame called 'price_all', which contains all prices and timestamps, checked approximately every 5 seconds. There are more elegant ways to repeat an action every x seconds; this is the most basic one. One alternative from the standard library is sketched below.
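For instance, the standard-library sched module can re-arm a callback at a fixed delay; a rough sketch, reusing checkprice() and price_all from the code above:

import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)

def poll():
    global price_all
    new_prices = checkprice()
    if new_prices is not None:
        price_all = pd.concat([price_all, new_prices])
    scheduler.enter(5, 1, poll)  # re-arm 5 seconds from now

scheduler.enter(0, 1, poll)  # first run immediately
scheduler.run()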

Using web-scraping tools to obtain prices of financial instruments is rather obsolete nowadays and has been superseded by other methods. One of the best known is pandas-datareader, a simple library that provides easy access to a number of online sources of financial data. It matches the pandas logic perfectly and is really easy to use.
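For example, a minimal sketch with pandas-datareader (installed separately with pip; individual data sources come and go over time, so the 'stooq' source, ticker, and dates here are only an assumption of a combination that has worked):

import pandas_datareader.data as web
from datetime import datetime

# daily OHLC prices for Facebook from the Stooq data source
df = web.DataReader('FB', 'stooq',
                    start=datetime(2019, 1, 1),
                    end=datetime(2019, 6, 14))
print(df.head())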

Does this solve your problem?

Lorenzo Bassetti