Is it possible to scrape images from a webpage, convert them to numbers and save them to a file?

Question

I realize that this is 99.999% impossible (if I were certain that it was 100% impossible, I wouldn't have asked the question).

I want to get all the Lebanese lottery numbers. The only websites I found were this or this or this. I contacted these sites, asking for an Excel or CSV file. One didn't reply; one said that what you see is what you get and that they don't offer files; and the third gave me an ODS file with many missing results and many incorrect ones.

I just want these results for a personal project. Since the website admins are not helping me, I either have to hack into their database, which would be easy if I were an anonymous member, or I have to scrape the images, convert them to numbers and save them into CSV files or whatever.

If it were only text, I would have used BeautifulSoup, but is it possible to scrape images, convert them to numbers and store them as rows in CSV files?

My preferred language is Python, but I'd accept anything as long as it does the job.

@furas is on the money: scrape the HTML and look at the image names. For example, the first link's images end with `_##.gif` – Mark Silverberg, Jul 11 '14 at 21:39
@furas of course, I'm testing your answer and will most probably accept yours. I voted up all the answers though; all of them are interesting, and yours is the easiest. It's just almost 2 am, so I'm not concentrating right now :) – Lynob, Jul 11 '14 at 22:41

furas · Accepted Answer · 2014-07-11T21:54:50.197

2

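import requests
import lxml.html  # html.cssselect() also needs the separate cssselect package

r = requests.get('http://www.lldj.com/pastresult.php')

html = lxml.html.fromstring(r.text)

imgs = html.cssselect('img')

for x in imgs:
    # .get() avoids a KeyError on <img> tags that have no src attribute
    src = x.get('src', '')
    # ball images live under images/Balls; the two characters before
    # the file extension are the drawn number
    if src.startswith('images/Balls'):
        print(src[-6:-4])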

result (RESULTS OF DRAW 1212 ON 10/7/2014):

On the other site the draw number is part of the URL (1154 here), so you can get any draw by changing it:

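import requests
import lxml.html  # html.cssselect() also needs the separate cssselect package

r = requests.get('http://www.lebanon-lotto.com/lebanese-loto-results/draw-number/1154.php')

html = lxml.html.fromstring(r.text)

imgs = html.cssselect('img')

for x in imgs:
    # .get() avoids a KeyError on <img> tags that have no src attribute
    src = x.get('src', '')
    # on this site the ball images have 'lotto_balls_gray' in their path
    if 'lotto_balls_gray' in src:
        print(src[-6:-4])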

result:
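To save the numbers to a file, as the question asks, the same filename trick can be combined with the `csv` module. The following is only a sketch: it swaps in the standard library's `html.parser` for lxml so that it runs without extra packages, and the sample markup and file names in it are made up for illustration (the real pages may name their images differently).

```python
import csv
import io
import re
from html.parser import HTMLParser

# Assumption: ball images end in a two-digit number before the extension,
# e.g. "..._07.gif". Other sites may name their files differently.
BALL_PATTERN = re.compile(r'(\d{2})\.(?:gif|png|jpg)$')


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def extract_numbers(html_text, marker):
    """Return the ball numbers found in image paths containing `marker`."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    numbers = []
    for src in parser.sources:
        match = BALL_PATTERN.search(src)
        if marker in src and match:
            numbers.append(int(match.group(1)))
    return numbers


def write_draws(rows, out_file):
    """Write (draw_number, [numbers]) pairs as CSV rows."""
    writer = csv.writer(out_file)
    for draw_number, numbers in rows:
        writer.writerow([draw_number] + numbers)


# Hypothetical sample markup standing in for a downloaded results page.
sample_html = (
    '<img src="images/logo.gif">'
    '<img src="images/lotto_balls_gray_05.gif">'
    '<img src="images/lotto_balls_gray_17.gif">'
    '<img src="images/lotto_balls_gray_42.gif">'
)

numbers = extract_numbers(sample_html, 'lotto_balls_gray')
buffer = io.StringIO()
write_draws([(1154, numbers)], buffer)
print(buffer.getvalue())
```

To write a real file, pass `open('draws.csv', 'w', newline='')` instead of the `StringIO` buffer, and feed `extract_numbers` the text of each downloaded page.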

edited Jul 11 '14 at 21:54

answered Jul 11 '14 at 21:45

furas


you forgot to `import cssselect` – Lynob Jul 14 '14 at 00:00
A small question: how do I get the old results, say draws 1 and 2? This site doesn't have them: http://www.lebanon-loto.com/past_results_list.php and on the other sites the URL doesn't change, it's all AJAX and stuff. How do I get past results from there? – Lynob Jul 14 '14 at 00:11
Change the number in the URL `http://www.lebanon-lotto.com/lebanese-loto-results/draw-number/1154.php`; it's the draw number. – furas Jul 14 '14 at 00:25
Yes, but if I want, say, `http://www.lebanon-lotto.com/lebanese-loto-results/draw-number/23.php`, that draw doesn't exist on that site. That site only has draws from early 2013 or 2012, while Lebanese lotto results have been posted on websites since 2002, so these results have to be taken from here http://www.lldj.com/pastresult.php or here http://www.playlebanon.com/webservices/website/lotto/default.aspx – Lynob Jul 14 '14 at 00:31
To be sure, try 0023.php. If there are links to past draws, check them; maybe they used different URLs in the past. – furas Jul 14 '14 at 00:34
Okay, I contacted the admin of that site to see if he has those results. I'll get back to you soon, thanks – Lynob Jul 14 '14 at 00:36
First read their documentation (or something similar); maybe scraping their pages is against their rules :) – furas Jul 14 '14 at 00:38
http://www.lldj.com uses the `POST` method to send the draw number and get the results. `requests` can send `POST`, but you have to find out what is sent to the server; you can use Firebug (a Firefox extension) to see what the browser sends to the server. – furas Jul 14 '14 at 00:43
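As a rough illustration of the `POST` approach from the last comment, the sketch below builds such a request with the standard library's `urllib` (swapped in for `requests` so it runs without extra packages). The form field name `draw` is a pure placeholder, not the site's real parameter; the actual name has to be read from the browser's network tools. With `requests`, the equivalent call would be `requests.post(url, data={...})`.

```python
import urllib.parse
import urllib.request

# The results URL comes from the thread; the field name is a placeholder.
RESULTS_URL = 'http://www.lldj.com/pastresult.php'


def build_draw_request(draw_number, field_name='draw'):
    """Build a POST request asking for one draw (it is not sent here)."""
    # 'draw' is a hypothetical field name: inspect the real form submission
    # in the browser to find what the server expects.
    form = urllib.parse.urlencode({field_name: draw_number}).encode('ascii')
    return urllib.request.Request(RESULTS_URL, data=form, method='POST')


request = build_draw_request(23)
print(request.get_method(), request.full_url, request.data)

# To actually send it (network access required):
# with urllib.request.urlopen(request, timeout=10) as response:
#     html_text = response.read().decode('utf-8', errors='replace')
```

The returned page can then be parsed for image names in the same way as the pages fetched with `GET`.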

score 1 · Answer 2 · answered Jul 11 '14 at 21:45

It is certainly possible. In Python, you can use the scikit-image library (http://scikit-image.org/); with it, you can "read" an image and save it as a matrix of numbers. For this purpose it would be better to convert the image to grayscale; that way you would have a single matrix in which each number corresponds to one pixel, with values ranging from 0 to 255. From this matrix you could identify the number patterns and save them as text. It is a lot of work, but it is definitely doable.

Matlab also easily "reads" images and turns them into matrices.
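To make "identify the number patterns" a bit more concrete, here is a toy sketch of nearest-template matching on grayscale matrices. It assumes the image has already been loaded into a matrix of 0 to 255 values (for example with scikit-image's `skimage.io.imread`) and cropped to one digit; the tiny 3x5 templates are invented for illustration and are far smaller than real ball images.

```python
# Toy 3x5 grayscale templates (0 = black ink, 255 = white background).
# These shapes are illustrative assumptions, not taken from the lottery sites.
TEMPLATES = {
    '1': [
        [255,   0, 255],
        [  0,   0, 255],
        [255,   0, 255],
        [255,   0, 255],
        [  0,   0,   0],
    ],
    '7': [
        [  0,   0,   0],
        [255, 255,   0],
        [255,   0, 255],
        [255,   0, 255],
        [255,   0, 255],
    ],
}


def pixel_distance(image, template):
    """Sum of absolute pixel differences between two equal-sized matrices."""
    return sum(
        abs(a - b)
        for image_row, template_row in zip(image, template)
        for a, b in zip(image_row, template_row)
    )


def classify_digit(image):
    """Return the template label whose pixels are closest to `image`."""
    return min(TEMPLATES, key=lambda label: pixel_distance(image, TEMPLATES[label]))


# A slightly noisy "7": the ink is dark gray rather than pure black.
noisy_seven = [
    [ 20,  10,  30],
    [250, 240,  15],
    [245,  25, 250],
    [255,  10, 240],
    [250,  20, 255],
]
print(classify_digit(noisy_seven))
```

Real images would need one template per digit at the right size, plus a step that splits a two-digit ball into separate digits, which is where most of the work mentioned above comes from.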

score 1 · Answer 3 · edited May 23 '17 at 11:50


To start you off, I would look into HtmlAgilityPack (a .NET library) for the image scraping. Example implementation here. After that, I would use the python-tesseract wrapper for tesseract-ocr (a C++ library) for the optical character recognition.
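A minimal sketch of the OCR step might look like the following. It assumes the `pytesseract` and Pillow packages (a different wrapper from the python-tesseract one named above, swapped in here as an assumption) and a separately installed Tesseract binary; whether Tesseract reads the small ball images reliably is untested, and the helper names are made up for illustration.

```python
import re


def ocr_ball_image(path):
    """Run Tesseract on one image file and return the raw recognized text."""
    # Imported lazily: both packages are third-party, and the Tesseract
    # binary itself must be installed separately.
    from PIL import Image
    import pytesseract

    # --psm 7 treats the image as a single line of text; the whitelist asks
    # for digits only (how well it is honored varies by Tesseract version).
    config = '--psm 7 -c tessedit_char_whitelist=0123456789'
    return pytesseract.image_to_string(Image.open(path), config=config)


def numbers_from_ocr_text(text):
    """Extract integers from OCR output, ignoring stray characters."""
    return [int(token) for token in re.findall(r'\d+', text)]


# The cleanup step can be tried without Tesseract installed:
print(numbers_from_ocr_text(' 07 \n'))
```

In use, each downloaded image would go through `numbers_from_ocr_text(ocr_ball_image(path))`, and the results would be worth spot-checking by hand, since OCR on small graphics tends to misread digits.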


answered Jul 11 '14 at 21:46

etr

