14
  • What's the best way to store a formatted HTML page with CSS in a MySQL database? Is it possible?
  • What should the column type be? How do I retrieve the stored formatted HTML and display it correctly using PHP?

  • What if the page I would like to fetch has pictures and videos? Should I store the page as a BLOB?

  • What's the best way to fetch a page using PHP: cURL, fopen, or something else?

That's a lot of questions, but I really need your help to point me in the right direction.

Thanks a lot.

codemaker

5 Answers

8

It's quite simple. Try this code I wrote for you.

It covers the basics of grabbing the page source and saving it in a DB.

I left out error handling and everything else to keep it simple for the moment.

I didn't write a function to show the result, but you can print $source to view it.

Hope this will help you.

<?php

function GetPage($URL)
{
    #Get the source content of the URL
    $source = file_get_contents($URL);

    #Extract the base URL (scheme and host) from the current one
    $scheme = parse_url($URL, PHP_URL_SCHEME); //Ex: http
    $host = parse_url($URL, PHP_URL_HOST); //Ex: www.google.com
    $raw_url = $scheme . '://' . $host; //Ex: http://www.google.com

    #Replace the root-relative links with absolute ones
    $relative = array();
    $absolute = array();

    #Patterns to search for; (?!\/) skips protocol-relative URLs such as src="//cdn.example.com/a.js"
    $relative[0] = '/src="\/(?!\/)/';
    $relative[1] = '/href="\/(?!\/)/';

    #Strings to replace them with
    $absolute[0] = 'src="' . $raw_url . '/';
    $absolute[1] = 'href="' . $raw_url . '/';

    $source = preg_replace($relative, $absolute, $source); //Ex: src="/image/google.png" to src="http://www.google.com/image/google.png"

    return $source;
}

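function SaveToDB($source)
{
    #Connect to the DB and select the DB name
    #(mysqli is used here because the old mysql_* functions were removed in PHP 7)
    $db = mysqli_connect('localhost', 'root', '', 'test');

    #Ask for UTF-8 encoding
    mysqli_set_charset($db, 'utf8');

    #Set the query; the ? placeholder replaces manual escaping of special chars
    #Save it in a TEXT column, that's it... (TEXT holds up to 64 KB; use MEDIUMTEXT for larger pages)
    $stmt = mysqli_prepare($db, 'INSERT INTO website (source) VALUES (?)');

    #Bind the page source as a string parameter
    mysqli_stmt_bind_param($stmt, 's', $source);

    #Run the query
    mysqli_stmt_execute($stmt);

    #Close the statement and the connection
    mysqli_stmt_close($stmt);
    mysqli_close($db);
}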

$source = GetPage('http://www.google.com');

SaveToDB($source);

?>
geek1983
  • Thanks a lot for the code. I need to store a formatted web page with CSS and pictures so that when I echo it, the result is a formatted web page just like the original. I don't think the code above would do that. Please correct me if I'm wrong. – codemaker May 04 '10 at 19:08
  • Yes it will. Try it yourself with: echo GetPage('http://www.google.com'); You will see a page identical to Google's. If that's not what you want, I didn't understand your request... – geek1983 May 04 '10 at 19:34
  • Thanks, I'm going to give it a try. – codemaker May 04 '10 at 19:43
1

Pull down the whole page using fopen and parse out any URLs (such as images and CSS). You'll want to loop over the URLs of the files that make up the page and fetch each one. Store these as well, and replace the URLs that pointed to the other site's files with your new links. (This avoids any issues if the files change or are removed in the future.)

I'd recommend using a blob datatype because it would let you store all the files in one table, but you could also use one table with a text datatype for the pages and another with a blob datatype for images and other files.

Edit: If you are storing as a blob datatype, look into base64_encode(). It will increase the storage footprint on the server (by about a third), but you'll avoid any issues with quotes and special characters.
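
The "parse out any URLs" step could be sketched as follows. This is a minimal illustration using PHP's built-in DOMDocument, not a complete crawler: the function name collectAssetUrls is hypothetical, it only looks at img, script and link tags, it only resolves root-relative and protocol-relative URLs, and it does not follow URLs inside CSS or JavaScript files.

```php
<?php
// Hypothetical helper: list the asset URLs (images, scripts, <link> targets)
// referenced by an HTML page, resolved against the page's scheme and host.
function collectAssetUrls(string $html, string $baseUrl): array
{
    $scheme = parse_url($baseUrl, PHP_URL_SCHEME);
    $base = $scheme . '://' . parse_url($baseUrl, PHP_URL_HOST);

    // Real-world markup is rarely well-formed, so collect parser warnings
    // internally instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag name => attribute that holds the URL.
    $targets = array('img' => 'src', 'script' => 'src', 'link' => 'href');
    $urls = array();

    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $url = $node->getAttribute($attr);
            if ($url === '') {
                continue; // e.g. an inline <script> without src
            }
            if (strpos($url, '//') === 0) {
                $url = $scheme . ':' . $url;   // protocol-relative URL
            } elseif ($url[0] === '/') {
                $url = $base . $url;           // root-relative URL
            }
            // Anything else (absolute or path-relative) is kept as-is.
            $urls[] = $url;
        }
    }

    return array_values(array_unique($urls));
}
```

Each returned URL could then be fetched and stored, and the corresponding attribute in the HTML rewritten to point at the stored copy.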

Mestore
  • Correct me if I'm wrong, please: you suggest parsing the page in two steps, first without the links to CSS and images, and second with the links. My question is how I should put the whole thing together, store it in a blob, and then retrieve and display it with the correct format. Would you please explain more? – codemaker May 03 '10 at 22:32
  • You cannot save the whole page as one file. You need to collect the links within the page (CSS, JavaScript, images, etc.), then fopen and save those files locally. A lot of the links will be relative, so modify them so that fopen can open the files. Once those files have been saved locally, change the links in the HTML to your local links. You'll also have to check any JavaScript and CSS for links and repeat the process for those files. I assume you are using this to rip pages from other sites (similar to http://www.archive.org/) and not to store templates created locally. – Mestore May 04 '10 at 00:23
  • Do you know a speedy HTML parser implemented in PHP to achieve the task? – codemaker May 04 '10 at 12:08
  • I've never used it, but I believe http://sourceforge.net/projects/simplehtmldom/ is fairly easy to set up and should allow you to change the HTML to fit your needs. There are lots of parsers out there, and a quick Google query will find most of them. – Mestore May 04 '10 at 14:53
1

Don't use a relational database to store files. Use a filesystem or a NoSQL solution.

You might want to look into the various open source spiders that are available (htdig and httrack come to mind).
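
As a rough illustration of the filesystem option, a page could be written to a file named after a hash of its URL. This is only a sketch under assumptions: the function name savePageToDisk, the sha1-based file naming and the directory layout are all hypothetical choices, not something the tools above prescribe.

```php
<?php
// Hypothetical helper: write a page's HTML to a file named after a hash of
// its URL and return the path of the file that was written.
function savePageToDisk(string $url, string $html, string $dir): string
{
    // Create the storage directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0755, true)) {
        throw new RuntimeException('Cannot create directory: ' . $dir);
    }

    // sha1() gives a fixed-length, filesystem-safe name for any URL.
    $path = rtrim($dir, '/') . '/' . sha1($url) . '.html';

    if (file_put_contents($path, $html) === false) {
        throw new RuntimeException('Cannot write file: ' . $path);
    }

    return $path;
}
```

Reading the page back is then a matter of recomputing sha1($url) and calling file_get_contents() on the resulting path.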

NeuroScr
1

I'd store the URLs in a database and set up a cron job to wget the pages regularly, storing them in their own keyed local directories. Using wget will allow you to cache the page and, optionally, its images, scripts, and so on. You can also have your wget command change the embedded URLs so that you don't have to cache everything.

Here is the man page for wget, you may also consider searching for "wget backup website" or similar.

(By "keyed directories" I mean that your database table would have two fields, a 'key' and a 'url'; the [unique] 'key' would then be the path to which wget archives the website.)
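
A sketch of how the cron job might build the wget command for one database row is shown below. The function name buildWgetCommand, the archive root and the key format are assumptions made for illustration; the wget options used (--page-requisites, --convert-links, --directory-prefix) are standard ones, but check the man page for the behavior you actually need.

```php
<?php
// Hypothetical helper: build the wget command that archives one URL into
// the directory named after its key.
function buildWgetCommand(string $key, string $url, string $archiveRoot): string
{
    // Assumed key format: letters, digits, underscore and hyphen only,
    // so the key is always safe to use as a directory name.
    if (!preg_match('/^[A-Za-z0-9_-]+$/', $key)) {
        throw new InvalidArgumentException('Invalid key: ' . $key);
    }

    $target = rtrim($archiveRoot, '/') . '/' . $key;

    // --page-requisites also fetches images, CSS and scripts;
    // --convert-links rewrites embedded URLs to suit local viewing;
    // --directory-prefix sets where the files are saved.
    return 'wget --page-requisites --convert-links --directory-prefix='
        . escapeshellarg($target) . ' ' . escapeshellarg($url);
}
```

The cron script would loop over the ('key', 'url') rows and pass each resulting command to exec().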

Geoff
  • Why not? Since the URL of a web page is very small in size, I see no problem with saving the content of the page in a text or blob column. I believe fetching 60 KB or so from a database would be faster than from a local hard disk. – codemaker May 04 '10 at 19:01
-1

You can store the data in a text datatype column in MySQL, but you have to convert the data first, because the page may contain many quotes and special characters.
You can see this question. It's not exactly the same as yours, but it will help when you store the data in the database.
As for the images and videos: if you are storing the page content, it will contain only the paths to those images and videos, so storing it in the database won't cause a problem.

Nitz
  • -1 for being mostly unreadable and largely wrong. Dealing with quotes does not require that the data be "converted", merely that you perform the standard, routine approaches for inserting data into a database. Additionally, relative URIs will break as soon as the HTML is moved away from its original URI. – Quentin May 03 '10 at 21:55
  • Once you have data with styles and many quotes, you will get my point. I think your page contains no quotes or stylesheets. Mostly, when you store data entered by users, you don't know what they will enter. So if you don't like it, that's OK... If the data is entered only by you, then you can take care of the quotes. Quotes become a problem when you run the query. – Nitz May 04 '10 at 03:50