42

For my website, I'd like to add a new feature.

I would like users to be able to upload their bookmarks backup file (from any browser, if possible) so I can import it into their profile and they don't have to enter all of their bookmarks manually.

The only part I'm missing is extracting the title and URL from the uploaded file. Can anyone give me a clue about where to start or what to read?

I used the search option, and "How to extract data from a raw HTML file?" is the most closely related question I found, but it doesn't cover this.

I really don't mind whether it uses jQuery or PHP.

Thank you very much.

simhumileco
Toni Michel Caubet
  • 1
    it would probably help everyone if you could put up examples of the types of bookmark backup files you'd like to support (for each browser) – scoates Dec 12 '10 at 18:41
  • 1
    The Netscape format is common: http://msdn.microsoft.com/en-us/library/aa753582(v=vs.85).aspx – Matthew Dec 12 '10 at 18:56

6 Answers

89

Thank you everyone, I GOT IT!

The final code:

$html = file_get_contents('bookmarks.html');
//Create a new DOM document
$dom = new DOMDocument;

//Parse the HTML. The @ suppresses the warnings that libxml emits
//when the $html string isn't well-formed HTML.
@$dom->loadHTML($html);

//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');

//Iterate over the extracted links and display each title and URL
foreach ($links as $link){
    //The link text (nodeValue) is the bookmark title; "href" holds the URL.
    echo $link->nodeValue, ': ';
    echo $link->getAttribute('href'), '<br>';
}

This prints the anchor text and the href of every link in an .html file.

Again, thanks a lot.
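
For the bookmark-import use case in the question, the same approach can collect the results into an array instead of echoing them. This is a minimal sketch, assuming a Netscape-format export (the format mentioned in the comments). The function name `extractBookmarks` and the sample markup are made up for illustration, and `libxml_use_internal_errors()` is used here in place of the `@` prefix:

```php
<?php
// Hypothetical helper: returns a list of ['title' => ..., 'url' => ...] rows.
function extractBookmarks(string $html): array
{
    $dom = new DOMDocument;

    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $bookmarks = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $url = trim($link->getAttribute('href'));
        if ($url === '') {
            continue; // skip anchors without an href
        }
        $bookmarks[] = [
            'title' => trim($link->nodeValue),
            'url'   => $url,
        ];
    }
    return $bookmarks;
}

// Made-up sample in the Netscape bookmark style.
$sample = <<<HTML
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3>News</H3>
    <DL><p>
        <DT><A HREF="https://example.com/" ADD_DATE="1292180000">Example</A>
        <DT><A HREF="https://example.org/docs">Docs</A>
    </DL><p>
</DL><p>
HTML;

$bookmarks = extractBookmarks($sample);
print_r($bookmarks);
```

Each row can then be inserted into the user's profile. Note that `loadHTML()` assumes ISO-8859-1 unless the file declares its charset, so non-ASCII titles may need extra care.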

simhumileco
Toni Michel Caubet
44

This is probably sufficient:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node)
{
  echo $node->nodeValue.': '.$node->getAttribute("href")."\n";
}
Matthew
  • 2
    Is `$html` the path to the file? Thanks for such a quick answer :D – Toni Michel Caubet Dec 12 '10 at 18:53
  • 2
    @Toni, `$html` is the string containing the HTML. You can use `$dom->loadHTMLFile()` to load directly from a file. (You may want to prefix it with `@` to suppress warnings.) – Matthew Dec 12 '10 at 18:54
  • Wow, thank you very much! It seems like it's almost done. I can get the links, but I am having trouble with the names or titles (I tried both). – Toni Michel Caubet Dec 12 '10 at 19:06
  • I don't know what you mean by names or titles. The `$node->nodeValue` is the name of the bookmark. – Matthew Dec 12 '10 at 20:03
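
The `loadHTMLFile()` variant mentioned in the comments might look like the following. It is only a sketch: the temporary file stands in for an uploaded bookmarks file, and `libxml_use_internal_errors()` is used as an alternative to the `@` prefix.

```php
<?php
// Stand-in for an uploaded file: write some sample markup to a temp file.
$path = tempnam(sys_get_temp_dir(), 'bm');
file_put_contents($path, '<p><a href="https://example.com/">Example</a></p>');

$dom = new DOMDocument;

// Keep parser warnings out of the output without using @.
libxml_use_internal_errors(true);
$loaded = $dom->loadHTMLFile($path);
libxml_clear_errors();

$lines = [];
if ($loaded) {
    foreach ($dom->getElementsByTagName('a') as $node) {
        // nodeValue is the link text, i.e. the bookmark name.
        $lines[] = $node->nodeValue . ': ' . $node->getAttribute('href');
    }
}
unlink($path);

echo implode("\n", $lines), "\n";
```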
6

Assuming the stored links are in an HTML file, the best solution is probably to use an HTML parser such as PHP Simple HTML DOM Parser (I have never tried it myself). The other option is to search using basic string functions or regexps, but you should probably never use a regexp to parse HTML.

After reading the HTML file with the parser, use its functions to find the a tags.

From the tutorial:

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>'; 
Simon Groenewolt
2
$html = file_get_contents('your file path');

$dom = new DOMDocument;
//Suppress the warnings libxml emits for malformed HTML
@$dom->loadHTML($html);

$styles = $dom->getElementsByTagName('link');
$links = $dom->getElementsByTagName('a');
$scripts = $dom->getElementsByTagName('script');

//Stylesheets and other <link> elements
foreach ($styles as $style) {
    if ($style->getAttribute('href') != "#") {
        echo $style->getAttribute('href');
        echo '<br>';
    }
}

//Anchors, skipping placeholder "#" links
foreach ($links as $link) {
    if ($link->getAttribute('href') != "#") {
        echo $link->getAttribute('href');
        echo '<br>';
    }
}

//External scripts; inline scripts have no "src" and are skipped
foreach ($scripts as $script) {
    if ($script->getAttribute('src') != "") {
        echo $script->getAttribute('src');
        echo '<br>';
    }
}
KittMedia
1

I wanted to create a CSV of link paths and their text from HTML pages so I could rip menus etc. from sites.

In this example you specify the domain you are interested in, so you don't get off-site links, and it then produces one CSV per document.

/**
 * Extracts links to the given domain from the files and creates CSVs of the links
 */


$LinkExtractor = new LinkExtractor('https://www.example.co.uk');

$LinkExtractor->extract(__DIR__ . '/hamburger.htm');
$LinkExtractor->extract(__DIR__ . '/navbar.htm');
$LinkExtractor->extract(__DIR__ . '/footer.htm');

class LinkExtractor {
    public $domain;

    public function __construct($domain) {
      $this->domain = $domain;
    }

    public function extract($file) {
        $html = file_get_contents($file);
        //Create a new DOM document
        $dom = new DOMDocument;

        //Parse the HTML. The @ suppresses the warnings that libxml emits
        //when the $html string isn't well-formed HTML.
        @$dom->loadHTML($html);

        //Get all links. You could also use any other tag name here,
        //like 'img' or 'table', to extract other tags.
        $links = $dom->getElementsByTagName('a');

        $results = [];
        //Iterate over the extracted links and collect the ones on the given domain
        foreach ($links as $link){
            //Extract and put the matching links in an array for the CSV
            $href = $link->getAttribute('href');
            $parts = parse_url($href);
            //Relative links have no host, so check that it is set before comparing
            if (!empty($parts['path']) && !empty($parts['host']) && strpos($this->domain, $parts['host']) !== false) {
                $results[$parts['path']] = [$parts['path'], $link->nodeValue];
            }
        }

        asort($results);
        // Make the CSV
        $fp = fopen($file .'.csv', 'w');
        foreach ($results as $fields) {
            fputcsv($fp, $fields);
        }
        fclose($fp);
    }
}
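
One thing to watch in the domain check above: `strpos($this->domain, $parts['host'])` does a substring match against the whole domain string, so it is fairly loose. A possible stricter variant compares the parsed hosts directly. The helper name `isSameHost` is hypothetical and not part of the class above:

```php
<?php
// Hypothetical helper: true when $href is an absolute URL on the same host as $domain.
function isSameHost(string $domain, string $href): bool
{
    $wanted = parse_url($domain, PHP_URL_HOST);
    $actual = parse_url($href, PHP_URL_HOST);

    // parse_url() gives null (or false) when there is no host, e.g. for relative links.
    if (!is_string($wanted) || !is_string($actual)) {
        return false;
    }
    return strcasecmp($wanted, $actual) === 0;
}

var_dump(isSameHost('https://www.example.co.uk', 'https://www.example.co.uk/menu')); // bool(true)
var_dump(isSameHost('https://www.example.co.uk', 'https://example.co/menu'));        // bool(false)
var_dump(isSameHost('https://www.example.co.uk', '/relative/path'));                 // bool(false)
```

The second call is the case the substring check would let through, because `example.co` occurs inside `https://www.example.co.uk`.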
Tom Gould
  • What the OP is asking for support with has already been answered years ago in the same fashion as in your answer. The fact that you are implementing a larger task does not make it relevant to this question. If you can find a question that is seeking all of the functionality that you are providing, please post your answer there. Stack Overflow's usability as a researcher's tool is damaged when there are many posts that give the same answer on the same page, because a thorough researcher will waste time reading redundant advice. – mickmackusa Jun 22 '19 at 05:35
0

Here is some code I wrote for one of my clients, wrapped in a function so it can be reused anywhere.

function getValidUrlsFrompage($source)
  {
    $links = [];
    $content = file_get_contents($source);
    $content = strip_tags($content, "<a>");
    $subString = preg_split("/<\/a>/", $content);
    foreach ($subString as $val) {
      if (strpos($val, "<a href=") !== FALSE) {
        $val = preg_replace("/.*<a\s+href=\"/sm", "", $val);
        $val = preg_replace("/\".*/", "", $val);
        $val = trim($val);
      }
      if (strlen($val) > 0 && filter_var($val, FILTER_VALIDATE_URL)) {
        if (!in_array($val, $links)) {
          $links[] = $val;
        }
      }
    }
    return $links;
  }

Use it like this:

$links = getValidUrlsFrompage("https://www.w3resource.com/");

The expected output is an array of URLs (indices 0 through 100 in the run below):

Array ( [0] => https://www.w3resource.com [1] => https://www.w3resource.com/html/HTML-tutorials.php [2] => https://www.w3resource.com/css/CSS-tutorials.php [3] => https://www.w3resource.com/javascript/javascript.php [4] => https://www.w3resource.com/html5/introduction.php [5] => https://www.w3resource.com/schema.org/introduction.php [6] => https://www.w3resource.com/phpjs/use-php-functions-in-javascript.php [7] => https://www.w3resource.com/twitter-bootstrap/tutorial.php [8] => https://www.w3resource.com/responsive-web-design/overview.php [9] => https://www.w3resource.com/zurb-foundation3/introduction.php [10] => https://www.w3resource.com/pure/ [11] => https://www.w3resource.com/html5-canvas/ [12] => https://www.w3resource.com/course/javascript-course.html [13] => https://www.w3resource.com/icon/ [14] => https://www.w3resource.com/linux-system-administration/installation.php [15] => https://www.w3resource.com/linux-system-administration/linux-commands-introduction.php [16] => https://www.w3resource.com/php/php-home.php [17] => https://www.w3resource.com/python/python-tutorial.php [18] => https://www.w3resource.com/java-tutorial/ [19] => https://www.w3resource.com/node.js/node.js-tutorials.php [20] => https://www.w3resource.com/ruby/ [21] => https://www.w3resource.com/c-programming/programming-in-c.php [22] => https://www.w3resource.com/sql/tutorials.php [23] => https://www.w3resource.com/mysql/mysql-tutorials.php [24] => https://w3resource.com/PostgreSQL/tutorial.php [25] => https://www.w3resource.com/sqlite/ [26] => https://www.w3resource.com/mongodb/nosql.php [27] => https://www.w3resource.com/API/google-plus/tutorial.php [28] => https://www.w3resource.com/API/youtube/tutorial.php [29] => https://www.w3resource.com/API/google-maps/index.php [30] => https://www.w3resource.com/API/flickr/tutorial.php [31] => https://www.w3resource.com/API/last.fm/tutorial.php [32] => https://www.w3resource.com/API/twitter-rest-api/ [33] => https://www.w3resource.com/xml/xml.php 
[34] => https://www.w3resource.com/JSON/introduction.php [35] => https://www.w3resource.com/ajax/introduction.php [36] => https://www.w3resource.com/html-css-exercise/index.php [37] => https://www.w3resource.com/javascript-exercises/ [38] => https://www.w3resource.com/jquery-exercises/ [39] => https://www.w3resource.com/jquery-ui-exercises/ [40] => https://www.w3resource.com/coffeescript-exercises/ [41] => https://www.w3resource.com/php-exercises/ [42] => https://www.w3resource.com/python-exercises/ [43] => https://www.w3resource.com/c-programming-exercises/ [44] => https://www.w3resource.com/csharp-exercises/ [45] => https://www.w3resource.com/java-exercises/ [46] => https://www.w3resource.com/sql-exercises/ [47] => https://www.w3resource.com/oracle-exercises/ [48] => https://www.w3resource.com/mysql-exercises/ [49] => https://www.w3resource.com/sqlite-exercises/ [50] => https://www.w3resource.com/postgresql-exercises/ [51] => https://www.w3resource.com/mongodb-exercises/ [52] => https://www.w3resource.com/twitter-bootstrap/examples.php [53] => https://www.w3resource.com/euler-project/ [54] => https://w3resource.com/w3skills/html5-quiz/ [55] => https://w3resource.com/w3skills/php-fundamentals/ [56] => https://w3resource.com/w3skills/sql-beginner/ [57] => https://w3resource.com/w3skills/python-beginner-quiz/ [58] => https://w3resource.com/w3skills/mysql-basic-quiz/ [59] => https://w3resource.com/w3skills/javascript-basic-skill-test/ [60] => https://w3resource.com/w3skills/javascript-advanced-quiz/ [61] => https://w3resource.com/w3skills/javascript-quiz-part-iii/ [62] => https://w3resource.com/w3skills/mongodb-basic-quiz/ [63] => https://www.w3resource.com/form-template/ [64] => https://www.w3resource.com/slides/ [65] => https://www.w3resource.com/convert/number/binary-to-decimal.php [66] => https://www.w3resource.com/excel/ [67] => https://www.w3resource.com/video-tutorial/php/some-basics-of-php.php [68] => 
https://www.w3resource.com/video-tutorial/javascript/list-of-tutorial.php [69] => https://www.w3resource.com/web-development-tools/firebug-tutorials.php [70] => https://www.w3resource.com/web-development-tools/useful-web-development-tools.php [71] => https://www.facebook.com/w3resource [72] => https://twitter.com/w3resource [73] => https://plus.google.com/+W3resource [74] => https://in.linkedin.com/in/w3resource [75] => https://feeds.feedburner.com/W3resource [76] => https://www.w3resource.com/ruby-exercises/ [77] => https://www.w3resource.com/graphics/matplotlib/ [78] => https://www.w3resource.com/python-exercises/numpy/index.php [79] => https://www.w3resource.com/python-exercises/pandas/index.php [80] => https://w3resource.com/plsql-exercises/ [81] => https://w3resource.com/swift-programming-exercises/ [82] => https://www.w3resource.com/angular/getting-started-with-angular.php [83] => https://www.w3resource.com/react/react-js-overview.php [84] => https://www.w3resource.com/vue/installation.php [85] => https://www.w3resource.com/jest/jest-getting-started.php [86] => https://www.w3resource.com/numpy/ [87] => https://www.w3resource.com/php/composer/a-gentle-introduction-to-composer.php [88] => https://www.w3resource.com/php/PHPUnit/a-gentle-introduction-to-unit-test-and-testing.php [89] => https://www.w3resource.com/laravel/laravel-tutorial.php [90] => https://www.w3resource.com/oracle/index.php [91] => https://www.w3resource.com/redis/index.php [92] => https://www.w3resource.com/cpp-exercises/ [93] => https://www.w3resource.com/r-programming-exercises/ [94] => https://w3resource.com/w3skills/ [95] => https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en_US [96] => https://www.w3resource.com/privacy.php [97] => https://www.w3resource.com/about.php [98] => https://www.w3resource.com/contact.php [99] => https://www.w3resource.com/feedback.php [100] => https://www.w3resource.com/advertise.php )

Hope this helps someone. Here is a gist: https://gist.github.com/ManiruzzamanAkash/74cffb9ffdfc92f57bd9cf214cf13491
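
The regular expressions above only match anchors written exactly as `<a href="...">`. A DOM-based sketch of the same idea (unique, valid absolute URLs) is shown below. It takes an HTML string rather than a URL so it can be tried offline, and the function name `getValidUrlsFromHtml` is made up for this example:

```php
<?php
function getValidUrlsFromHtml(string $html): array
{
    $dom = new DOMDocument;
    // Keep libxml warnings about malformed markup out of the output.
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $url = trim($a->getAttribute('href'));
        // Keep only absolute URLs that pass PHP's validator, without duplicates.
        if ($url !== '' && filter_var($url, FILTER_VALIDATE_URL) && !in_array($url, $links, true)) {
            $links[] = $url;
        }
    }
    return $links;
}

// Sample markup: extra attributes, single quotes, a duplicate and a relative link.
$html = '<a class="x" href="https://example.com/a">A</a>'
      . "<a href='https://example.com/b'>B</a>"
      . '<a href="https://example.com/a">A again</a>'
      . '<a href="/relative">Relative</a>';

print_r(getValidUrlsFromHtml($html));
```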

Maniruzzaman Akash