-2

I have a folder structure like this example:

Groups 
- apple
-- ahen45.html
-- rev34.html
-- ......

- bat
-- fsf.html
-- ere.html
--....

...

Groups is the parent folder; apple, bat, etc. are sub folders.

There are more than 500 sub folders like this, containing more than 20,000 HTML files. I'm trying to read those HTML files with PHP and extract the title, meta keywords, and body from each one, using the sub folder name as the category.

<?php
$file =$_SERVER["DOCUMENT_ROOT"];
$dir = new RecursiveDirectoryIterator('groups/',
    FilesystemIterator::SKIP_DOTS);

$it  = new RecursiveIteratorIterator($dir,
    RecursiveIteratorIterator::SELF_FIRST);

$it->setMaxDepth(1);

foreach ($it as $fileinfo) {
    if ($fileinfo->isDir()) {
       echo $category = $fileinfo->getFilename();

    }
    else if ($fileinfo->isFile()) {
        $fileinfo->getFilename();
        $myURL = $file.'/group/groups/'.$category.'/'.$fileinfo->getFilename();

        $doc = new DOMDocument();
        $doc->loadHTMLFile($myURL);

        $elements = $doc->getElementsByTagName('meta');
        $elements = $doc->getElementsByTagName('title');
        $elements = $doc->getElementsByTagName('body'); 

    foreach ($elements as $el) {
            echo $el->nodeValue, PHP_EOL;
    }

    }
}
?>

When I try this, it parses the whole page and gives warnings that other tags are unclosed. What can I do to make it work correctly?

Wazan
  • So what have you tried to this point? What problems are you having? or are you just wanting someone to do this for you? – Mike Brant Sep 05 '13 at 04:44
  • Yes, I need a solution to this; I don't have any idea how to do it. – Wazan Sep 05 '13 at 04:47
  • Well then you have come to the wrong place. It is expected here that you at least put forth some effort to solve the problem yourself before asking a question. – Mike Brant Sep 05 '13 at 04:48

2 Answers

1

Follow this procedure:

  1. Read the directory using readdir().
  2. Then read all the HTML files using glob(). See "How to list files and folder in a dir (PHP)".
  3. Use get_meta_tags() to get the meta tags. For the title, see "How can I get the title of an HTML page using php?"; the same code works for the body if you change the preg_match() pattern. You can also try "Getting title and meta tags from external website".

Try the steps above and you should make some progress. Then come back with a new question if you get stuck.
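
The steps above could be sketched roughly as follows. This is only an illustration, not code from the linked questions: the helper name extractPage(), the regular expressions, and the 'groups/*/*.html' pattern are assumptions based on the folder layout in the question. Regex matching on HTML is fragile, so treat it as a starting point.

```php
<?php
// Hypothetical helper (name and return shape are assumptions):
// pulls the title, meta keywords and body text out of one HTML file.
function extractPage(string $path): array
{
    $html = file_get_contents($path);
    if ($html === false) {
        $html = '';
    }

    // get_meta_tags() returns an array keyed by the lowercased meta name.
    $meta = get_meta_tags($path);
    if ($meta === false) {
        $meta = [];
    }

    // Same idea for title and body: only the tag name in the pattern changes.
    $title = preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) ? trim($m[1]) : '';
    $body  = preg_match('~<body[^>]*>(.*?)</body>~is', $html, $m) ? trim(strip_tags($m[1])) : '';

    return [
        'category' => basename(dirname($path)), // sub folder name
        'title'    => $title,
        'keywords' => $meta['keywords'] ?? '',
        'body'     => $body,
    ];
}

// The wildcard for the sub folder lists every HTML file one level down.
foreach (glob('groups/*/*.html') ?: [] as $path) {
    $page = extractPage($path);
    echo $page['category'], ': ', $page['title'], PHP_EOL;
}
```

With around 20,000 files it is probably better to process each file inside the loop (for example, insert it into a database) than to collect everything in one array first.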

Community
Rohan Kumar
0
<?php
$file = $_SERVER["DOCUMENT_ROOT"];
$dir = new RecursiveDirectoryIterator('groups/',
    FilesystemIterator::SKIP_DOTS);

// SELF_FIRST yields each sub folder before the files inside it,
// so $category is already set when its files are reached.
$it  = new RecursiveIteratorIterator($dir,
    RecursiveIteratorIterator::SELF_FIRST);

$it->setMaxDepth(1);

foreach ($it as $fileinfo) {
    if ($fileinfo->isDir()) {
        echo $category = $fileinfo->getFilename();
    }
    else if ($fileinfo->isFile()) {
        $myURL = $file.'/group/groups/'.$category.'/'.$fileinfo->getFilename();

        $doc = new DOMDocument();
        // Parser options only take effect if set before the document is loaded.
        $doc->strictErrorChecking = false;
        $doc->recover = true;
        $doc->formatOutput = true;
        // @ hides the warnings about unclosed or unknown tags.
        @$doc->loadHTMLFile($myURL);

        $metas = $doc->getElementsByTagName('meta');
        $elements1 = $doc->getElementsByTagName('title');
        $elements2 = $doc->getElementsByTagName('body');

        // Print only the keywords meta tag.
        for ($i = 0; $i < $metas->length; $i++) {
            $meta = $metas->item($i);
            if ($meta->getAttribute('name') == 'keywords') {
                echo $keywords = $meta->getAttribute('content');
                echo "<br/>";
            }
        }

        // Title text.
        foreach ($elements1 as $el1) {
            echo $el1->nodeValue, PHP_EOL;
            echo "<br/>";
        }
        // Body text.
        foreach ($elements2 as $el2) {
            echo $el2->nodeValue, PHP_EOL;
            echo "<br/>";
        }
    }
    echo "<hr>";
}

?>
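
As a possible alternative to the @ operator, libxml_use_internal_errors() can keep the parser warnings out of the output while still letting DOMDocument recover from broken markup. The sketch below is an assumption about how that might look, not part of the code above: the helper name parseHtml() and the returned array shape are hypothetical.

```php
<?php
// Hypothetical helper: parse one HTML string and return title, keywords, body.
function parseHtml(string $html): array
{
    $result = ['title' => '', 'keywords' => '', 'body' => ''];
    if ($html === '') {
        return $result; // loadHTML() rejects an empty string
    }

    // Buffer libxml warnings (unclosed tags etc.) instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'keywords') {
            $result['keywords'] = $meta->getAttribute('content');
            break;
        }
    }

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $bodyNode  = $doc->getElementsByTagName('body')->item(0);

    if ($titleNode !== null) {
        $result['title'] = trim($titleNode->textContent);
    }
    if ($bodyNode !== null) {
        $result['body'] = trim($bodyNode->textContent);
    }

    return $result;
}

// Example input with an unclosed <p>; no warning is printed.
$page = parseHtml('<html><head><title>Demo</title><meta name="keywords" content="a, b"></head><body><p>Unclosed paragraph<div>text</div></body></html>');
print_r($page);
```

Inside the loop above, this could be called as parseHtml(file_get_contents($myURL)), with the sub folder name stored alongside the result as the category.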
Wazan