I am trying to do the following with PHP...

  1. Read a directory
  2. Find all .md and .markdown files
  3. Read the first 2 lines of these Markdown files.
  4. If a line of the form `Title: Title for the file` is found on line 1, add the title to the array
  5. If a line of the form `Description: Short description` is found on line 2, add the description to the array
  6. If a sub-directory is found, repeat steps 1-5 on it
  7. Should now have a nice list/array
  8. Print this list/array to screen to show up like this....

Directory 1 Name

<a href="LINK TO MARKDOWN FILE 1"> TITLE from line 1 of Markdown FILE 1</a> <br>
Description from Markdown FILE 1 line 2

<a href="LINK TO MARKDOWN FILE 2"> TITLE from line 1 of Markdown FILE 2</a> <br>
Description from Markdown FILE 2 line 2

<a href="LINK TO MARKDOWN FILE 3"> TITLE from line 1 of Markdown FILE 3</a> <br>
Description from Markdown FILE 3 line 2

Directory 2 Name

<a href="LINK TO MARKDOWN FILE 1"> TITLE from line 1 of Markdown FILE 1</a> <br>
Description from Markdown FILE 1 line 2

<a href="LINK TO MARKDOWN FILE 2"> TITLE from line 1 of Markdown FILE 2</a> <br>
Description from Markdown FILE 2 line 2

<a href="LINK TO MARKDOWN FILE 3"> TITLE from line 1 of Markdown FILE 3</a> <br>
Description from Markdown FILE 3 line 2

etc..........

Code so far

function getFilesFromDir($dir)
{
    $files = array();
    // Scan the directory passed into the function
    if ($handle = opendir($dir)) {
        while (false !== ($file = readdir($handle))) {

            // If file is .md or .markdown continue
            if (preg_match('/\.(md|markdown)$/', $file)) {

                // Grab first 2 lines of Markdown file
                $content = file($dir . '/' . $file);
                $title = $content[0];
                $description = $content[1];

                // If first 2 lines of Markdown file have a 
                // "Title: file title" and "Description: file description" lines we then
                // add these key/value pairs to the array for meta data

                // Match Title line
                $pattern = '/^(Title|Description):(.+)/';
                if (preg_match($pattern, $title, $matched)) {
                    $title = trim($matched[2]);
                }

                // match Description line 
                if (preg_match($pattern, $description, $matched)) {
                    $description = trim($matched[2]);
                }

                // Add .md and .markdown files and folder path to array
                // Add captured Title and Description to array as well
                $files[$dir][] = array("filepath" => $dir . '/' . $file,
                                       "title" => $title,
                                       "description" => $description
                                    );

            }
        }
        closedir($handle);
    }

    return $files;
}

Usage

$dir = 'mdfiles';
$fileArray = getFilesFromDir($dir);

Help needed

So far the code works on a single directory. It still needs to handle sub-directories, and the way it reads the first 2 lines and then runs the regex twice can probably be done differently.

I would think there is a better way, so that the regex that matches the Title and Description runs just once?

Can someone help me modify this code so that it detects and runs on sub-directories, and improve the way it reads the first 2 lines of a Markdown file to get the title and description if they exist?

I also need help printing the array to the screen. Showing the data is not the problem, I know how to do that, but the output has to be grouped so that the folder name appears at the top of each set of files, as in my demo output above.

I appreciate any help

JasonDavis

2 Answers


To recursively iterate over files, the RecursiveDirectoryIterator is quite handy (related: PHP recursive directory path). It also gives easy access to an SplFileInfo object for each entry (and, through openFile(), to an SplFileObject), which is useful in your case because you want to read the files' content.

Additionally, it's possible to parse the first two lines of the file with a single regular expression. Running the same pattern repeatedly is fine, because compiled patterns get cached. A single pattern has the benefit that the code is more structured, but the downside that the pattern is more complex. The configuration could look like this:

#
# configuration
#

$path = 'md';
$fileFilter = '~\.(md|markdown)$~';
$pattern = '~^(?:Title: (.*))?(?:(?:\r\n|\n)(?:Description: (.*)))?~u';

Just in case the Markdown files are actually UTF-8 encoded, I added the u modifier (PCRE_UTF8).
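
To see what the single pattern captures, here is a minimal sketch that applies it to a few two-line strings. The helper name parseHeader is made up for this illustration, and the trim() calls are an addition (they strip surrounding whitespace from the captured values); the pattern itself is copied unchanged from the configuration above.

```php
<?php
// Same pattern as in the configuration above.
$pattern = '~^(?:Title: (.*))?(?:(?:\r\n|\n)(?:Description: (.*)))?~u';

/**
 * Hypothetical helper: returns array(title, description) for the given
 * first two lines of a file. A missing part comes back as an empty string.
 */
function parseHeader($pattern, $lines)
{
    if (false === preg_match($pattern, $lines, $matches)) {
        return array('', ''); // e.g. invalid UTF-8 when the u modifier is set
    }
    // Pad the match array so both capture groups always exist.
    list(, $title, $description) = $matches + array('', '', '');
    // Assumption: surrounding whitespace is not wanted in the result.
    return array(trim($title), trim($description));
}

print_r(parseHeader($pattern, "Title: My page\nDescription: A short summary"));
print_r(parseHeader($pattern, "Title: Only a title"));
```

Because both groups are optional, the pattern also matches files without any header; in that case both values are simply empty.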

The processing part then runs a recursive directory iterator over $path, skips files that don't match $fileFilter, parses the first two lines of each remaining file (provided the file is readable and has at least one line), and stores the outcome in $result, an array keyed by directory:

#
# main
#

# init result array (the nice one)
$result = array();

# recursive iterator for files
$iterator = new RecursiveIteratorIterator(
               new RecursiveDirectoryIterator($path, FilesystemIterator::KEY_AS_PATHNAME | FilesystemIterator::CURRENT_AS_FILEINFO), 
               RecursiveIteratorIterator::SELF_FIRST);

foreach($iterator as $path => $info)
{
    # filter out files that don't match
    if (!preg_match($fileFilter, $path)) continue;

    # get first two lines
    try
    {
        for
        (
            $maxLines = 2,
            $lines = '',
            $file = $info->openFile()
            ; 
            !$file->eof() && $maxLines--
            ; 
            $lines .= $file->fgets()
        );
        $lines = rtrim($lines, "\n");

        if (!strlen($lines)) # skip empty files 
            continue;
    }
    catch (RuntimeException $e)
    {
        continue; # files which are not readable are skipped.
    }

    # parse md file
    $r = preg_match($pattern, $lines, $matches);
    if (FALSE === $r)
    {
        throw new Exception('Regular expression failed.');
    }
    list(, $title, $description) = $matches + array('', '', '');

    # grow result array
    # (trim surrounding whitespace, such as a stray \r from Windows line endings)
    $result[dirname($path)][] = array($path, trim($title), trim($description));
}

What's left is the output. Because the array is already grouped by directory, this is fairly straightforward: iterate over the directories first, then over the files within each:

#
# output
#

$dirCounter = 0;
foreach ($result as $name => $dirs)
{
    printf("Directory %d %s\n", ++$dirCounter, basename($name));
    foreach ($dirs as $entry)
    {
        list($path, $title, $description) = $entry;
        # double-quoted attribute: htmlspecialchars() escapes " by default
        printf("<a href=\"%s\">%s from line 1 of Markdown %s</a> <br>\n%s\n\n", 
                htmlspecialchars($path), 
                htmlspecialchars($title),               
                htmlspecialchars(basename($path)),
                htmlspecialchars($description)
              );
    }
}
hakre
  • Impressive, I would definitely call this the advanced version, thank you – JasonDavis Dec 17 '11 at 14:49
  • You're welcome. In case the default SPL recursive directory iterator is too slow (which can happen with large directory structures on some operating systems), you can even easily replace it with a faster iterator, e.g. stack based is very fast (but you need to code it yourself, no built-in SPL class for that). See as well [6 Methoden, ein Verzeichnis rekursiv zu scannen](http://www.phpgangsta.de/6-methoden-ein-verzeichnis-rekursiv-zu-scannen) (German). – hakre Dec 17 '11 at 15:02
  • I think it will do just fine for my use, however I am always interested in learning more. Would you say this SPL iterator is faster or slower than the regular `opendir` `readdir` methods? I am thinking about possibly doing some caching as well and only updating when new files are added or edited – JasonDavis Dec 17 '11 at 15:14
  • First, I would not cache within the process. You could cache the overall output if you add it as documentation to your project, so you can just put it into your build script. That's more straightforward and more flexible. I would consider the iterator a good method, both speed-wise and programming-wise. It's quite fast and uses a kind of stack internally, which is better than pure recursion. Additionally, if you run into a bottleneck, it can easily be replaced, which would not be possible if you hardcode opendir and readdir up front. So you already get much for little, which is a good sign. – hakre Dec 17 '11 at 15:38
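
For reference, the stack-based alternative mentioned in the comments could be sketched roughly as follows. This is only an illustration: the function name scanWithStack is invented here, it returns every file it finds (filtering by extension is left to the caller), and no claim is made about how its speed compares to the SPL iterator.

```php
<?php
/**
 * Hypothetical sketch of a stack-based directory walk: directories still
 * to be visited are kept on an explicit array instead of the call stack.
 */
function scanWithStack($root)
{
    $found = array();
    $stack = array($root);

    while ($stack) {
        $dir = array_pop($stack);
        $entries = @scandir($dir);
        if ($entries === false) {
            continue; // unreadable directory, skip it
        }
        foreach ($entries as $entry) {
            if ($entry === '.' || $entry === '..') {
                continue; // never walk into the current or parent directory
            }
            $path = $dir . '/' . $entry;
            if (is_dir($path)) {
                $stack[] = $path;  // visit later
            } else {
                $found[] = $path;  // plain file
            }
        }
    }

    return $found;
}
```

The result could then be narrowed down with something like preg_grep('~\.(md|markdown)$~', scanWithStack('md')).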

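To recurse into sub-directories, add an elseif branch to the loop in your function. Note that is_dir() needs the full path, and that readdir() also returns the "." and ".." entries, which must be skipped:

if (preg_match('/\.(md|markdown)$/', $file)) {
   // ...
} elseif ($file !== '.' && $file !== '..' && is_dir($dir . '/' . $file)) {
    // Recurse and merge the sub-directory's results into the current array
    $files = array_merge($files, getFilesFromDir($dir . '/' . $file));
}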
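
Put together as a complete function, the recursion might look like the sketch below. The name getFilesFromDirRecursive is made up so it does not clash with the original function, and the array_pad() guard for files with fewer than two lines is an addition that is not part of the question's code.

```php
<?php
/**
 * Sketch: collect .md/.markdown files grouped by directory, descending
 * into sub-directories. Returns array(dir => array(file data, ...)).
 */
function getFilesFromDirRecursive($dir)
{
    $files = array();
    $handle = opendir($dir);
    if ($handle === false) {
        return $files; // unreadable directory: nothing to report
    }

    while (false !== ($file = readdir($handle))) {
        if ($file === '.' || $file === '..') {
            continue; // readdir() also returns these two entries
        }
        $path = $dir . '/' . $file;

        if (is_dir($path)) {
            // Keys are directory paths, so merging keeps one group per directory
            $files = array_merge($files, getFilesFromDirRecursive($path));
        } elseif (preg_match('/\.(md|markdown)$/', $file)) {
            // Assumption: a missing first or second line counts as empty
            $content = array_pad((array) file($path), 2, '');
            $files[$dir][] = array(
                'filepath'    => $path,
                'title'       => trim(preg_replace('/^Title:(.+)/', '$1', $content[0])),
                'description' => trim(preg_replace('/^Description:(.+)/', '$1', $content[1])),
            );
        }
    }
    closedir($handle);

    return $files;
}
```

It is called the same way as the original, for example $fileArray = getFilesFromDirRecursive('mdfiles');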

Running the regex twice isn't so bad, and may be better than trying to cobble something together across both lines. However, you could achieve the same result with preg_replace:

$title = trim(preg_replace('/^Title:(.+)/', '$1', $content[0]));
$description = trim(preg_replace('/^Description:(.+)/', '$1', $content[1]));
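
Since file() loads the whole file just to use two lines, one possible refinement is to read only what is needed. The following is a sketch under that assumption; readFirstLines and readMeta are hypothetical helper names, not part of the original code. Note that when a label is absent, preg_replace returns the line unchanged, so the raw line ends up as the title or description.

```php
<?php
/**
 * Hypothetical helper: read at most $max lines from the start of a file,
 * without loading the rest. Missing lines are returned as empty strings.
 */
function readFirstLines($path, $max = 2)
{
    $lines = array();
    $fh = @fopen($path, 'r');
    if ($fh !== false) {
        while (count($lines) < $max && false !== ($line = fgets($fh))) {
            $lines[] = rtrim($line, "\r\n"); // drop the line ending
        }
        fclose($fh);
    }
    return array_pad($lines, $max, '');
}

/**
 * Hypothetical helper: strip the "Title:" / "Description:" labels from the
 * first two lines, using the same preg_replace calls as above.
 */
function readMeta($path)
{
    list($first, $second) = readFirstLines($path);
    return array(
        'title'       => trim(preg_replace('/^Title:(.+)/', '$1', $first)),
        'description' => trim(preg_replace('/^Description:(.+)/', '$1', $second)),
    );
}
```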

To output your array as in the example, try this:

foreach ($fileArray as $directory => $files) {
    echo $directory . "\n\n";

    foreach ($files as $fileData) {
        echo '<a href="' . $fileData['filepath'] . '">' . $fileData['title'] . "</a><br />\n";
        echo $fileData['description'] . "\n\n";
    }
}
cmbuckley