Regex erase control characters except between words

Question

I got an array from a database with products and I want to split/seperate the large description text below in smaller chunks of product attribute name and value. Ultimately I'm striving for database normalization, as I'm currently trying to create a import tool for 2 different database designs.

The array I get from old product table:

        Array

            (
                [0] => Array
                    (
                        [product_id] => 219
                        [product_description] =>
<table style="color:; text-align: left;">
<tr>
<td>
Processor:
</td>
<td>
        Intel Core 2 Duo - E8400
</td>
</tr>
<tr>
<td>
Clock speed:
</td>
<td>
        3.0 GHz
</td>
</tr>
<tr>
<td>
Memory:
</td>
<td>
        4 GB
</td>
</tr>
<tr>
<td>
Hard disk:
</td>
<td>
        250 GB
</td>
</tr>
<tr>
<td>
Video-adapter:
</td>
<td>
        VGA, Display
</td>
</tr>
<tr>
<td>
Netwerk card:
</td>
<td>
        1000 Mbps LAN
</td>
</tr>
<tr>
<td>
Optical drive:
</td>
<td>
        DVD-Rewriter
</td>
</tr>
<tr>
<td>
Operating system:
</td>
<td>
        Windows 7 or 10 Pro
</td>
</tr>
<tr>
<td>
Warranty:
</td>
<td>
        1 year
</td>
</tr>
</table>
                    )
            )

My code so far:

$sth = $dbh->prepare("SELECT * from products WHERE product_status_id = '1' ORDER BY order_num ASC");
$sth->execute();
$result = $sth->fetchAll(PDO::FETCH_ASSOC);

$output = array();

$tdpattern = "!<td>(.*?)</td>!is";

foreach ($result as $key=>$val)  {
    preg_match_all($tdpattern, $val['product_description'], $result);
    foreach ($result as $key => $arr) {
        foreach ($arr as $key2 => $description) {
            $output[] = preg_replace('/\n^[\x0a\x20]+|[\x0a\x20]+$/','',$description);
        }
    }
}

// return $output to controller

As you can see below, the ouput shows multiple spaces infront of words but not between them, there are also newlines that should be erased. How can I erase all those control characters such as line feeds and whitespaces except 1 space between words for every array element so ideally it becomes like the layout at the bottom?

Array
(
    [0] => Processor
    [1] =>         IntelCore2-E5500
    [2] => Clockspeed
    [3] =>         2.93GHz
    [4] => Memory
    [5] =>         4GB
    [6] => Harddisk
    [7] =>         250GB
    [8] => Video-adapter
    [9] =>         VGA,Display
    [10] => Netwerkcard
    [11] =>         1000mbpsLAN
    [12] => Opticaldrive
    [13] =>         DVD-Rewriter
    [14] => Operatingsystem
    [15] =>         Windows7or10Pro
    [16] => Warranty
    [17] =>         2jaar
)

I wish to convert it to this layout:

[219] => array (
    [product_description] => array (
        [processor] => Intel Core 2 - E5500
        [clock speed] => 2.93 GHz
        [memory] => 2.93 GHz
        [hard disk] => 2.93 GHz
        [video adapter] => 2.93 GHz
        [network card] => DVD Rewriter
        [optical drive] => DVD Rewriter
        [operating system] => Windows 7 or 10 Pro
        [warranty] = > 2 years
    )
)

Some directions would be great, specifically how to improve the regex.

Regx is a poor choice for parsing "HTML", you'll have to use `preg_match_all` I would make a tokenizer for it or use something like `PHPQuery` — ArtisticPhoenix, Dec 02 '17 at 00:03
Here is a Tokenizer/lexer I wrote for `Json` you can do similar for this. https://github.com/ArtisticPhoenix/MISC/blob/master/JasonDecoder.php Maybe I will modify it for you... — ArtisticPhoenix, Dec 02 '17 at 00:06
Possible duplicate of [Remove multiple whitespaces](https://stackoverflow.com/questions/2326125/remove-multiple-whitespaces) — guido, Dec 02 '17 at 00:09

Lawrence Cherone · Accepted Answer · 2017-12-02T01:16:12.810

Don't use regex to parse HTML, use DomDocument.

<?php
//...
$result = $sth->fetchAll(PDO::FETCH_ASSOC);

$dom_err = libxml_use_internal_errors(true);
$dom = new DOMDocument();

foreach ($result as $key => $val)  {

    // fix product_description
    $product_description = [];
    if (!empty($val['product_description'])) {
        $html = $val['product_description'];

        // proccess
        $dom->loadHTML($html);
        foreach ($dom->getElementsByTagName('td') as $i => $td) {
            if ($i % 2 == 0) {
                $label = strtolower(trim($td->nodeValue));
                $label = str_replace('-', ' ', trim($label, ':'));
            } else {
                $product_description[$label] = trim($td->nodeValue);
            }
        }
    }
    $val['product_description'] = $product_description;

    // ... rest
}

libxml_clear_errors();
libxml_use_internal_errors($dom_err);

Example:

https://3v4l.org/vECil

Result:

Array
(
    [processor] => Intel Core 2 Duo - E8400
    [clock speed] => 3.0 GHz
    [memory] => 4 GB
    [hard disk] => 250 GB
    [video adapter] => VGA, Display
    [netwerk card] => 1000 Mbps LAN
    [optical drive] => DVD-Rewriter
    [operating system] => Windows 7 or 10 Pro
    [warranty] => 1 year
)

score 0 · Answer 2 · answered Dec 02 '17 at 00:07

Source: https://stackoverflow.com/a/2326239/5245032

<?php
$str = "This is  a string       with
spaces, tabs and newlines present";

$stripped = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $str);

echo $str;
echo "\n---\n";
echo "$stripped";
?>

This outputs

This is  a string   with
spaces, tabs and newlines present
---
This is a string with spaces, tabs and newlines present

slevy1 · Answer 3 · 2017-12-02T23:51:52.620

Given an array that appears as follows:

<?php

$a = [ 0 => [ "product_id" => 219,
              "product_description" => "<table style=\"color:; text-align: left;\">
<tr>
<td>
Processor:
</td>
<td>
        Intel Core 2 Duo - E8400
</td>
</tr>
<tr>
<td>
Clock speed:
</td>
<td>
        3.0 GHz
</td>
</tr>
<tr>
<td>
Memory:
</td>
<td>
        4 GB
</td>
</tr>
<tr>
<td>
Hard disk:
</td>
<td>
        250 GB
</td>
</tr>
<tr>
<td>
Video-adapter:
</td>
<td>
        VGA, Display
</td>
</tr>
<tr>
<td>
Netwerk card:
</td>
<td>
        1000 Mbps LAN
</td>
</tr>
<tr>
<td>
Optical drive:
</td>
<td>
        DVD-Rewriter
</td>
</tr>
<tr>
<td>
Operating system:
</td>
<td>
        Windows 7 or 10 Pro
</td>
</tr>
<tr>
<td>
Warranty:
</td>
<td>
        1 year
</td>
</tr>
</table>"]
     ];

You may use the DOM to pase it or other libraries. Also, the string value may be processed using a variety of PHP built-in functions, as illustrated in this example:

<?php

$arr = array_pop($a);
$str =  $arr["product_description"];

$stripped = strip_tags( $str, "<td>" );
$replaced = str_replace( "</td>", "", $stripped );
$arr = explode( "<td>", $replaced );
array_shift( $arr );

$arrKeyVal=[];

for( $i=0, $max = count( $arr ); $i < $max; $i+=2 ) {
       $key = trim( $arr[$i],"\r\t\n :" );
       $arrKeyVal[strtolower( $key )] = trim( $arr[$i+1] );
}
print_r( $arrKeyVal );

See live code

The code uses array_pop() to extract the nested associative array. The "product_description" element's value is assigned to $str for easier handling. The string is stripped of all tags except the "<td>". The closing td tag is replaced by an empty string. The string is then split on the opening td tag. The resulting array has an initial element that is empty so it is shifted off the array. Then the code uses a loop to construct an associative array such that based on an array element's parity it is either a key or a value in the array. Also, every element of $arr is trimmed so that superfluous white space is removed as well as the trailing colon. And, strtolower() makes sure that every key appears in lowercase.

Using a regex is generally inadvisable; see here.

Thanks, a dom parser is indeed the way to go. I did had to change it a bit for example: $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8')); and then to remove the nbsp charachters in the foreach loop $td->nodeValue = preg_replace( "~\x{00a0}~siu", "", $td->nodeValue ); and now I have it as I wanted. — Xenioz, Dec 08 '17 at 09:10

Regex erase control characters except between words

3 Answers3