
Given input that shows tag assignments to images, as follows (I am reading this from php://stdin line by line, as the input can get rather large):

image_a tag_lorem
image_a tag_ipsum
image_a tag_amit
image_b tag_sit
image_b tag_dolor
image_b tag_ipsum
... (more lines follow; there may be up to a million)

The expected output is shown below. It is basically the same format, with an extra field showing whether the image-tag combination exists in the input. Note that for every image, all available tags are listed, with a 1/0 at the end of each line indicating whether the tag is assigned to that image:

image_a tag_sit 0
image_a tag_lorem 1
image_a tag_dolor 0
image_a tag_ipsum 1
image_a tag_amit 1
image_b tag_sit 1
image_b tag_lorem 0
image_b tag_dolor 1
image_b tag_ipsum 1
image_b tag_amit 0
... (more)

I have posted my not-so-efficient solution below. To give a better picture of the input and output, I fed 745 rows (describing the tag assignments of 10 images) into the script via stdin, and I received 555025 lines after execution, using about 0.4 MB of memory. However, it may wear out the hard disk faster because of the heavy disk I/O (while writing to and reading from the temporary column cache file).

Is there any other way of doing this? I have another script that can turn the stdin into something like this (not sure if this is useful):

image_foo tag_lorem tag_ipsum tag_amit
image_bar tag_sit tag_dolor tag_ipsum

p/s: the order of tag_* is not important, but it has to be the same for all rows, i.e. this is not what I want (notice the tag_* order is inconsistent between image_foo and image_bar):

image_foo tag_lorem 1
image_foo tag_ipsum 1
image_foo tag_dolor 0
image_foo tag_sit 0
image_foo tag_amit 1
image_bar tag_sit 1
image_bar tag_lorem 0
image_bar tag_dolor 1
image_bar tag_ipsum 1
image_bar tag_amit 0

p/s2: I don't know the full range of tag_* until I finish reading stdin.

p/s3: I don't understand why I got down-voted; if clarification is needed I am more than happy to provide it. I am not trying to make fun of anything or post nonsense here. I have re-written the question to make it sound more like a real problem. However, the script really doesn't have to care about what the input actually is or whether a database is used (the data is retrieved from an RDF data store, if you must know), because I want the script to be usable for other types of data as long as the input is in the right format (hence the original version of this question was very general).

p/s4: I am trying to avoid using arrays because I want to avoid out-of-memory errors as much as possible (if 745 lines explaining just 10 images expand into 550k lines, just imagine I have 100, 1000, or even 10000+ images).

p/s5: if you have an answer in another language, feel free to post it here. I have thought of solving this in Clojure but still couldn't find a way to do it properly.

Jeffrey04
  • Do you have to use STDIN, or can you pass a filename instead? With two passes over the data you could use the first to determine the number of rows and columns, and the second to match row/column instances. I presume that you don't want to read the whole file in at once? Also, is the input always ordered by row & column? – Mike Aug 16 '11 at 08:31
  • Will the data only ever comprise rows a-z and columns a-z, or can the values be different? You say that the input can get rather large, but 26 rows by 26 columns isn't all that large. – Mike Aug 16 '11 at 08:40
  • yeah, I am trying not to read the whole file in at once, to avoid a potential out-of-memory error :/ – Jeffrey04 Aug 16 '11 at 08:41
  • row_a -> row_z, or col_a -> col_z, are just arbitrary names; they are actually both URIs that are completely random (hence I don't mind the order, but it has to be consistent). – Jeffrey04 Aug 16 '11 at 08:43
  • So is it okay to pass a filename and open that file, rather than reading STDIN? The file can still be read line by line. And how do you determine what column a particular random URL should be in? 'column_b' can be easily mapped to column 2, but what determines the column for a URL? – Mike Aug 16 '11 at 08:48
  • @Mike 1. it is ok, but doesn't make much difference, does it? 2. the order of the URIs in each input line determines whether it is a row / column (in "uri_1 uri_2" order) – Jeffrey04 Aug 16 '11 at 08:55
  • @Mike let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2520/discussion-between-jeffrey04-and-mike) – Jeffrey04 Aug 16 '11 at 08:56
  • Do you have a database available? It would be much simpler/faster (even with SQLite). – XzKto Aug 16 '11 at 12:01
  • the data is actually from a database (it's an RDF-based data store; I did some work to format it into the input form above) – Jeffrey04 Aug 16 '11 at 13:38
  • That is not an answer to my question. Do you have a MySQL database or something? It would be really easy/fast/low-memory to do it with MySQL. – XzKto Aug 16 '11 at 14:03
  • the RDF storage engine is using MySQL as a backend, but I don't want to care about the database backend at all. All I need the script to do is read the friggin' input and vomit the output, that's all. – Jeffrey04 Aug 16 '11 at 14:35
  • Heh, no need to get aggressive. I was asking if we could use MySQL in the script that parses the data, not how you got that data. If that is possible then we could just insert all the data we need into MySQL tables to avoid large PHP arrays or temporary files. Ok, enough work time spent on this - I may provide a better solution tomorrow. – XzKto Aug 16 '11 at 15:04
  • I agree with @XzKto, and was thinking just the same thing. An RDBMS such as SQLite or MySQL would handle the data with ease. It should be possible to come up with a SELECT statement that produces the output you require. I experimented and came close, but ran out of time. If someone doesn't beat me to it, I'll have another go when I get a chance. – Mike Aug 17 '11 at 14:46

4 Answers


Sorry, maybe I misunderstood you - this looks too easy:

$stdin = fopen('php://stdin', 'r');
$columns_arr = array();
$rows_arr = array();

// Merge the full column list into the row's own columns, then print
// one "row column flag" line per column.
function set_empty_vals(&$value, $key, $columns_arr) {
    $value = array_merge($columns_arr, $value);
    ksort($value);
    foreach ($value AS $val_name => $flag) {
        echo $key.' '.$val_name.' '.$flag.PHP_EOL;
    }
    $value = NULL;
}

while ($line = fgets($stdin)) {
    $line = trim($line);
    list($row, $column) = explode(' ', $line);
    $row = trim($row);
    $column = trim($column);
    if (!isset($rows_arr[$row]))
        $rows_arr[$row] = array();
    $rows_arr[$row][$column] = 1; // mark this pair as assigned
    $columns_arr[$column] = 0;    // remember every column with a default flag of 0
}
array_walk($rows_arr, 'set_empty_vals', $columns_arr);

UPD:

1 million lines is easy for PHP:

$columns_arr = array();
$rows_arr = array();

function set_null_arr(&$value, $key, $columns_arr) {
    $value = array_merge($columns_arr, $value);
    ksort($value);
    foreach ($value AS $val_name => $flag) {
        //echo $key.' '.$val_name.' '.$flag.PHP_EOL; // output disabled for the benchmark
    }
    $value = NULL;
}

// Simulate 1,000,000 input lines: 100,000 rows x 10 columns each
for ($i = 0; $i < 100000; $i++) {
    for ($j = 0; $j < 10; $j++) {
        $row = 'row_foo'.$i;
        $column = 'column_ipsum'.$j;
        if (!isset($rows_arr[$row]))
            $rows_arr[$row] = array();
        $rows_arr[$row][$column] = 1;
        $columns_arr[$column] = 0;
    }
}
array_walk($rows_arr, 'set_null_arr', $columns_arr);

echo memory_get_peak_usage();

147 MB for me.

Last UPD: this is how I see a low-memory-usage (but rather fast) script:

//Approximate stdin buffer size, 1 MB should be good
define('MY_STDIN_READ_BUFF_LEN', 1048576);
//Approximate tmpfile buffer size, 1 MB should be good
define('MY_TMPFILE_READ_BUFF_LEN', 1048576);
//Custom stdin line delimiter (\r\n, \n, \r etc.)
define('MY_STDIN_LINE_DELIM', PHP_EOL);
//Custom tmpfile line delimiter - choose the smallest possible
define('MY_TMPFILE_LINE_DELIM', "\n");
//Custom output line delimiter - choose the smallest possible
define('MY_OUTPUT_LINE_DELIM', "\n");

//Print one "row column flag" line per known column for the given row
function my_output_arr($field_name, $columns_data) {
    ksort($columns_data);
    foreach ($columns_data AS $column_name => $column_flag) {
        echo $field_name.' '.$column_name.' '.$column_flag.MY_OUTPUT_LINE_DELIM;
    }
}

$tmpfile = tmpfile() OR die('Can\'t create/open temporary file!');
$buffer = '';
//I don't think there is a point to save the columns array in a file -
//it should be small enough to hold in memory.
$columns_array = array();

//Open stdin for reading
$stdin = fopen('php://stdin', 'r') OR die('Failed to open stdin!');

//Main stdin reading and tmp file writing loop.
//Using fread + explode + a big buffer showed a great performance boost
//in comparison with fgets().
while ($read_buffer = fread($stdin, MY_STDIN_READ_BUFF_LEN)) {
    $lines_arr = explode(MY_STDIN_LINE_DELIM, $buffer.$read_buffer);
    $lines_arr_size = count($lines_arr) - 1;
    //The last chunk may be a partial line - carry it over to the next read
    $buffer = $lines_arr[$lines_arr_size];
    for ($i = 0; $i < $lines_arr_size; $i++) {
        $line = trim($lines_arr[$i]);
        //There must be a space in each line - we split on it
        if (!strpos($line, ' '))
            continue;
        list($row, $column) = explode(' ', $line, 2);
        $columns_array[$column] = 0;
        //Save line in temporary file
        fwrite($tmpfile, $row.' '.$column.MY_TMPFILE_LINE_DELIM);
    }
}
//Flush a final partial line (input without a trailing delimiter),
//then reset the carry-over buffer before reading the temporary file
$line = trim($buffer);
if (strpos($line, ' ')) {
    list($row, $column) = explode(' ', $line, 2);
    $columns_array[$column] = 0;
    fwrite($tmpfile, $row.' '.$column.MY_TMPFILE_LINE_DELIM);
}
$buffer = '';
fseek($tmpfile, 0);

$cur_row = NULL;
$row_data = array();
while ($read_buffer = fread($tmpfile, MY_TMPFILE_READ_BUFF_LEN)) {
    $lines_arr = explode(MY_TMPFILE_LINE_DELIM, $buffer.$read_buffer);
    $lines_arr_size = count($lines_arr) - 1;
    $buffer = $lines_arr[$lines_arr_size];
    for ($i = 0; $i < $lines_arr_size; $i++) {
        list($row, $column) = explode(' ', $lines_arr[$i], 2);
        if ($row !== $cur_row) {
            //Row changed - output the previous row merged with the full column list
            if ($cur_row !== NULL)
                my_output_arr($cur_row, array_merge($columns_array, $row_data));
            $cur_row = $row;
            $row_data = array();
        }
        $row_data[$column] = 1;
    }
}

//Output the last row
if (count($row_data) && $cur_row !== NULL) {
    my_output_arr($cur_row, array_merge($columns_array, $row_data));
}
XzKto
  • of course it is easy if I only have that much data :) the data basically comes from STDIN (please read my poorly implemented code if you want to know more), and the amount of data can range from 10 rows to 1 mil rows. Chances are the amount of data will kill the script if I am not careful enough. – Jeffrey04 Aug 16 '11 at 11:38
  • Can you just increase the memory_limit setting? How big is your data, and is it available in a file (not stdin)? How many different columns may there be (can the columns array cause OOM)? Please update your main post to make it all clearer - we are not psychics. – XzKto Aug 16 '11 at 11:46
  • memory_limit is currently set to -1; the data can get as big as 1 mil lines. Just assume the input is in the fixed format: row col – Jeffrey04 Aug 16 '11 at 13:37
  • I updated my post to show you that PHP is not scared of 1 million lines. Or do you insist on using a cache? Add it to the question. – XzKto Aug 16 '11 at 14:01
  • I am caching the input to some temp files (please read my answer); I have not checked how much memory it consumes though (I am only using a sample of ~800 lines), will let you know later. – Jeffrey04 Aug 16 '11 at 14:05
  • input 745 lines => output 555025 lines, memory usage ~0.4 MB using my solution; I can't use your example code directly (because you are not reading from stdin/file), so I can't compare – Jeffrey04 Aug 16 '11 at 14:11
  • assuming the answer is correct, memory usage is about 0.74 MB for the same set of input; I will try a bigger sample later tonight to see how scalable it is – Jeffrey04 Aug 16 '11 at 14:39

Here's a MySQL example that works with your supplied test data:

CREATE TABLE `url` (
  `url1` varchar(255) DEFAULT NULL,
  `url2` varchar(255) DEFAULT NULL,
  KEY `url1` (`url1`),
  KEY `url2` (`url2`)
);

INSERT INTO url (url1, url2) VALUES
('image_a', 'tag_lorem'),
('image_a', 'tag_ipsum'),
('image_a', 'tag_amit'),
('image_b', 'tag_sit'),
('image_b', 'tag_dolor'),
('image_b', 'tag_ipsum');

  SELECT url1, url2, assigned FROM (
  SELECT t1.url1, t1.url2, 1 AS assigned
    FROM url t1
   UNION
  SELECT t1.url1, t2.url2, 0 AS assigned
    FROM url t1
    JOIN url t2
      ON t1.url1 != t2.url1
    JOIN url t3
      ON t1.url1 != t3.url1
     AND t1.url2  = t3.url2
     AND t2.url2 != t3.url2 ) tmp
ORDER BY url1, url2;

Result:

+---------+-----------+----------+
| url1    | url2      | assigned |
+---------+-----------+----------+
| image_a | tag_amit  |        1 |
| image_a | tag_dolor |        0 |
| image_a | tag_ipsum |        1 |
| image_a | tag_lorem |        1 |
| image_a | tag_sit   |        0 |
| image_b | tag_amit  |        0 |
| image_b | tag_dolor |        1 |
| image_b | tag_ipsum |        1 |
| image_b | tag_lorem |        0 |
| image_b | tag_sit   |        1 |
+---------+-----------+----------+

This should be simple enough to convert to SQLite, so if required you could use PHP to read the data into a temporary SQLite database, and then extract the results.
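
For illustration, here is a minimal sketch of that SQLite route, assuming the PDO SQLite driver is available. The table layout mirrors the MySQL example above, but the LEFT JOIN formulation is my own substitute for the UNION query, not a tested equivalent:

<?php
// Load the pairs from stdin into an in-memory SQLite table,
// then emit the full image x tag matrix with a 1/0 flag.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE url (url1 TEXT, url2 TEXT)');

$insert = $db->prepare('INSERT INTO url (url1, url2) VALUES (?, ?)');
$db->beginTransaction(); // batch the inserts for speed
$stdin = fopen('php://stdin', 'r');
while ($line = fgets($stdin)) {
    $line = trim($line);
    if ($line === '' || strpos($line, ' ') === false) continue;
    $insert->execute(explode(' ', $line, 2));
}
$db->commit();

// Cross join the distinct images with the distinct tags and flag
// the combinations that actually exist in the input.
$sql = 'SELECT i.url1, t.url2,
               CASE WHEN u.url1 IS NULL THEN 0 ELSE 1 END AS assigned
          FROM (SELECT DISTINCT url1 FROM url) i
    CROSS JOIN (SELECT DISTINCT url2 FROM url) t
     LEFT JOIN url u ON u.url1 = i.url1 AND u.url2 = t.url2
      ORDER BY i.url1, t.url2';
foreach ($db->query($sql) as $row) {
    echo $row['url1'].' '.$row['url2'].' '.$row['assigned']."\n";
}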

Mike

Put your input data in an array and sort it using usort, defining a comparison function that compares elements by row value, and then by column value when the row values are equal. A sketch follows below.
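
A minimal sketch of that idea (my own illustration, not the asker's code), assuming each element is an array($image, $tag) parsed from one input line:

$pairs = array(
    array('image_b', 'tag_sit'),
    array('image_a', 'tag_lorem'),
    array('image_a', 'tag_ipsum'),
);

// Compare by image first, then by tag when the images are equal.
function compare_pairs($a, $b) {
    $cmp = strcmp($a[0], $b[0]);
    return $cmp !== 0 ? $cmp : strcmp($a[1], $b[1]);
}

usort($pairs, 'compare_pairs');
// $pairs is now image_a/tag_ipsum, image_a/tag_lorem, image_b/tag_sit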

codez
  • I know I am doing premature optimization here, but if the input is sufficiently large, wouldn't it kill the script (given memory is limited)? – Jeffrey04 Aug 16 '11 at 08:40
  • I doubt you can do any better sort *with php*, since that is a native function and hence works faster and takes less memory. – Smar Aug 16 '11 at 09:24
  • well, the answer is not really relevant to the question; the order of the columns is not important, but it has to be consistent (I have posted an inefficient version of the script, please refer to it if you are interested). Anyway, thanks for the help – Jeffrey04 Aug 16 '11 at 14:00

This is my current implementation. I don't like it, but it does the job for now.

#!/usr/bin/env php
<?php

define('CACHE_MATCH', 0);
define('CACHE_COLUMN', 1);

define('INPUT_ROW', 0);
define('INPUT_COLUMN', 1);
define('INPUT_COUNT', 2);

output_expanded_entries(
    cache_input(array(tmpfile(), tmpfile()), STDIN, fgets(STDIN))
);
echo memory_get_peak_usage();

// First pass: recursively read stdin line by line, writing every raw pair
// to the match cache file and every distinct column to the column cache file.
function cache_input(Array $cache_files, $input_pointer, $input) {
    if(count($cache_files) != 2) {
        throw new Exception('$cache_files requires 2 file pointers');
    }

    if(feof($input_pointer) == FALSE) {
        cache_match($cache_files[CACHE_MATCH], trim($input));
        cache_column($cache_files[CACHE_COLUMN], process_line($input));

        cache_input(
            $cache_files,
            $input_pointer,
            fgets($input_pointer)
        );
    }

    return $cache_files;
}

// Record a column in the column cache file unless it is already known
// (checked against the in-memory cache first, then the file).
function cache_column($cache_column, $input) {
    if(empty($input) === FALSE) {
        rewind($cache_column);

        $column = get_field($input, INPUT_COLUMN);

        if(column_cached_in_memory($column) === FALSE && column_cached_in_file($cache_column, fgets($cache_column), $column) === FALSE) {
            fputs($cache_column, $column . PHP_EOL);
        }
    }
}

function cache_match($cache_match, $input) {
    if(empty($input) === FALSE) {
        fputs($cache_match, $input . PHP_EOL);
    }
}

// Recursively scan the column cache file for $column.
function column_cached_in_file($cache_column, $current, $column, $result = FALSE) {
    return $result === FALSE && feof($cache_column) === FALSE ?
        column_cached_in_file($cache_column, fgets($cache_column), $column, $column == $current)
        : $result;
}

// Rolling in-memory cache of the 500 most recently seen columns,
// which avoids re-scanning the cache file for every input line.
function column_cached_in_memory($column) {
    static $local_cache = array(), $index = 0, $count = 500;

    $result = TRUE;

    if(in_array($column, $local_cache) === FALSE) {
        $result = FALSE;

        $local_cache[$index++ % $count] = $column;
    }

    return $result;
}

// Second pass: replay the match cache and, for each row, merge its assigned
// columns into the full column list before printing.
function output_expanded_entries(Array $cache_files) {
    array_map('rewind', $cache_files);

    for($current_row = NULL, $cache = array(); feof($cache_files[CACHE_MATCH]) === FALSE;) {
        $input = process_line(fgets($cache_files[CACHE_MATCH]));

        if(empty($input) === FALSE) {
            if($current_row !== get_field($input, INPUT_ROW)) {
                output_cache($current_row, $cache);

                $cache = read_columns($cache_files[CACHE_COLUMN]);
                $current_row = get_field($input, INPUT_ROW);
            }

            $cache = array_merge(
                $cache,
                array(get_field($input, INPUT_COLUMN) => get_field($input, INPUT_COUNT))
            );
        }
    }

    output_cache($current_row, $cache);
}

function output_cache($row, $column_count_list) {
    if(count($column_count_list) != 0) {
        printf(
            '%s %s %s%s',
            $row,
            key(array_slice($column_count_list, 0, 1)),
            current(array_slice($column_count_list, 0, 1)),
            PHP_EOL
        );

        output_cache($row, array_slice($column_count_list, 1));
    }
}

function get_field(Array $input, $field) {
    $result = NULL;

    if(in_array($field, array_keys($input))) {
        $result = $input[$field];
    } elseif($field == INPUT_COUNT) {
        $result = 1;
    }

    return $result;
}

function process_line($input) {
    $result = trim($input);

    return empty($result) === FALSE && strpos($result, ' ') !== FALSE ?
        explode(' ', $result)
        : NULL;
}

// Note: this helper is not called anywhere in the current flow.
function push_column($input, Array $result) {
    return empty($input) === FALSE && is_array($input) ?
        array_merge(
            $result,
            array(get_field($input, INPUT_COLUMN))
        )
        : $result;
}

function read_columns($cache_columns) {
    rewind($cache_columns);

    $result = array();

    while(feof($cache_columns) === FALSE) {
        $column = trim(fgets($cache_columns));

        if(empty($column) === FALSE) {
            $result[$column] = 0;
        }
    }

    return $result;
}

EDIT: yesterday's version was buggy :/

Jeffrey04
  • You should use loops instead of recursion - this is bad practice and uses much more memory and CPU than necessary. And you shouldn't try to place all your logic on one line - it is much harder to read this way, while you gain only a marginal increase in speed. That lonely exception looks strange too. Btw, did you test my code or should I just leave this thread? "Premature optimization is the root of all evil." – XzKto Aug 18 '11 at 13:21
  • yeah, I tried; mine is only a couple of megabytes more efficient than your version, so I would probably just use yours. (yes, premature optimization = root of all evil) – Jeffrey04 Aug 22 '11 at 06:38