Given input, which shows tag assignments to images, as follows (reading this from php://stdin line by line, as the input can get rather large)
image_a tag_lorem
image_a tag_ipsum
image_a tag_amit
image_b tag_sit
image_b tag_dolor
image_b tag_ipsum
... (there are more lines, may get up to a million)
Output of the input is shown as follows. Basically it is the same format with another entry showing whether the image-tag combination exists in input. Note that for every image, it will list all the available tags and show whether the tag is assigned to the image by using 1/0 at the end of each line.
image_a tag_sit 0
image_a tag_lorem 1
image_a tag_dolor 0
image_a tag_ipsum 1
image_a tag_amit 1
image_b tag_sit 1
image_b tag_lorem 0
image_b tag_dolor 1
image_b tag_ipsum 1
image_b tag_amit 0
... (more)
I have posted my no-so-efficient solution down there. To give a better picture of input and output, I fed 745 rows (which explains tag assignment of 10 images) into the script via stdin, and I receive 555025 lines after the execution of the script using about 0.4MB of memory. However, it may kill the harddisk faster because of the heavy disk I/O activity (while writing/reading to the temporary column cache file).
Is there any other way of doing this? I have another script that can turn the stdin into something like this (not sure if this is useful)
image_foo tag_lorem tag_ipsum tag_amit
image_bar tag_sit tag_dolor tag_ipsum
p/s: order of tag_* is not important, but it has to be the same for all rows, i.e. this is not what i want (notice the order of tag_* is inconsistent for both tag_a and tag_b)
image_foo tag_lorem 1
image_foo tag_ipsum 1
image_foo tag_dolor 0
image_foo tag_sit 0
image_foo tag_amit 1
image_bar tag_sit 1
image_bar tag_lorem 0
image_bar tag_dolor 1
image_bar tag_ipsum 1
image_bar tag_amit 0
p/s2: I don't know the range of tag_* until i finish reading stdin
p/s3: I don't understand why I get down-voted, if clarification is needed I am more than happy to provide them, I am not trying to make fun of something or posting nonsense here. I have re-written the question again to make it sound more like a real problem (?). However, the script really doesn't have to care about what the input really is or whether database is used (well, the data is retrieved from an RDF data store if you MUST know) because I want the script to be usable for other type of data as long as the input is in right format (hence the original version of this question was very general).
p/s4: I am trying to avoid using array because I want to avoid out of memory error as much as possible (if 745 lines expaining just 10 images will be expanded into 550k lines, just imagine I have 100, 1000, or even 10000+ images).
p/s5: if you have answer in other language feel free to post it here. I have thought of solving this using clojure but still couldn't find a way to do it properly.