1

I have some problems in final part of Huffman encoding.

Currently I have my coding table in cell array

code =
{
  [1,1] = 000
  [1,2] = 001
  [1,3] = 010
  [1,4] = 011
  [1,5] = 100
  ...
}

Where second index represent ascii character in my other cell array

huffman_tree =
{
  [1,1] = A
  [1,2] = B
  [1,3] = C
  [1,4] = D
  [1,5] = E
  ...
}

I'm using following code for encoding input to output:

output= [];
for i=1:length(input)
    x = findInArray(huffman_tree, input(i));
    output= [output code(x)];
end


function [index] = findInArray(array, searched)
    index = -1;
    for i=1:length(array)
        if array{i} == searched
            index = i;
        end
    end
end

At this point my code is O(n^2) or even worse. I'm having problem with large input where

length(input) = 1000000

There must be some faster way to transform input with my coding table to output.

matox
  • 885
  • 11
  • 27
  • Could you post runnable code to get `code`, because `[1,1] = 000` isn't recognized in MATLAB? Also, if you meant `(1,1) = '000'`, then `array{i} == searched` inside the function won't work. – Divakar Apr 08 '15 at 05:27
  • @Divakar elements in code array are binary numbers. Also I'm passing huffman_tree array in findInArray() function. – matox Apr 08 '15 at 06:59
  • That's fine, but the format of data in that cell array doesn't work well. The idea is to provide us with sample input data that we can copy and paste at our ends and reproduce them. – Divakar Apr 08 '15 at 07:06
  • I see. You can recreate `code` array with following command `code{i} = dec2bin(0,3);` and so on. – matox Apr 08 '15 at 07:26
  • How can we recreate `huffman_tree`? – Divakar Apr 08 '15 at 08:12
  • `huffman_tree` is cell array of unique characters from input. You can hardcode it like this `huffman_tree{1} = "A";` ... – matox Apr 08 '15 at 08:38
  • I tried that with `huffman_tree{1} = 'A';` and so on and then I used `input = code` and ran the code and threw this error - `"Undefined operator '==' for input arguments of type 'cell'.` inside the function `findInArray`. – Divakar Apr 08 '15 at 08:45
  • `input` must be something like `input = "A STRING"` Example: http://imgur.com/4naaGv4 – matox Apr 08 '15 at 08:57

1 Answers1

1

Because you're using cell arrays, that's going to be inherently slow so you have no choice but to iterate over each cell. However, I can provide some suggestions to help speed things up. What you can do is use strcmp to compare strings. I'm assuming that each character in your cell array is represented as a one character string. strcmp has the ability to take an individual string and compare itself to a cell array of strings. The output will be an array that is the same size as the cell array of strings and give you a logical true if the input string matches a position in the cell array and false otherwise.

Because your Huffman dictionary will contain a unique set of characters, you will only get one possible match per character. Therefore, we can use this logical array output to index the codebook directly to retrieve the corresponding code you want. Logical indexing works by supplying a logical vector that is the same length as the vector of interest and it retrieves those values whose corresponding positions are true. Therefore, if we only had one true value in the logical vector with the rest of the elements being false, this means that we would get just the corresponding element we desire and nothing else.

Therefore, we can change your code to do this. Note that I've changed the loop counter i to idx because it has actually been shown that using i as a loop counter slows down your code by a slight amount. See this post by Shai Bagon for more details: Using i and j as variables in Matlab. Also, I've changed the length call to numel... mainly because I don't like using length.... just a personal choice though.

output = [];
for idx = 1:numel(input)
    output = [output code(strcmp(input(idx), huffman_tree))];
end

Give the above a whirl and see if it performs any faster. For one thing, this will escape using an additional for loop for searching for a match as strcmp is very efficiently implemented, so the above code won't be O(n^2), but could be slightly better than quadratic.

Community
  • 1
  • 1
rayryeng
  • 102,964
  • 22
  • 184
  • 193
  • Thanks for great response. :) I'm going to try it in a few hours and post results. – matox Apr 08 '15 at 01:46
  • @matox - Sounds good! Let me know. I've also made some slight changes to make it faster. Please make sure you see the updated answer. You're also very welcome :) – rayryeng Apr 08 '15 at 01:48
  • 1
    Solution runs better on big input. Thanks. I'll try to use both arrays in same matrix and try with `find()` command. – matox Apr 08 '15 at 07:29
  • @matox - Sounds good. How much faster are we talking? Because you have cell arrays, you don't have a choice but to do this with some sort of loop. I've also edited the post so that I don't use `max` and I'm using only pure logical indexing. That may give you more of an increased speedup. – rayryeng Apr 08 '15 at 07:30
  • I'd say encoding is about 30% faster. But I'm not satisfied with performance, I'd rather use matrix to avoid cell arrays. – matox Apr 08 '15 at 07:39
  • 1
    @matox - I couldn't agree with you more.... but because there is wide variability in symbol length with Huffman encoding, it looks like cell arrays are the best bet for now. – rayryeng Apr 08 '15 at 07:39