-1

I have two vectors vector<DataPoint> dataand vector<string> labels where DataPoint is just a vector of float: typedef vector<float> DataPoint. Each datapoint data[i] has its associated label labels[i].

Is there any way to get the label of a given datapoint x quickly ? Something like string getLabel(DataPoint x){..} which is fast.

shn
  • 5,116
  • 9
  • 34
  • 62
  • What does it mean "fast"? – ForEveR Oct 28 '13 at 11:41
  • @ForEveR I mean, not browsing the vector `data` in order to search the index `i` of `x`, and then the label is `labels[i]`. This is not fast. – shn Oct 28 '13 at 11:43
  • @shn: I think you meant: `DataPoint` is just a vector of floats, no? – lolando Oct 28 '13 at 11:45
  • @lolando yes I think it is clear from `typedef vector DataPoint` – shn Oct 28 '13 at 11:46
  • [This thread](http://stackoverflow.com/a/15652254/1655939) might be of some help: hash each vector of floats, and then insert each resulting value `hash(data[i])` associated with a `labels[i]` in an `std::unordered_map`, as in `table[hash(data[i])] = labels[i]`. This will grant you `O(1)` search. – Rubens Oct 28 '13 at 11:55

3 Answers3

2

The best you can hope to find your DataPoint's index in data is O(log(n)) complexity (using a binary search) if your data vector is sorted. Otherwise that's a linear search in O(n).

The crux of the problem is that you have two vectors that contain related data, which is always a pain to manage (and a strong hint of bad design). Best replace both vectors with a vector<LabeledDataPoints> (a structure with two members: a DataPoint and a string).

syam
  • 14,701
  • 3
  • 41
  • 65
  • sorted ? data is a vector of vectors of float, it cannot be "sorted". In python for instance if I create a dictionary associating data to labels I can search for a label of a given datapoint in O(1). I don't know how to do it in C++. – shn Oct 28 '13 at 11:47
  • @shn: That is because a dictionary in python is more like an `std::unordered_map` in C++. – Nobody moving away from SE Oct 28 '13 at 11:49
  • @shn There's absolutely no reason why it can't be sorted. Anyway, what you're describing is a `std::map` (or `unordered_map`) not a `vector`. You'll have to provide a hash function for `DataPoint`, too. Still, I stand by my advice of a single vector containing `LabeledDataPoint` structures. – syam Oct 28 '13 at 11:50
  • @syam I'm asking how to create such a map/unordered_map from my 2 vectors data and labels. – shn Oct 28 '13 at 11:51
  • 2
    @shn Then please make it clear in your question by editing it. As far as I understand it as it is, you're asking for O(1) retrieval in `vector` which is not possible. After your edit I will then delete my answer since it will have become irrelevant. – syam Oct 28 '13 at 11:53
  • 1
    syam: "a single vector containing LabeledDataPoint structures" - that's orthogonal to the issue... once the index is known, finding the label was O(1) anyway. Whether to sort the vector or move to another data structure is the meaningful issue. Anyway, O(log2N) is fast enough until proven otherwise and you've mentioned sorting, so +1 from me. – Tony Delroy Oct 28 '13 at 11:58
  • How would you sort vectors and and find what I search in O(log(n)) ? – shn Oct 28 '13 at 12:10
  • @shn You could sort first by `DataPoint`'s size then by first element of the `DataPoint` then by second element... Of course now there's the problem of having to mirror the ordering in the `labels` vector which is a headache and brings us back to a single vector of `LabeledDataPoints`. As to searching, use a binary search. – syam Oct 28 '13 at 12:19
0

A few notes: you can sort a vector with std::sort(), and search a pre-sorted vector with std::binary_search(), std::unordered_map is the C++11 hash table, std::map is a binary tree, which you could use for insertion-time sorting with O(log2N) lookup, insertion and erase. Google any of them for docs.

With your existing data structures, if dataPoint is pre-sorted, then you have O(log2N) where N is dataPoint.size(), and assuming that on average unequal dataPoints comparison need only compare the first float or two. Unsorted, it's O(N).

Clearly the performance issue is not having to look in labels after the common index is known - it's just finding out what that index is, given a dataPoint object outside the data vector.

If sorting is undesirable or O(log2N) is still too slow, you could consider putting the dataPoints into a hash table with their label.

In the unlikely case that the performance issue is only due to your dataPoints regularly starting with the same leading sequence of floats, then (assuming no trivial solution like comparing from the back of the vector towards the front) you could create some kind of hash or sum of the elements to compare first, only doing a float-by-float comparison if that's already known equal.

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
0

Old Answer (it is about getting the values (DataPoint instances) easily):


Why don't you use a Map, using labels as keys and DataPoint as values(map)? In this way, you will have the data associated, and depeding on the map type, you can have differentiations on complexities(using a map, will have a lookup complexity O(logn), while a hashmap will have O(1) expected and O(n) worst case). Use what works for you better. For more information on maps and their complexities, look here too: multiset, map and hash map complexity


UPDATE:

To get the label for each DataPoint, one idea is to create a separate class (for example DataContainer) that contains as private members your vector of DataPoint instances and a string that contains your label with the appropriate setters/getters.

class DataContainer{
  private:
    DataPoint mDataPoint;
    string mLabel;

  public:
    DataContainer(DataPoint dataPoint,string label): 
      mDataPoint(dataPoint), mLabel(label){}

    void setDataPoint(DataPoint dataPoint){
      mDataPoint = dataPoint;
    }

    void setLabel(string label){
      mLabel = label;
    }

    DataPoint getDataPoint(){
      return mDataPoint;
    }

    //This getter does the job, with O(1) complexity.
    string getLabel(){
      return mLabel;
    }
  }

This way, you can put your DataContainer in any structure you want (and I suggest map in the case you want to get the keys similarily: map), setting the label on instantiation and getting it with the getter method with O(1) complexity. As you can see, your question needs to be approached differently, and there are some ways to do it.

Community
  • 1
  • 1
Nick Louloudakis
  • 5,856
  • 4
  • 41
  • 54
  • I have `vector< vector > data`, and `vector label;` A given user send me one data: `vector x`, and I should return the corresponding label to him. So Map cannot hold. A `unordered_map, int>` can be convenient, but I don't use C++11 – shn Oct 28 '13 at 12:09
  • or just map, string> will work I guess. – shn Oct 28 '13 at 12:15
  • The way maps are used is map. So, you can do map>. The API of map in STL gives you the opportunity to retrieve values using key with get method I think. http://www.cplusplus.com/reference/map/map/ – Nick Louloudakis Oct 28 '13 at 12:29
  • by retrieving values you will browse them to search for the value you want and get the corresponding string key. Same problem .. – shn Oct 28 '13 at 12:35
  • Answer Updated. Please check it. – Nick Louloudakis Oct 28 '13 at 14:12