I've created a template function defined as

template < typename _Iter8, typename _Iter32 >
int Utf8toUtf32 ( const _Iter8 & _from, const _Iter8 & _from_end, _Iter32 & _dest, const _Iter32 & _dest_end );

Edited: first parameter to be a const type.

The first and third parameters change to reflect their new position. The second and fourth parameters mark the upper boundary of iteration.

I'm hoping to implement a 'one function fits all' logic. The only stipulation is that the two iterators of each _Iter type are of the same type and are dereferenceable. I also want the template parameters to be deducible.

The first problem I encountered was

char utf8String [] = "...some utf8 string ...";
wchar_t wideString [ 100 ];
char * pIter = utf8String;
Utf8toUtf16( pIter, pIter + n, wideString, wideString + 100 );

The _Iter16 is ambiguous. I'm guessing because the compiler sees the third parameter as a wchar_t[ 100 ] type and the fourth as a wchar_t* type. Correct me if I'm wrong. Changing the code to:

Utf8toUtf16( pIter, pIter + n, (wchar_t*)wideString, wideString + 100 );

Fixes the problem. Ugly but works.

Then I hit another problem:

unsigned long nCodepoint;
Utf8toUtf32( pIter, pIter + n, &nCodepoint, &nCodepoint + 1 );

Obviously, if I changed nCodepoint to be an array type and applied the same cast as the first, it would compile.

I'm not sure if I defined the template parameters wrong. My question is how do I correctly code this given my constraints above, and is there a way to do it without resorting to casts?

Edit: As Jogojapan and DyP pointed out below, the above cast shouldn't compile. I should have instead created a new pointer to the front of the array and passed that in. As for the nCodepoint, I may have to create it as a length 1 array.

Twifty
  • As an aside: names like `_Iter8` and `_Iter32` (leading underscore, followed by an underscore or a capital letter) are reserved, so your code has undefined behavior. – Jerry Coffin Jul 11 '13 at 06:40
  • C++11? Use [`std::begin`](http://en.cppreference.com/w/cpp/iterator/begin) and [`std::end`](http://en.cppreference.com/w/cpp/iterator/end) (even for arrays). Not C++11? Write your own. – dyp Jul 11 '13 at 06:43
  • The second problem is caused by the fact that you pass your iterators as references. Is there a reason why you do this? Passing iterators by copy should be fine. – jogojapan Jul 11 '13 at 06:43
  • @jogojapan I think the first problem is caused by passing by ref, too. If the template parameters were not references, the array type should decay to a pointer. – dyp Jul 11 '13 at 06:45
  • @JerryCoffin are they really? I would have thought that they were restricted to the scope of the method/class. – Twifty Jul 11 '13 at 06:45
  • @DyP I didn't tag the question as c++11, also since begin() and end() both return the same type of iterator there is no ambiguity. Passing by reference because the _from iterator will be changed. Sure the _from_end can be passed by copy. – Twifty Jul 11 '13 at 06:46
  • @Waldermort: leading underscore followed by lower case is okay within the scope of a class, but leading underscore followed by upper case is always reserved for the implementation. Most people consider it better to just avoid leading underscores entirely (and I tend to agree). – Jerry Coffin Jul 11 '13 at 06:47
  • @DyP Actually, yes, coming to think of it. I think both problems are caused by the references. – jogojapan Jul 11 '13 at 06:48
  • @JerryCoffin Learn something new everyday, as the saying goes. Thanks for pointing that out. – Twifty Jul 11 '13 at 06:50
  • @jogojapan ahh, sorry. – Twifty Jul 11 '13 at 06:50
  • To elaborate on @JerryCoffin's comment, [reserved identifier](http://stackoverflow.com/questions/228783/what-are-the-rules-about-using-an-underscore-in-a-c-identifier). That lists the few specific rules. – chris Jul 11 '13 at 06:50
  • @Waldermort If you want to make `_from_end` refer to another element, you cannot pass an array, but had to create another pointer to pass to your function. If you want to change the object `_from_end` refers to, don't take a ref but by-value (just like with pointers). – dyp Jul 11 '13 at 06:54
  • @jogojapan changing the second and fourth to pass by copy makes no difference. Still getting an error on the third parameter not being convertible from unsigned long* to unsigned long*& – Twifty Jul 11 '13 at 06:55
  • @Waldermort, What makes the first and third special? – chris Jul 11 '13 at 06:57
  • @Waldermort Yes, I meant changing all to pass-by-copy. Why the second and fourth only? (In the second problem you describe, `&nCodepoint` is the address of a local variable (i.e. an rvalue); you can't pass that by reference.) – jogojapan Jul 11 '13 at 06:58
  • @DyP I'm not trying to pass an array. In my first example, I noted that the compiler thinks I'm trying to pass an array but casting it to a pointer fixes the problem. – Twifty Jul 11 '13 at 07:00
  • @jogojapan The first and third will be updated to reflect their new position, so must be passed by reference. The second and fourth only mark the upper boundary of the iteration and will not change or even be dereferenced. – Twifty Jul 11 '13 at 07:02
  • Then again, you might have reasons why you pass them as references. Maybe to deal with errors during the UTF8->UTF32 conversion: When an error occurs, the function returns and `pIter` points to the position after the last character that could be successfully converted, etc. In that case you may want to keep the references, but then you just simply cannot pass `&nCodepoint` as argument. – jogojapan Jul 11 '13 at 07:02
  • @chris edited post to point that out. – Twifty Jul 11 '13 at 07:06
  • @jogojapan ok, changing that nCodepoint to pCodepoint[1] is not too big of a problem since it's only used inside the Utf8toUtf16 function. I'm still left with the ugly casts though. – Twifty Jul 11 '13 at 07:08
  • The cast to `wchar_t*` should NOT solve the problem IMO. Casting an array type to a pointer type yields a prvalue, and you cannot initialize a non-const lvalue ref (of non-class type) with a prvalue. – dyp Jul 11 '13 at 07:10
  • @DyP exactly! I want to not use the cast but also ensure that both types are the same. Passing by pointer is not an option here since it also needs to handle std::iterators. – Twifty Jul 11 '13 at 07:13
  • @Waldermort The problem isn't that both types aren't the same. The problem is that the first of the two (i.e. your third argument) is not modifiable. You cannot modify the address of a local array. I'm surprised your compiler accepts the cast-based solution (GCC does not). – jogojapan Jul 11 '13 at 07:14
  • @Waldermort I actually meant "Fixes the problem. Ugly but works." <- it should NOT work, as far as I understand the Standard (and as far as I could test it). It's weird that it works for your case. I think the best solution would be *not* to modify the iterators, but to return a set of new iterators. – dyp Jul 11 '13 at 07:15
  • Ok, I'm seeing what you're both saying about the third argument. I agree my compiler, vs2010, should have pointed that out. But it's not good with C style casts. Let me correct that and try again. – Twifty Jul 11 '13 at 07:20
  • @jogojapan Write that up as an answer and I'll give you the credit. – Twifty Jul 11 '13 at 07:25
  • @Waldermort I've added it to the community-owned answer. (It's not really a solution -- just pointing out the fact that reference-passing _and_ passing rvalues cannot be combined in one solution.) – jogojapan Jul 11 '13 at 07:37
  • @jogojapan ok. Thanks for the help everybody. Now it's working as should. – Twifty Jul 11 '13 at 07:45
  • @Waldermort I've updated the answer once more with a possible solution based on overloading the template (in case you haven't tried that yet). This will work regardless of whether you pass rvalues or lvalues. In case of rvalues, it will obviously not modify the original, though. – jogojapan Jul 11 '13 at 08:10
  • @jogojapan Thanks again. Some really great alternatives there. – Twifty Jul 11 '13 at 08:29

1 Answer

As jogojapan actually gave the answer, I'll make this a community wiki.

IMO, this is an adequate solution:

template < typename Iter8, typename Iter32 >
Iter32 Utf8toUtf32(Iter8 _from, Iter8 _from_end, Iter32 _dest, Iter32 _dest_end);

This is intended to return what you wanted _dest to change to.

If you really also need to return an int, you could return a pair.

To reflect which iterators are to be read from, and which are to be written to, you could use a naming scheme for the template parameters, e.g. InputIterator8 and OutputIterator32.


To give an analogy from a function of the Standard Library:

std::vector<int> v = {1,2,3,4};
for(auto i = v.begin(); i != v.end();)
{
    if(*i == 2)
    {
        i = v.erase(i);  // iterator invalidated and new "next" iterator returned
    }
    else
    {
        ++i;  // nothing erased: advance manually, otherwise the loop never terminates
    }
}

If you want your function a) to accept arrays and b) to be similar to Standard Library functions, I don't see any other way but to return the "changed" iterators. The only Library function I know that actually changes the iterator passed is std::advance.
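To illustrate the contrast mentioned above, here is a minimal sketch comparing `std::advance`, which modifies the iterator it receives, with `std::next` (C++11), which leaves its argument alone and returns a new iterator. The function name `advance_vs_next` is made up for this illustration.

```cpp
#include <iterator>
#include <vector>

// Hypothetical demo function: returns 0 if the observed behaviour matches
// the description, 1 otherwise.
int advance_vs_next()
{
    std::vector<int> v;
    v.push_back(10);
    v.push_back(20);
    v.push_back(30);

    std::vector<int>::iterator a = v.begin();
    std::advance(a, 2);  // modifies a in place; a now refers to 30

    std::vector<int>::iterator b = v.begin();
    std::vector<int>::iterator c = std::next(b, 2);  // b is untouched; c refers to 30

    return (*a == 30 && *b == 10 && *c == 30) ? 0 : 1;
}
```

The `std::next` style is the one the by-value signature above follows: the caller keeps the original iterator and receives the new position as a return value.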

Example:

#include <iterator>  // std::begin, std::end
#include <tuple>     // std::tuple, std::tie

template < typename Iter8, typename Iter32 >
std::tuple<int, Iter8, Iter32> Utf8toUtf32(Iter8 _from, Iter8 _from_end,
                                           Iter32 _dest, Iter32 _dest_end);

char utf8String [] = "...some utf8 string ...";
wchar_t wideString [ 100 ];
char* pUtf8Res = nullptr;
wchar_t* pUtf16Res = nullptr;
int res = 0;
// std::begin/std::end on an array return plain pointers, so each pair deduces one type
std::tie(res, pUtf8Res, pUtf16Res) = Utf8toUtf32( std::begin(utf8String), std::end(utf8String),
                                         std::begin(wideString), std::end(wideString) );
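For completeness, one way the by-value, tuple-returning signature could be filled in is sketched below. The decoding logic and the status codes (0, -1, -2) are assumptions made for this illustration, not the asker's implementation; the sketch handles well-formed 1- to 4-byte sequences only and does not reject overlong encodings or surrogates. The helper `demo_tuple_version` is likewise hypothetical.

```cpp
#include <tuple>

// Returns (status, next input position, next output position).
// Assumed status codes: 0 = all input consumed, -1 = malformed or truncated
// sequence, -2 = destination full.
template <typename Iter8, typename Iter32>
std::tuple<int, Iter8, Iter32> Utf8toUtf32(Iter8 from, Iter8 from_end,
                                           Iter32 dest, Iter32 dest_end)
{
    while (from != from_end) {
        if (dest == dest_end)
            return std::make_tuple(-2, from, dest);

        Iter8 it = from;  // work on a copy so 'from' stays at the last good position
        const unsigned char lead = static_cast<unsigned char>(*it++);
        unsigned long cp;
        int extra;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else return std::make_tuple(-1, from, dest);

        for (; extra > 0; --extra) {
            if (it == from_end)
                return std::make_tuple(-1, from, dest);  // truncated sequence
            const unsigned char c = static_cast<unsigned char>(*it++);
            if ((c & 0xC0) != 0x80)
                return std::make_tuple(-1, from, dest);  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        *dest++ = cp;
        from = it;
    }
    return std::make_tuple(0, from, dest);
}

// Usage: arrays can be passed directly because the parameters are taken by value.
inline bool demo_tuple_version()
{
    const char utf8[] = "\x41\xC3\xA9\xE2\x82\xAC";  // "A", U+00E9, U+20AC
    unsigned long out[4] = {0, 0, 0, 0};
    int status = 0;
    const char* in_pos = nullptr;
    unsigned long* out_pos = nullptr;
    std::tie(status, in_pos, out_pos) = Utf8toUtf32(utf8, utf8 + 6, out, out + 4);
    return status == 0 && in_pos == utf8 + 6 && out_pos == out + 3
        && out[0] == 0x41 && out[1] == 0xE9 && out[2] == 0x20AC;
}
```

Because nothing is passed by reference, the caller learns how far the conversion got from the returned positions rather than from modified arguments.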

(Edit by jogojapan)

If you must keep passing the iterators as references because you want to update the text position they are pointing at, neither of the problems described in the question can be solved directly.

Problem 1: wideString is a local array. Bound to a reference parameter it does not decay, so the third argument deduces the template parameter as wchar_t[100] while the fourth deduces wchar_t*, and deduction fails. Converting the array to wchar_t* explicitly yields an rvalue, and that cannot be bound to a wchar_t *& non-const reference. In other words, you cannot have a function modify the address of a local array. Casting it to pointer does not change that fact, and the compiler is wrong when it accepts that solution.
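The deduction behaviour behind Problem 1 can be checked in isolation. The following is a minimal sketch; `deduced`, `by_ref` and `by_value` are hypothetical helpers introduced only to expose the type the compiler deduces for `T`.

```cpp
#include <type_traits>

// Hypothetical helper: carries the deduced template argument as a nested type.
template <typename T>
struct deduced { typedef T type; };

template <typename T>
deduced<T> by_ref(T&) { return deduced<T>(); }   // reference parameter

template <typename T>
deduced<T> by_value(T) { return deduced<T>(); }  // value parameter

wchar_t wideString[100];

// Through a reference the array keeps its type: T is wchar_t[100].
static_assert(std::is_same<decltype(by_ref(wideString))::type, wchar_t[100]>::value,
              "array type preserved by reference");

// By value the array decays: T is wchar_t*.
static_assert(std::is_same<decltype(by_value(wideString))::type, wchar_t*>::value,
              "array decays to pointer by value");
```

With reference parameters, `wideString` and `wideString + 100` therefore ask for two different values of the same template parameter, which is why deduction fails.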

Problem 2: Similarly, passing the address of nCodepoint by reference is impossible, as that address cannot be changed. The only solution is to store the address in a separate pointer first, and then pass that:

unsigned long *pCodepoint = &nCodepoint;
Utf8toUtf32(pIter,pIter+5,pCodepoint,pCodepoint+1);
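The same restriction can be reproduced with a much smaller function. In this sketch, `write_one` is a hypothetical stand-in for the question's function: it writes a single value and advances the caller's pointer, so it needs a modifiable lvalue as its first argument.

```cpp
// Hypothetical stand-in: stores one value and moves dest forward.
template <typename Iter>
void write_one(Iter& dest, Iter dest_end, unsigned long value)
{
    if (dest != dest_end)
        *dest++ = value;
}

inline bool demo_named_pointer()
{
    unsigned long nCodepoint = 0;
    unsigned long* pCodepoint = &nCodepoint;  // a named pointer is an lvalue

    write_one(pCodepoint, &nCodepoint + 1, 0x20ACul);  // fine: pCodepoint is advanced

    // write_one(&nCodepoint, &nCodepoint + 1, 0x20ACul);
    // would not compile: &nCodepoint is an rvalue and cannot bind to Iter&

    return nCodepoint == 0x20AC && pCodepoint == &nCodepoint + 1;
}
```

After the call, `pCodepoint` points one past `nCodepoint`, which is how the caller can tell that one code point was written.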

(Another edit by jogojapan)

If you want to pass by reference, but you want to make the function flexible enough to accept non-reference parameters as well, you can actually provide overloaded definitions of the template:

/* Using C++11 code for convenience. Rewriting in C++03 is easy. */
#include <type_traits>

template <typename T>
using noref = typename std::remove_reference<T>::type;

template <typename Iter8, typename Iter32>
int Utf8toUtf32 (Iter8 &from, const Iter8 from_end, Iter32 &dest, const Iter32 dest_end)
{
  return 0;
}

template <typename Iter8, typename Iter32>
int Utf8toUtf32 (Iter8 &from, const Iter8 from_end, noref<Iter32> dest, const Iter32 dest_end)
{
  noref<Iter32> p_dest = dest;
  return Utf8toUtf32(from,from_end,p_dest,dest_end);
}

template <typename Iter8, typename Iter32>
int Utf8toUtf32 (noref<Iter8> from, const Iter8 from_end, Iter32 &dest, const Iter32 dest_end)
{
  noref<Iter8> p_from = from;
  return Utf8toUtf32(p_from,from_end,dest,dest_end);
}

template <typename Iter8, typename Iter32>
int Utf8toUtf32 (noref<Iter8> from, const Iter8 from_end, noref<Iter32> dest, const Iter32 dest_end)
{
  noref<Iter8>  p_from = from;
  noref<Iter32> p_dest = dest;
  return Utf8toUtf32(p_from,from_end,p_dest,dest_end);
}

You can then call this with all kinds of combinations of lvalues and rvalues:

#include <iterator>  // std::begin, std::end
#include <string>    // std::string

int main()
{
  char input[]        = "hello";
  const char *p_input = input;
  unsigned long dest;
  unsigned long *p_dest = &dest;
  std::string input_str("hello");

  Utf8toUtf32(input,input+5,&dest,&dest+1);
  Utf8toUtf32(p_input,p_input+5,&dest,&dest+1);

  Utf8toUtf32(input,input+5,p_dest,p_dest+1);
  Utf8toUtf32(p_input,p_input+5,p_dest,p_dest+1);

  Utf8toUtf32(begin(input_str),end(input_str),p_dest,p_dest+1);
  Utf8toUtf32(begin(input_str),end(input_str),&dest,&dest+1);

  return 0;
}

But be warned: when passing an rvalue (such as an array or an expression like &local_var), the call will work and there will be no undefined behaviour, but the address of the local variable or array will of course still not change. So in this situation the caller won't be able to find out how many characters the function was able to process.

Community
dyp
  • As earlier noted, `_from` and `_dest` MUST be passed by reference. – Twifty Jul 11 '13 at 07:24
  • @Waldermort Why? If you want to notify the user of the function at which position the function has ended, you could return those as iterators. I don't see a reason to change the arguments passed to the function. – dyp Jul 11 '13 at 07:24
  • please read above comments. Both `_dest` and `from` are updated, there is no point returning them as a pair as well as returning my int return type. – Twifty Jul 11 '13 at 07:28
  • @DyP `const &` causes ambiguities, because the compiler may deduce the template parameter such that it includes `const` as part of the deduced type. Not sure how `&&` could work; as you say, it matches always, so it can't be used for disambiguation. – jogojapan Jul 11 '13 at 08:20
  • @jogojapan Oops, of course that would reintroduce the original problem if `std::begin` etc would not be used explicitly. I think/thought `&&` could be used as pointer prvalues are bound to rvalue refs via temporaries, and changing those doesn't have ill effects. – dyp Jul 11 '13 at 08:58