
I am trying to get proper character descriptions out of a legacy FAME database file. This basically works, but umlauts and similar characters are not printed correctly. The descriptions are retrieved by the following C function, which is part of the R package FAME, so this is more of a C question than an R question.

void fameWhat(int *status, int *dbkey, char **objnam, int *class,
              int *type, int *freq, int *basis, int *observ,
              int *fyear, int *fprd, int *lyear, int *lprd,
              int *obs, int *range,
              int *getdoc, char **desPtr, char **docPtr){
  /* Get info about an object. Note that range should be an int[3] on input */
  int cyear, cmonth, cday, myear, mmonth, mday;
  int i;
  char fdes[256], fdoc[256];

  if(*getdoc){
    if(strlen(*desPtr) < 256 || strlen(*docPtr) < 256){
      *status = HBNCHR;
      return;
    }
    for(i = 0; i < 255; ++i) fdes[i] = fdoc[i] = ' ';
  }
  fdes[255] = fdoc[255] = '\0';

  cfmwhat(status, *dbkey, *objnam, class, type, freq, basis, observ,
          fyear, fprd, lyear, lprd, &cyear, &cmonth, &cday, &myear,
          &mmonth, &mday, fdes, fdoc);
  if(*getdoc){
    strncpy(*desPtr, fdes, 256);
    strncpy(*docPtr, fdoc, 256);
  }
  if(*status == 0 && *class == HSERIE)
    cfmsrng(status, *freq, fyear, fprd, lyear, lprd, range, obs);
  return;
}

I suspect that because desPtr, the pointer to pointer that points to the description, is of type char, I do not get proper umlauts when calling this function from R and displaying the result in an R console. My hunch is that FAME is Latin-1 encoded, while R uses UTF-8. For ä, for example, I get \U3e34653c.

So is there a way to handle this in C and pass proper values to R, or should I rather search and replace within R?

Note: I have seen the threads "Using Unicode in C++ source code" and "How to use utf8 character arrays in c++?".

Matt Bannert

1 Answer


It seems you have several stacked layers of encoding/decoding. How did you get such a long Unicode value for a single character in the first place?

Translating the raw hex bytes of that long code to ASCII gives either >4e< or <e4> (depending on endianness). The latter, interpreted as a bracketed hex value, is the ä you were expecting: http://www.fileformat.info/info/unicode/char/00E4/index.htm. E4 is also the valid Latin-1 encoding of that character.
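To make the byte arithmetic concrete, here is a minimal sketch that splits the reported value 0x3e34653c into its four bytes in either order. The function name `split_bytes` and its `msb_first` flag are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write the four bytes of `value` into `out` as a
   NUL-terminated string, most significant byte first if `msb_first` is
   nonzero, least significant byte first otherwise. */
void split_bytes(uint32_t value, int msb_first, char out[5])
{
    int i;

    assert(out != 0);
    for (i = 0; i < 4; ++i) {
        /* Shift amount for output position i: 24, 16, 8, 0 or the reverse. */
        int shift = msb_first ? 8 * (3 - i) : 8 * i;
        out[i] = (char)((value >> shift) & 0xff);
    }
    out[4] = '\0';
}
```

Calling `split_bytes(0x3e34653cu, 1, buf)` fills `buf` with `>4e<`, and passing `0` for `msb_first` gives `<e4>`, since the bytes 3e, 34, 65 and 3c are the ASCII codes for `>`, `4`, `e` and `<`.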

Converting from this coded format to UTF-8 is relatively simple, although I am not sure where to insert this code into the existing routine. As a sample standalone program:

#include <stdio.h>
#include <stdlib.h>

int main (void)
{
    char input[] = "a sm<F6>rg<E5>sbord of <code>";
    char *sourceptr, *destptr, *endptr;
    int latin1;

    sourceptr = input;
    destptr = input;
    while (*sourceptr)
    {
        if (*sourceptr == '<')
        {
            latin1 = strtol (sourceptr+1, &endptr, 16);
            if (endptr && *endptr == '>' && latin1 > 127 && latin1 <= 255)
            {
            /*  printf ("we saw hex code %xh\n", latin1); */
            /*  Quick-and-dirty converting to UTF8: */
                *destptr = (char)(0xc0 | ((latin1 & 0xc0) >> 6));
                destptr++;
                *destptr = (char)(0x80 | (latin1 & 0x3f));
                destptr++;
                sourceptr = endptr+1;
                continue;
            }
        }
        *destptr = *sourceptr;
        sourceptr++;
        destptr++;
    }
    *destptr = 0;
    printf ("output: %s\n", input);

    return 0;
}

This scans the input string for < followed by a valid hex code (assumed to be Latin-1 and therefore restricted to 80..FF) and a closing >. When such a sequence is found, it is replaced by the character in UTF-8 format. Unrecognized sequences are copied as-is.
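The same scan can be packaged as a function that works on any writable string, which may be easier to call from an existing routine. This is only a sketch: the name `hex_escapes_to_utf8` is hypothetical, and it assumes the description buffer really contains `<XX>` sequences at the point where it is called, which has not been confirmed for the FAME function above. Unlike the standalone program, it insists on exactly two hex digits between the brackets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: replace every "<XX>" sequence, where XX is a
   two-digit hex value in 80..FF, by the two-byte UTF-8 encoding of that
   Latin-1 character. Works in place, because each 4-byte sequence shrinks
   to 2 bytes. Returns the length of the converted string. */
size_t hex_escapes_to_utf8(char *s)
{
    char *src = s, *dst = s, *end;

    assert(s != 0);
    while (*src) {
        if (*src == '<') {
            long cp = strtol(src + 1, &end, 16);
            /* end == src + 3 means exactly two characters were consumed. */
            if (end == src + 3 && *end == '>' && cp >= 0x80 && cp <= 0xff) {
                *dst++ = (char)(0xc0 | (cp >> 6));   /* lead byte: top 2 bits */
                *dst++ = (char)(0x80 | (cp & 0x3f)); /* trail byte: low 6 bits */
                src = end + 1;
                continue;
            }
        }
        *dst++ = *src++; /* anything else is copied unchanged */
    }
    *dst = '\0';
    return (size_t)(dst - s);
}
```

One conceivable place to call it would be on `*desPtr` and `*docPtr` right after the two `strncpy` calls, but whether the bracketed codes already exist at that stage or are only produced later when R displays the string would need to be checked first.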

Jongware
  • +1 for the file format link. How did I get such a long string? Good question. The legacy database FAME has a C interface, so I use the function above to access the database and get a description out of it. I call the function from R, which can call C functions and return results interactively, as R is a scripting language. This works really well in general, except for these umlauts. I simply don't know why I get something that `iconv` etc. can't fix. – Matt Bannert Oct 22 '14 at 09:55
  • @Matt: perhaps you need to scan the input string in your function for this simple encoding and convert the found hex sequences to proper UTF8? – Jongware Oct 22 '14 at 10:01
  • By scan, do you mean modifying the C function, or rather post-processing the result? Can you give an example? I am rather the data/stats/R guy here :) – Matt Bannert Oct 22 '14 at 10:16
  • @Matt: I added a basic conversion example to my answer but I actually have no idea where to insert it into your existing function. Perhaps someone else can help you with that. – Jongware Oct 22 '14 at 11:42
  • Thanks a lot, I think I should be able to manage that as soon as I have time to try it out. I think I'll have to access the db directly using C, then make C print the output somewhere and look at the result. – Matt Bannert Oct 22 '14 at 11:54