I received some files but, sadly, I have no information about how they were generated. I need to parse these files.

The file is entirely ASCII apart from one byte: 0xDB (219 in decimal).

From looking at the file, this character is clearly a currency symbol. I know this because:

  • it is mandatory for these files to contain a currency symbol anywhere an amount appears
  • there is no other currency symbol (neither $ nor the euro sign nor anything else) anywhere in the files
  • every time 0xDB appears, it is next to an amount

I think that 0xDB is supposed to represent the Euro symbol in these files (it is very probable that 0xDB appears everywhere a Euro symbol is supposed to appear).

The file command says this about the files:

ISO-8859 English text, with CRLF, LF line terminators

A hexdump gives this:

00000030  71 75 61 6e 74 20 db 32  2e 36 30 0a 20 41 49 4d  |quant .2.60. AIM|
                            ^^                                     ^

The files are otherwise normally formatted and parsable. In fact, I can extract all the information fine apart from that weird 0xDB character.

Does anyone know what's going on? How did a currency symbol (supposedly the euro symbol) somehow become a 0xDB?

It's neither ISO-8859-1 (aka ISO Latin 1) nor ISO-8859-15, because in both cases code point 219 corresponds to 'Û' (just as Unicode code point 219 is 'LATIN CAPITAL LETTER U WITH CIRCUMFLEX').

It's not extended-ASCII.
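The ISO-8859 check described above can be reproduced with a few lines of Python. This is only a sketch using the standard library codecs `latin_1` and `iso8859_15`; the variable names are made up for illustration.

```python
# Decode the single non-ASCII byte under the two ISO-8859 encodings
# that were ruled out in the question.
raw = bytes([0xDB])

latin1_char = raw.decode("latin_1")      # ISO-8859-1
latin9_char = raw.decode("iso8859_15")   # ISO-8859-15

for label, char in (("ISO-8859-1", latin1_char), ("ISO-8859-15", latin9_char)):
    # Print the resulting Unicode code point next to the character itself.
    print(f"{label}: 0xDB -> U+{ord(char):04X} {char}")
```

Both decodings should yield U+00DB, which is why neither encoding explains a currency symbol at that position.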

NoozNooz42
  • It's not any of the ISO-8859-* variants, and it's not any of the standard Microsoft code pages, either. – dkarp Jan 30 '11 at 17:14

4 Answers


It could be Mac OS Roman

Jeff Ames
  • +1... It makes perfect sense, these files at one point got processed/re-transmitted to/from a Mac computer. How did you find this? I tried to Google it a bit but couldn't find anything... – NoozNooz42 Jan 30 '11 at 17:15
  • I figured it would probably be one of the character sets listed on http://en.wikipedia.org/wiki/Western_Latin_character_sets_(computing), so it was just a matter of checking each one for 0xDB. – Jeff Ames Jan 30 '11 at 17:18
  • Since he knows what character it's meant to map to, it's easier to [just look at the character](http://www.fileformat.info/info/unicode/char/20ac/charset_support.htm) rather than to search through charsets. – dkarp Jan 30 '11 at 17:22
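The "start from the character" approach from the last comment can be sketched in Python: encode the Euro sign under a handful of candidate codecs and keep those that produce the single byte 0xDB. The candidate list and the helper name `codecs_mapping_to` are assumptions chosen for illustration, not an exhaustive survey of character sets.

```python
EURO = "\u20ac"  # EURO SIGN

# A hypothetical shortlist of Western codecs shipped with Python.
CANDIDATES = ["latin_1", "iso8859_15", "cp1252", "cp437",
              "cp850", "cp858", "mac_roman"]

def codecs_mapping_to(char, byte_value, candidates):
    """Return the codecs that encode `char` as the single byte `byte_value`."""
    matches = []
    for name in candidates:
        try:
            encoded = char.encode(name)
        except UnicodeEncodeError:
            # This codec has no representation for the character at all.
            continue
        if encoded == bytes([byte_value]):
            matches.append(name)
    return matches

matches = codecs_mapping_to(EURO, 0xDB, CANDIDATES)
print(matches)
```

Extending `CANDIDATES` with more codec names widens the search without changing the helper.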

It's MacRoman. In fact it has to be -- that's the only charset in which the Euro sign maps to 0xDB.

Here's the full charset mapping for MacRoman.
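As a rough illustration of what this means for parsing, the 16 bytes from the hexdump in the question can be decoded with Python's built-in `mac_roman` codec. Only those 16 bytes come from the question; the variable names are illustrative.

```python
# The 16 bytes shown in the question's hexdump (offset 0x30).
sample = bytes.fromhex("71 75 61 6e 74 20 db 32 2e 36 30 0a 20 41 49 4d")

# Decode as Mac OS Roman so that 0xDB becomes U+20AC (EURO SIGN).
text = sample.decode("mac_roman")
print(repr(text))
```

For whole files, passing `encoding="mac_roman"` to `open()` should have the same effect; adding `newline=""` keeps the mixed CRLF/LF line terminators reported by `file` untouched.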

dkarp

Using the macroman script, one learns:

$ macroman 0xDB
MacRoman DB  ⇒  U+20AC  ‹€›  \N{ EURO SIGN }

You can go the other way, too:

$ macroman U+00E9
MacRoman 8E  ⇐  U+00E9  ‹é›  \N{ LATIN SMALL LETTER E WITH ACUTE }
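If the macroman script is not at hand, both lookups can be approximated with Python's standard `mac_roman` codec and the `unicodedata` module. This is a sketch rather than a reimplementation of the script; the function names and the output format are invented for illustration.

```python
import unicodedata

def describe_macroman_byte(byte_value):
    """Map a MacRoman byte to its Unicode code point, character and name."""
    char = bytes([byte_value]).decode("mac_roman")
    return (f"MacRoman {byte_value:02X} => U+{ord(char):04X} "
            f"{char} {unicodedata.name(char)}")

def macroman_byte_for(char):
    """Map a single Unicode character back to its MacRoman byte value."""
    return char.encode("mac_roman")[0]

print(describe_macroman_byte(0xDB))
print(f"U+00E9 => MacRoman {macroman_byte_for(chr(0xE9)):02X}")
```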

And we know that U+20AC EURO SIGN is indeed a currency symbol because of the uniprops script’s output:

$ uniprops -a U+20AC
U+20AC <€> \N{ EURO SIGN }:
    \pS \p{Sc}
    All Any Assigned InCurrencySymbols Common Zyyy Currency_Symbol Sc Currency_Symbols S Gr_Base Grapheme_Base Graph GrBase Print Symbol X_POSIX_Graph X_POSIX_Print
    Age=2.1 Bidi_Class=ET Bidi_Class=European_Terminator BC=ET Block=Currency_Symbols Canonical_Combining_Class=0 Canonical_Combining_Class=Not_Reordered CCC=NR Canonical_Combining_Class=NR Script=Common Decomposition_Type=None DT=None East_Asian_Width=A East_Asian_Width=Ambiguous EA=A Grapheme_Cluster_Break=Other GCB=XX Grapheme_Cluster_Break=XX Hangul_Syllable_Type=NA Hangul_Syllable_Type=Not_Applicable HST=NA Joining_Group=No_Joining_Group JG=NoJoiningGroup Joining_Type=Non_Joining JT=U Joining_Type=U Line_Break=PR Line_Break=Prefix_Numeric LB=PR Numeric_Type=None NT=None Numeric_Value=NaN NV=NaN Present_In=2.1 IN=2.1 Present_In=3.0 IN=3.0 Present_In=3.1 IN=3.1 Present_In=3.2 IN=3.2 Present_In=4.0 IN=4.0 Present_In=4.1 IN=4.1 Present_In=5.0 IN=5.0 Present_In=5.1 IN=5.1 Present_In=5.2 IN=5.2 Present_In=6.0 IN=6.0 SC=Zyyy Script=Zyyy Sentence_Break=Other SB=XX Sentence_Break=XX Word_Break=Other WB=XX Word_Break=XX _X_Begin
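A few of these properties can also be queried from Python's `unicodedata` module, which exposes only a small subset of what uniprops prints. A minimal sketch:

```python
import unicodedata

euro = "\u20ac"

# General category: "Sc" means Symbol, currency.
category = unicodedata.category(euro)
# Bidirectional class: "ET" means European Terminator.
bidi = unicodedata.bidirectional(euro)
# East Asian width: "A" means Ambiguous.
width = unicodedata.east_asian_width(euro)

print(unicodedata.name(euro), category, bidi, width)
```

Testing `unicodedata.category(ch) == "Sc"` is one way to spot currency symbols generically once the file has been decoded.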
tchrist

0xDB represents the Euro sign in the Mac OS Roman character encoding.

Mormegil