149

I want to store a user's gender in a database with as little (size/performance) cost as possible.

So far, 3 scenarios come to mind

  1. Int - aligned with Enum in code (1 = Male, 2 = Female, 3 = ...)
  2. char(1) - Store m, f or another single character identifier
  3. Bit (boolean) - is there an appropriate field name for this option?

The reason I ask is because of this answer which mentions that chars are smaller than booleans.

I should clarify that I'm using MS SQL 2008, which DOES in fact have the bit datatype.

Raedwald
  • 46,613
  • 43
  • 151
  • 237
Marko
  • 71,361
  • 28
  • 124
  • 158

8 Answers8

205

There is already an ISO standard for this; no need to invent your own scheme:

http://en.wikipedia.org/wiki/ISO_5218

Per the standard, the column should be called "Sex" and the 'closest' data type would be tinyint with a CHECK constraint or lookup table as appropriate.

Pondlife
  • 15,992
  • 6
  • 37
  • 51
  • 4
    Why does it skip to 9 for 'not applicable'? What about 3-8? – Kenmore Jun 05 '15 at 06:49
  • 4
    This is for sex. OP specifically asked for gender. Sex and gender likely have different possible values that may need to be captured. – indigochild Oct 15 '15 at 20:21
  • 5
    @indigochild The OP uses both words in the question title and clearly considers them to be equivalent, at least *for his use case* (YMMV). My point is simply that an ISO standard exists in this area and you should never waste time on devising your own scheme when an official standard exists. Unless of course that standard doesn't cover your particular case, which is entirely possible. – Pondlife Oct 17 '15 at 14:54
  • 1
    This should be the accepted answer. It focuses on data integrity (which is ~forever) instead of optimization (which is situational). – Paul Cantrell Oct 31 '15 at 16:03
  • Is there a newer ISO standard with codes for gender-fluid or other non-binary gender conditions? Or are we still figuring out what to do there? Obviously you can just make up codes, but perhaps a link to some current best practices would be good. Or maybe that's going to be different for different kinds of software. – Peter Cordes May 02 '18 at 08:38
  • 2
    This should definitely be the answer. @PeterCordes this ISO is used for Sex (biological sex) and not Gender (what you identify as) - [explanation here](https://en.wikipedia.org/wiki/Sex_and_gender_distinction). I guess in the case of wanting to store the gender (which, I wouldn't know which use you have doing this), a tiny int is still good enough as long as you want to store less than 255 genders (by saying f.e. 0 = unknown/not wanting to declare, 1 = man, 2 = woman, 3 = man identifying as woman, etc.) – Solid Feb 26 '19 at 14:22
89

I'd call the column "gender".

Data Type   Bytes Taken          Number/Range of Values
------------------------------------------------
TinyINT     1                    255 (zero to 255)
INT         4            -       2,147,483,648 to 2,147,483,647
BIT         1 (2 if 9+ columns)  2 (0 and 1)
CHAR(1)     1                    26 if case insensitive, 52 otherwise

The BIT data type can be ruled out because it only supports two possible genders which is inadequate. While INT supports more than two options, it takes 4 bytes -- performance will be better with a smaller/more narrow data type.

CHAR(1) has the edge over TinyINT - both take the same number of bytes, but CHAR provides a more narrow number of values. Using CHAR(1) would make using "m", "f",etc natural keys, vs the use of numeric data which are referred to as surrogate/artificial keys. CHAR(1) is also supported on any database, should there be a need to port.

Conclusion

I would use Option 2: CHAR(1).

Addendum

An index on the gender column likely would not help because there's no value in an index on a low cardinality column. Meaning, there's not enough variety in the values for the index to provide any value.

shogged
  • 281
  • 2
  • 10
OMG Ponies
  • 325,700
  • 82
  • 523
  • 502
  • Any reference to performance? I know it's almost micro-optimising which I shouldn't do, but it's food for my curious mind. – Marko Nov 14 '10 at 02:32
  • Thanks @OMG Ponies, what about performance? Would a char be most costly than a bit in this case? – Marko Nov 14 '10 at 02:45
  • 4
    @Marko: Like I said before, they're equal. But an index likely would *not* help because there's no value in an index on a low cardinality column. Meaning, there's not enough variety in the values for the index to provide any value. – OMG Ponies Nov 14 '10 at 02:49
  • But using char(1) will introduce the charset problem, especially when you have multi-byte fields in the same table. You have to specify different charset for the gender column. – addlistener Apr 18 '14 at 14:55
  • 1
    How much better is performance *really* going to be using, say, a 4 byte data type on a 64-bit platform? Just saying... ;-) – Craig Tullis Apr 13 '16 at 20:12
  • "If" there's a need? – shogged May 31 '18 at 10:42
45

In medicine there are four genders: male, female, indeterminate, and unknown. You mightn't need all four but you certainly need 1, 2, and 4. It's not appropriate to have a default value for this datatype. Even less to treat it as a Boolean with 'is' and 'isn't' states.

user207421
  • 305,947
  • 44
  • 307
  • 483
  • 1
    @EJP, interesting. Do you have a reference to this? – Marko Nov 14 '10 at 02:37
  • Based on this information, I would go with `TinyInt` aligned with an enum (as Hugo suggests) and go with at least 1, 2, and 3 (Other). – IAbstract Nov 14 '10 at 04:11
  • 1
    @EJP, though your answer is probably correct, it does NOT say what datatype I should use, but rather - what the (technically) correct genders are. – Marko Nov 14 '10 at 20:56
  • @Marko: I cannot see the point of this comment. Your question didn't *ask* 'which datatype'. My answer eliminates two of your three suggestions and provides background reasoning. – user207421 Apr 15 '11 at 06:24
  • Does this argue for using a float, using NaN for unknowns? – Tom Anderson Jun 06 '11 at 22:49
  • @Tom Anderson The domain is larger than necessary ;-) – user207421 Jun 22 '11 at 10:51
  • @IAbstract 'Based on this information' you need four values, not three. – user207421 Apr 04 '12 at 23:26
  • 25
    [UK National Health Service (NHS) data dictionary](http://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_code_de.asp?shownav=1) defines four values: 0 = `Not Known`, 1 = `Male`, 2 = `Female`, 9 = `Not Specified`, which mirrow the [ISO 5218](http://en.wikipedia.org/wiki/ISO_5218) values. Note there are [two types](http://www.datadictionary.nhs.uk/data_dictionary/attributes/p/person/person_gender_type_de.asp?query=gender&rank=75&shownav=1): gender at registration (usually shortly following birth) and current. – onedaywhen Apr 13 '12 at 14:51
3

An Int (or TinyInt) aligned to an Enum field would be my methodology.

First, if you have a single bit field in a database, the row will still use a full byte, so as far as space savings, it only pays off if you have multiple bit fields.

Second, strings/chars have a "magic value" feel to them, regardless of how obvious they may seem at design time. Not to mention, it lets people store just about any value they would not necessarily map to anything obvious.

Third, a numeric value is much easier (and better practice) to create a lookup table for, in order to enforce referential integrity, and can correlate 1-to-1 with an enum, so there is parity in storing the value in memory within the application or in the database.

Hugo
  • 1,814
  • 1
  • 15
  • 23
1

Option 3 is your best bet, but not all DB engines have a "bit" type. If you don't have a bit, then TinyINT would be your best bet.

ajacian81
  • 7,419
  • 9
  • 51
  • 64
-1

I use char 'f', 'm' and 'u' because I surmise the gender from name, voice and conversation, and sometimes don't know the gender. The final determination is their opinion.

It really depends how well you know the person and whether your criteria is physical form or personal identity. A psychologist might need additional options - cross to female, cross to male, trans to female, trans to male, hermaphrodite and undecided. With 9 options, not clearly defined by a single character, I might go with Hugo's advice of tiny integer.

zarac
  • 23
  • 1
-3
CREATE TABLE Admission (
    Rno INT PRIMARY KEY AUTO_INCREMENT,
    Name VARCHAR(25) NOT NULL,
    Gender ENUM('M','F'),
    Boolean_Valu boolean,
    Dob Date,
    Fees numeric(7,2) NOT NULL
);




insert into Admission (Name,Gender,Boolean_Valu,Dob,Fees)values('Raj','M',true,'1990-07-12',50000);
insert into Admission (Name,Gender,Boolean_Valu,Dob,Fees)values('Rani','F',false,'1994-05-10',15000);
select * from admission;

enter link description here

Jaydip Jadhav
  • 12,179
  • 6
  • 24
  • 40
-7

I would go with Option 3 but multiple NON NULLABLE bit columns instead of one. IsMale (1=Yes / 0=No) IsFemale (1=Yes / 0=No)

if requried: IsUnknownGender (1=Yes / 0=No) and so on...

This makes for easy reading of the definitions, easy extensibility, easy programmability, no possibility of using values outside the domain and no requirement of a second lookup table+FK or CHECK constraints to lock down the values.

EDIT: Correction, you do need at least one constraint to ensure the set flags are valid.

HansLindgren
  • 339
  • 2
  • 9
  • 3
    It would be nice to hear why my answer gets downvoted? – HansLindgren Jul 30 '19 at 10:54
  • Without constraints, nothing prevents all of the columns from being 1, or all of them from being 0. Which would be nonsensical, so your scheme doesn't satisfy one of your claims. – Jay Kominek Jul 11 '20 at 21:10
  • Yes, you are right that you do need one constraint to check that the correct number of flags are 'checked'. I don't think all down votes are for that omission however... – HansLindgren Jul 13 '20 at 09:48
  • It is a heavily visited question (look at the upvotes for some of the other answers!), and you came along years later and added an answer that amounts to one-hot encoding, a widely taught technique, which doesn't even have the few concrete properties you attribute to it. I don't think it was right to vote you below 0, but I'm not surprised it happened either. – Jay Kominek Jul 13 '20 at 18:27