It's a little unclear in what direction your question goes. But if you want a good relational model to store the data into, that would be three tables. One for the files. One for the states. One for the count of the states in a file. For example:
The tables:
CREATE TABLE file
(id integer
AUTO_INCREMENT,
path varchar(256)
NOT NULL,
PRIMARY KEY (id),
UNIQUE (path));
CREATE TABLE state
(id integer
AUTO_INCREMENT,
abbreviation varchar(2)
NOT NULL,
PRIMARY KEY (id),
UNIQUE (abbreviation));
CREATE TABLE occurrences
(file integer,
state integer,
count integer
NOT NULL,
PRIMARY KEY (file,
state),
FOREIGN KEY (file)
REFERENCES file
(id),
FOREIGN KEY (state)
REFERENCES state
(id),
CHECK (count >= 0));
The data:
INSERT INTO files
(path)
VALUES ('521');
INSERT INTO states
(abbreviation)
VALUES ('OH'),
('PA'),
('IN'),
('TX'),
('IL');
INSERT INTO occurrences
(file,
state,
count)
VALUES (1,
1,
4),
(1,
2,
1),
(1,
3,
2),
(1,
4,
3),
(1,
4,
7);
The states of course would be reused. Fill the table with all 50 and use them. They should not be inserted for every file again.
You can fill occurrences
explicitly with a count
of 0
for file where the respective state didn't appear, if you want to distinguish between "I know it's 0." and "I don't know the count.", which would then be encoded through the absence of a corresponding row. If you don't want to distinguish that and no row means a count of 0
, you can handle that in queries by using outer joins and coalesce()
to "translate" to 0
.