
I have written Perl code that parses a text file and uses a hash to tally up the number of times a US State abbreviation appears in each file/record. I end up with something like this.

File: 521
OH => 4
PA => 1
IN => 2
TX => 3
IL => 7

I am struggling to find a way to store such hash results in an SQL database. I am using MariaDB. The structure of the data itself varies: one file will have some states and the next may have others. For example, one file may contain only a few states, and the next may contain a group of completely different states. I am even having trouble conceptualizing the table structure. What would be the best way to store data like this in a database?
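For context, here is a minimal sketch of the kind of tally described above. The sample lines and the fixed five-state alternation are made up for illustration; the real parser presumably matches all state abbreviations:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input lines standing in for one parsed file/record.
my @lines = (
    'OH PA OH TX',
    'IN IL OH TX IN',
    'OH TX IL IL IL IL IL IL',
);

# Tally every standalone state code into a hash, as the question describes.
my %count;
for my $line (@lines) {
    $count{$_}++ for $line =~ /\b(OH|PA|IN|TX|IL)\b/g;
}

print "$_ => $count{$_}\n" for sort keys %count;
```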

gatorreina
  • [`DBIx::Class`](https://metacpan.org/dist/DBIx-Class/view/lib/DBIx/Class.pod) or another ORM if you're looking for programmer efficiency? – Shawn Jan 16 '22 at 15:58
  • [`Storable`](https://metacpan.org/pod/Storable). Wrappers are available. – Holli Jan 16 '22 at 16:00
  • Maybe use the [JSON data type](https://dev.mysql.com/doc/refman/8.0/en/json.html) ? – Håkon Hægland Jan 16 '22 at 16:02
  • What is `File: 521`? Is `521` a filename or some kind of id? What kind of manipulation would you do on this data? There are many databases available for free, and sometimes a simple task does not require a full database implementation. If you would like to store/read hash data then [JSON](https://en.wikipedia.org/wiki/JSON) or [YAML](https://en.wikipedia.org/wiki/YAML) may be sufficient. If the operations are very basic then [SQLite](https://www.sqlite.org/) might be a good option. – Polar Bear Jan 16 '22 at 20:09
  • 521 is an id. I think JSON and YAML are ideal due to the structure of the data, but I do not know of a way to remotely save and retrieve that data. The server is a DB server and nothing else. – gatorreina Jan 17 '22 at 00:09
  • Please clarify via edits, not comments. PS This is not clear. What does "the structure of the data itself varies" mean exactly? Where are you first stuck following what presentation of what DB design method? Right now you are effectively asking us to (re)write a textbook with a bespoke tutorial. PS [How do you effectively model inheritance in a database?](https://stackoverflow.com/q/190296/3404097) – philipxy Jan 17 '22 at 00:12
  • The data varies because one file may have five states, the next another ten, the next only three, etc. Most files won't have more than 10 states, but the states will vary from file to file and, as a result, from record to record. As I commented below, I am tempted to just create a table with an ID column and 48 more columns, one for each state (don't need Alaska and Hawaii). I know it will be less efficient but there should never be much more than 5,000 records in the table. – gatorreina Jan 17 '22 at 00:18

2 Answers


It's a little unclear what direction your question is going in. But if you want a good relational model to store the data in, that would be three tables: one for the files, one for the states, and one for the count of each state in a file. For example:

The tables:

CREATE TABLE file
             (id integer AUTO_INCREMENT,
              path varchar(256) NOT NULL,
              PRIMARY KEY (id),
              UNIQUE (path));

CREATE TABLE state
             (id integer AUTO_INCREMENT,
              abbreviation varchar(2) NOT NULL,
              PRIMARY KEY (id),
              UNIQUE (abbreviation));

CREATE TABLE occurrences
             (file integer,
              state integer,
              count integer NOT NULL,
              PRIMARY KEY (file, state),
              FOREIGN KEY (file) REFERENCES file (id),
              FOREIGN KEY (state) REFERENCES state (id),
              CHECK (count >= 0));

The data:

INSERT INTO file
            (path)
            VALUES ('521');

INSERT INTO state
            (abbreviation)
            VALUES ('OH'),
                   ('PA'),
                   ('IN'),
                   ('TX'),
                   ('IL');

INSERT INTO occurrences
            (file, state, count)
            VALUES (1, 1, 4),
                   (1, 2, 1),
                   (1, 3, 2),
                   (1, 4, 3),
                   (1, 5, 7);

The states, of course, would be reused: fill the table with all 50 once and reference them. They should not be inserted again for every file.

You can fill occurrences explicitly with a count of 0 for files where the respective state didn't appear, if you want to distinguish between "I know it's 0" and "I don't know the count" (the latter then being encoded by the absence of a corresponding row). If you don't want to make that distinction and no row simply means a count of 0, you can handle that in queries by using outer joins and coalesce() to "translate" missing rows to 0.
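As a sketch of how the whole round trip might look from Perl with DBI: the example below uses an in-memory SQLite database (via `DBD::SQLite`) purely so it runs without a server; against MariaDB you would connect with a `dbi:MariaDB:...` DSN and use the DDL from the answer as-is. The `LEFT JOIN` with `coalesce()` at the end is the "no row means 0" translation just described. The `NY` row and the sample counts are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite stands in for MariaDB so the sketch is self-contained.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Simplified SQLite equivalents of the answer's three tables.
$dbh->do('CREATE TABLE file  (id INTEGER PRIMARY KEY, path TEXT NOT NULL UNIQUE)');
$dbh->do('CREATE TABLE state (id INTEGER PRIMARY KEY, abbreviation TEXT NOT NULL UNIQUE)');
$dbh->do('CREATE TABLE occurrences (
              file  INTEGER REFERENCES file(id),
              state INTEGER REFERENCES state(id),
              count INTEGER NOT NULL CHECK (count >= 0),
              PRIMARY KEY (file, state))');

# Prefill the lookup table once; only a handful of states shown here.
$dbh->do(q{INSERT INTO state (abbreviation)
           VALUES ('OH'),('PA'),('IN'),('TX'),('IL'),('NY')});

# Store one file's tally hash.
my %tally = (OH => 4, PA => 1, IN => 2, TX => 3, IL => 7);
$dbh->do(q{INSERT INTO file (path) VALUES ('521')});
my $file_id = $dbh->last_insert_id(undef, undef, undef, undef);

my $ins = $dbh->prepare(q{
    INSERT INTO occurrences (file, state, count)
    SELECT ?, id, ? FROM state WHERE abbreviation = ?});
$ins->execute($file_id, $tally{$_}, $_) for keys %tally;

# Outer join + coalesce() turns "no row" into an explicit 0 (NY here).
my $rows = $dbh->selectall_arrayref(q{
    SELECT s.abbreviation, COALESCE(o.count, 0)
    FROM state s
    LEFT JOIN occurrences o ON o.state = s.id AND o.file = ?
    ORDER BY s.abbreviation}, undef, $file_id);

printf "%s => %d\n", @$_ for @$rows;
```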

sticky bit
  • I believe this is a very efficient way of doing it. Nevertheless, I am tempted to just create a table with an ID column and 48 more columns, one for each state (don't need Alaska and Hawaii). I know it will be less efficient but there should never be much more than 5,000 records in the table. – gatorreina Jan 17 '22 at 00:13
  • @gatorreina: That spread sheet way may bite you as soon as you want to do queries over the data for analysis. – sticky bit Jan 17 '22 at 00:43

There are many possible ways to store the data.

For the sake of simplicity, see if the following approach would be an acceptable solution for your case. It is based on a single table with two indexes, on the id and state columns.

CREATE TABLE IF NOT EXISTS `state_count` (
    `id`        INT NOT NULL,
    `state`     VARCHAR(2) NOT NULL,
    `count`     INT NOT NULL,
    INDEX `id` (`id`),
    INDEX `state` (`state`)
);

INSERT INTO `state_count`
    (`id`,`state`,`count`)
VALUES
    (251,'OH',4),
    (251,'PA',1),
    (251,'IN',2),
    (251,'TX',3),
    (251,'IL',7);

Sample SQL SELECT output

MySQL [dbs0897329]> SELECT * FROM state_count;
+-----+-------+-------+
| id  | state | count |
+-----+-------+-------+
| 251 | OH    |     4 |
| 251 | PA    |     1 |
| 251 | IN    |     2 |
| 251 | TX    |     3 |
| 251 | IL    |     7 |
+-----+-------+-------+
5 rows in set (0.000 sec)
MySQL [dbs0897329]> SELECT * FROM state_count WHERE state='OH';
+-----+-------+-------+
| id  | state | count |
+-----+-------+-------+
| 251 | OH    |     4 |
+-----+-------+-------+
1 row in set (0.000 sec)
MySQL [dbs0897329]> SELECT * FROM state_count WHERE state IN ('OH','TX');
+-----+-------+-------+
| id  | state | count |
+-----+-------+-------+
| 251 | OH    |     4 |
| 251 | TX    |     3 |
+-----+-------+-------+
2 rows in set (0.001 sec)
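Reading the table back into the original hash shape is then straightforward. A sketch with DBI (again using an in-memory SQLite database as a stand-in for MariaDB so the snippet is self-contained; the schema and data match the example above):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Swap the DSN for dbi:MariaDB:... to run against the real server.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });
$dbh->do('CREATE TABLE state_count
              (id INT NOT NULL, state TEXT NOT NULL, count INT NOT NULL)');

my $ins = $dbh->prepare(
    'INSERT INTO state_count (id, state, count) VALUES (?, ?, ?)');
$ins->execute(251, @$_)
    for ['OH', 4], ['PA', 1], ['IN', 2], ['TX', 3], ['IL', 7];

# Rebuild per-file hashes: $tally{$id}{$state} = $count.
my %tally;
my $sth = $dbh->prepare('SELECT id, state, count FROM state_count');
$sth->execute;
while (my ($id, $state, $count) = $sth->fetchrow_array) {
    $tally{$id}{$state} = $count;
}

print "$_ => $tally{251}{$_}\n" for sort keys %{ $tally{251} };
```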
Polar Bear