1

I'm having trouble representing data for a project in a database-appropriate way. To help, I've set up a SQL Fiddle for the example at: http://sqlfiddle.com/#!2/6d01c2/14/0

At its simplest, I have three tables: Fund, FundReturn, and FactorReturn. Fund has many monthly returns via FundReturn. Fund also has a Region (like US, or Foreign) and these regions have what's called a FactorReturn. So the FactorReturn's relation to Fund isn't direct but rather through the Region the Fund belongs to.

**Fund**
fund_symbol  | varchar(5)    | PRI
region_key   | varchar(255)  |

**FundReturn**
fund_symbol  | varchar(255)  | PRI
return_month | int(10)       | PRI
return_value | decimal(5,4)  | 

**FactorReturn**
region_key   | varchar(255)  | PRI
factor_key   | varchar(255)  | PRI
return_month | int(10)       | PRI
return_value | decimal(5,4)  | 

The goal is to pull the fund's returns for a given period of time as well as the corresponding factor returns to perform some analysis. The final representation would look something like this:

fund_symbol | month  | fund_return | factor_ret_1 | factor_ret_2 | factor_ret_3
VTI         | 201001 | 0.0100      | 0.0200       | -0.0100      | 0.0000
VTI         | 201002 | 0.0500      | 0.0300       |  0.0300      | 0.0010
VTI         | 201003 | 0.0300      | 0.0100       | -0.0200      | 0.0020

Additionally, I'm using the PHP Doctrine ORM and have mapped the data to be able to use methods as follows:

$fund = $em->find('VTI');
$fund->getFundReturns();   // Would return array with FundReturn objects.
$fund->getFactorReturns(); // Would return array with FactorReturn objects.

Problems


Due to the nature of the data, trying to query for both fund returns as well as factor returns in one query exponentially increases the amount of rows returned. This is exacerbated once I try to use this in the ORM as it is impossible to build the object graph with a broad date range. Using multiple queries is my current solution, but this is tricky when using an ORM as the accessing the relations incorrectly from my code can trigger massive SQL queries.

The query at http://sqlfiddle.com/#!2/6d01c2/14/0 shows how 432 rows are returned when querying for just 12 months.

Questions


So my questions are:

  1. Is there a better way to represent this data in a database?
  2. What's the right way to query for data in SQL? In an ORM?
Cowlby
  • 651
  • 7
  • 16

1 Answers1

1

Firstly given that any solution will no doubt end up with the use of 'JOIN's I would make the all the foreign key integer values. This will make your table relations faster, and more economic in storage and update speed. So for example the fund table would have an integer unique ID (e.g. 1) and also a symbol (natural key e.g. VTI) You can read more here: Surrogate vs. natural/business keys

Also it seems that you may be going down the Entity Attribute Value route, which is much discussed. Entity Attribute Value Database vs. strict Relational Model Ecommerce

This Entity Attribute Value model is complicated to query but gives users flexibility to add their own factor return keys.

Secondly if you go down this route, you are probably going to struggle querying in most ORMs. I would try and build queries manually like the one below:

    SELECT f.*
      ,fr.return_month AS fund_return_month
      ,fr.return_value AS fund_return
      ,hml.return_value AS hml
      ,mkt.return_value AS mkt
      ,smb.return_value as smb

    FROM Fund f
    INNER JOIN FundReturn fr ON f.fund_symbol = fr.fund_symbol

    LEFT JOIN FactorReturn hml 
    ON f.region_key = hml.region_key
    AND hml.factor_key = 'hml'
    AND hml.return_month = fr.return_month

    LEFT JOIN FactorReturn mkt 
    ON f.region_key = mkt.region_key
    AND mkt.factor_key = 'mkt'
    AND mkt.return_month = fr.return_month

    LEFT JOIN FactorReturn smb 
    ON f.region_key = mkt.region_key
    AND smb.factor_key = 'smb'
    AND smb.return_month = fr.return_month

    WHERE f.fund_symbol = 'VTI'
    AND fr.return_month BETWEEN 201001 AND 201012
    AND hml.return_month BETWEEN 201001 AND 201012;
Community
  • 1
  • 1
JustinHui
  • 729
  • 5
  • 18
  • If the plan is to only have 5 factors, would I be better off denormalizing the table and having each factor be a column? Alternatively, some designs suggest using class table inheritance for entity-attribute-value situations but is it wrong/overkill for this type of data? – Cowlby Oct 06 '13 at 15:23
  • Are these factors (factor key) going to be different for each fund, and are they going to be mostly NULL? If so EAV (entity Value Attribute) may be the best way forward. Users won't be able to add custom factor keys with class table inheritance. I personally dislike EAV as it makes querying and reporting an absolute nightmare. – JustinHui Oct 06 '13 at 16:09
  • I was finally able to test out various strategies and I think your answer captures it all. The multiple LEFT JOIN strategy is exactly what I'm looking for on the SQL side but using it via the ORM is not straightforward. Custom hydration via something like Doctrine's Native SQL or some further selection on the PHP side seems to be necessary. Thanks! – Cowlby Oct 07 '13 at 03:47