
Suppose that two variables X and Y are causally and linearly related, so that an increase in X produces an increase in Y (e.g. travel distance for cars and their fuel consumption). Both X and Y are vectors of N observations (N individual cars in the example).

A way to represent such a relation is a simple linear equation Yi = a + bXi, which would describe the relation in the sample of N cases, where i = 1, 2, ..., N. Here a and b are constants, while Y and X are variables.

Do you have any suggestions for how this could be represented in Prolog? My hunch is something like `causes(cause(travelDistance), effect(fuelConsumption), a(0.5), b(1.23)).` What seems to be missing here, however, is code stating that the association is specifically between the ith value of X and the ith value of Y (a car's travel distance and that same car's fuel consumption).
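One possible way to make the pairing explicit is to store each car as a single fact, so that a distance and a fuel value only ever meet through the same car ID. The sketch below is only an illustration: `observation/3`, `model/2`, `predicted_fuel/2`, `fits_model/1`, the car IDs and the data values are all hypothetical names and numbers, and the coefficients are the 0.5 and 1.23 from the example term.

```prolog
% observation/3: one fact per car, so each distance is tied to that
% same car's fuel consumption (IDs and numbers are made up).
observation(car1, 10, 12.8).
observation(car2, 20, 25.1).
observation(car3, 40, 49.7).

% model/2: intercept a and slope b, taken from the example term.
model(0.5, 1.23).

% predicted_fuel(+Distance, -Fuel): Fuel = a + b * Distance.
predicted_fuel(Distance, Fuel) :-
    model(A, B),
    Fuel is A + B * Distance.

% fits_model(?Car): the car's observed fuel consumption equals the
% prediction for its own distance, within a small tolerance.
fits_model(Car) :-
    observation(Car, Distance, Observed),
    predicted_fuel(Distance, Predicted),
    abs(Observed - Predicted) < 1.0e-6.
```

A query such as `?- forall(observation(Car, _, _), fits_model(Car)).` would then check the relation across the whole sample, one car at a time.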

Any ideas? Thanks in advance!

/JC

  • IMHO you should change the title. Your title caught my interest, but it is too generic. When I read it I was thinking your answer might be done with a NoSQL database or [graph database](https://en.wikipedia.org/wiki/Graph_database) such as Neo4j, but after reading the question I think it would be better done as constraint satisfaction, e.g. [Constraint Logic Programming](http://www.swi-prolog.org/pldoc/man?section=clp). However, since I am not an expert in CLP, I can't answer. – Guy Coder Jul 12 '18 at 12:14
  • It's a little unclear to me, as I'm not sure what a query looks like in the system you are envisioning. But based upon the little bit you've described, you might have `causes(cause(travelDistance), effect(fuelConsumption), a(X), b(Y)) :- Y #= a + b*X.` I'm assuming `a` and `b` are actual numeric values here? It's not clear what these mean in the context of your problem, or whether they are variable or constant. Do they relate to the `a/1` and `b/1` in your compound term? You wouldn't really need the `cause/1` and `effect/1` terms, since their meaning is positionally determined in the arg list. – lurker Jul 12 '18 at 12:38
  • @GuyCoder i updated the title. –  Jul 12 '18 at 17:15
  • @lurker. Thanks for your suggestions. Correct, a and b are constant numerical values. Essentially a describes where the line crosses the Y axis and b describes the slope of the line. Based on your suggestions i think `causes(Distance, FuelConsumption):- FuelConsumption is 0.5 + 1.23 * Distance.` would be an accurate representation. If i understand it correctly #= is a way to specify a constraint; i'm not familiar with CLP but i'll look into it :-) –  Jul 12 '18 at 17:21
  • Yes, that is correct. `#=` uses CLP(FD), which is a way of providing numeric constraints for integers. So what I offered first wouldn't work as-is. You would need to "scale" to integers if that option is available to you (*e.g.*, use 50 and 123, then divide by 100 at the end). – lurker Jul 12 '18 at 17:28
  • I'm confused about what the question is asking. Are you simply asking how to represent the predicate `Y is a linear model of X with parameters a and b`? Wouldn't this simply be something like `islinearmodel(Y, X, A, B) :- Out is A + B * X, Y = Out.` ? – Tasos Papastylianou Jul 12 '18 at 17:58
  • @TasosPapastylianou yes a linear model but with the restriction that it is a causal relation (vs correlation). –  Jul 12 '18 at 18:40
  • I'm still confused. Do you want such a predicate to confirm the link between two variables for individual observations? Or do you want to pass a whole dataset and assert whether X causes Y (in all cases)? In the latter instance, I would imagine you could go down the line of proving this by demonstrating that they always correlate in one direction, but not necessarily in the reverse. Or, different still, do you simply wish to 'define' such a model as causal within the predicate, as a simple flag? – Tasos Papastylianou Jul 12 '18 at 18:59
  • @TasosPapastylianou that the causal relation exists on the dataset as a whole, i.e. between variables (which in turn consist of individual cases). I.e. if C and E are causally related, lower values of C should go together with lower values of E, and higher values of C should go together with higher values of E. So the predicate should describe a linear (causal) relation where the line consists of individual observations. Is there a better way to express this than what we came up with? –  Jul 12 '18 at 20:44
  • Cool problem! Shouldn't you be able to reduce every point on the line to a ratio that never changes under this scenario? If it matches the ratio, it's a point on the line. If any random sample of a potentially infinite number of points all match, then there must be some sort of relation? Maybe I'm oversimplifying this. – G_V Jul 19 '18 at 12:08
  • @G_V Yes, but with unstandardized values you also need an intercept... –  Jul 19 '18 at 17:20
  • @JCR - So basically a range of tolerable deviations from the ratio in order to prove statistical significance despite the imperfection of observational data? At 0 km traveled you'd have say 1:10 fuel ratio so for 0 km traveled you'd expect 0*10 fuel used. In prolog you can create paths to take to handle the reverse too. You can use a `var(Var)` check to see which ones are filled in order to calculate the other as output. `rule(X,Y) :- var(X), nonvar(Y), X is Y/YRatio` and you do this for every situation possible that can be calculated. This allows prolog to find any value for any given X or Y. – G_V Jul 20 '18 at 08:35
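The integer scaling suggested in the comments (use 50 and 123, then divide by 100 at the end) could be sketched roughly as follows. The predicate names `fuel_scaled/2` and `fuel/2` are hypothetical, and the sketch assumes that distances are integers, since CLP(FD) only reasons over integers.

```prolog
:- use_module(library(clpfd)).

% fuel_scaled(?Distance, ?FuelTimes100): integer version of
% Fuel = 0.5 + 1.23 * Distance with both coefficients multiplied by 100,
% so FuelTimes100 is the fuel consumption in hundredths of a unit.
fuel_scaled(Distance, FuelTimes100) :-
    FuelTimes100 #= 50 + 123 * Distance.

% fuel(+Distance, -Fuel): scale back to the original units at the end.
fuel(Distance, Fuel) :-
    fuel_scaled(Distance, FuelTimes100),
    Fuel is FuelTimes100 / 100.
```

Because `fuel_scaled/2` is a constraint rather than an `is/2` evaluation, it can also be queried with only the fuel value known, which avoids writing separate `var/1` and `nonvar/1` clauses for each direction.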

2 Answers


Forgive the fact that I'm answering, only to use a more appropriate format than comments, though this may not be the answer you're looking for at this point.

Unless I have misunderstood your question, I think the problem you describe here is ill-defined / ill-described. My understanding of it is that you have a dataset of X and Y, which happen to follow a linear relationship, and you want to either 'infer' that X causes Y in the absence of any other information, or simply have a way to state that this is the case via a predicate. The problem is that a correlated dataset can never give you that information by itself.

If you want to establish causality from a dataset, you first need to describe what type of causality you're after and how it could be asserted and investigated. A dataset on its own can never tell you anything about causality if you don't know the ordering of events, or how alternatives behave.

I'm sure there are many models of causality out there, but I have only come across two used meaningfully in practice: the chronological model and the counterfactual model.

In the chronological model, if you are able to establish 'when' an event happens, then you can infer causality via a very simple "and X comes before Y" rule. E.g. if "X = travel" is deemed to take place before "Y = fuel-measurement", then you can establish causality using predicate logic, by showing that:

  • Whenever travel precedes fuel-measurement, the relationship is always necessarily linear
  • When fuel-measurement precedes travel, the relationship is not necessarily linear (because if it were, you would be back to only being able to establish correlation rather than causality)
  • The closed-world assumption applies (i.e. there is nothing else that contributes to fuel consumption in the absence of travel)

In the counterfactual model, you don't have any information about the chronology of the events, but what you do have is information on alternative events. Therefore causality of "X causes Y" is established by its counterfactual, i.e. if you can show that "Had X not happened, Y would not have happened either" (or equivalently ¬X implies ¬Y).

A complicating factor in the counterfactual model is that it allows for the concept of 'responsibility', i.e. if both X and ¬X can result in Y, then they are both said to be potential causes for Y. However, in the context of a dataset you can probably get around this by saying "if for ALL events X, the outcome is Y, whereas it is not necessarily true that for ALL events ¬X the outcome is Y, then we can infer that X causes Y". So, in your specific example, you could set up a world such that:

  • Fuel consumption can only occur from either a 'travel' event or an alternative hypothesis which constitutes the non-travel event and is mutually exclusive with it, e.g. 'siphoning'
  • Both the travel 'event' and the siphoning 'event' result in a physical measurement, e.g. distance traveled (which, in our trivial example, would probably just be zero for the siphoning event)
  • In your dataset you have information on 'both' what event occurred (e.g. travel or siphoning) and the fuel consumption and distance traveled for that instance

You can then establish that 'travelling' as an event 'causes' fuel consumption in a linear model fashion with respect to the distance traveled, by showing that:

  • Whenever you have a 'travel' event, the distance traveled does indeed correspond to fuel consumption according to your linear model
  • Whenever you have a 'siphoning' event, the distance traveled does not 'necessarily' correspond to fuel consumption according to that model.
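The two conditions above could be checked over a small labelled dataset along these lines. This is only a sketch under stated assumptions: `record/3`, `fits/2` and `linear_cause/1` are hypothetical names, the three records are made-up data, and the model `5 + 2 * Distance` is simply borrowed from the other answer on this page.

```prolog
% record(Event, DistanceKm, FuelL): made-up observations, each labelled
% with the event that produced the fuel consumption.
record(travel,    10, 25).
record(travel,    20, 45).
record(siphoning,  0, 30).

% fits(+Distance, +Fuel): Fuel follows the assumed model 5 + 2 * Distance.
fits(Distance, Fuel) :-
    Fuel =:= 5 + 2 * Distance.

% linear_cause(+Event): there is at least one record of Event, every
% record of Event fits the model, and at least one record of some other
% event does not fit it.
linear_cause(Event) :-
    once(record(Event, _, _)),
    forall(record(Event, D, F), fits(D, F)),
    once(( record(Other, D1, F1),
           Other \== Event,
           \+ fits(D1, F1) )).
```

With these records, `?- linear_cause(travel).` succeeds and `?- linear_cause(siphoning).` fails, because the siphoning record does not follow the model.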

Update to address the comment: the question is not one of inferring causality, but how to represent causality under the assumption that causality has already been established in practice. In this case, the above points still apply, since you need to define more clearly which type of causality you are referring to before you can represent it.

For example, if we are talking about events that occur in strict chronological order, chronological causality might look something like this in Prolog (a sketch; the coefficient values are placeholders):

```prolog
%%%%%%%%%%%%%%%%%%
%%% facts database
%%%%%%%%%%%%%%%%%%

% eventtype/1: defines type of event
eventtype(travel).
eventtype(fuel_measurement). % ... etc

% eventtime/2: defines timepoints by index and a record of actual time
eventtime(1, '12:02am').
eventtime(2, '12:03am'). % ... etc

% event/3: event type, time index, related measurement
event(travel,           1, 50). % 50 km
event(fuel_measurement, 2, 5).  % 5 L  ... etc

% coefficients/2: intercept a and slope b of the linear model
% (illustrative values only; a and b were left unspecified)
coefficients(0, 0.1).

%%%%%%%%%%%%%
%%% relations
%%%%%%%%%%%%%

% X immediately precedes Y if Y's time index is X's plus one
immediately_precedes(event(_, Xind, _), event(_, Yind, _)) :-
  Yind =:= Xind + 1.

% Y's measurement matches a + b * X's measurement, up to rounding error
is_linearly_related(event(_, _, Xmeas), event(_, _, Ymeas)) :-
  coefficients(A, B),
  Model is A + B * Xmeas,
  abs(Ymeas - Model) < 1.0e-9.

% Xtype causes Ytype if every Xtype event that is immediately followed
% by a Ytype event is linearly related to it
iscausal(eventtype(Xtype), eventtype(Ytype)) :-
  eventtype(Xtype),
  eventtype(Ytype),
  forall(
    ( event(Xtype, Xtime, Xmeas),
      event(Ytype, Ytime, Ymeas),
      immediately_precedes(event(Xtype, Xtime, Xmeas), event(Ytype, Ytime, Ymeas))
    ),
    is_linearly_related(event(Xtype, Xtime, Xmeas), event(Ytype, Ytime, Ymeas))
  ).
```

Tasos Papastylianou
  • Hi! My question was not how to demonstrate or infer causality from data or experiments; instead, it was about how to represent causal relations in Prolog. The premise is that a causal link between two variables has been shown. –  Jul 13 '18 at 11:46

Based on your suggestions I think this code answers my original question. Thanks!

```prolog
:- use_module(library(clpfd)).

causes(var(name(distance), value(Distance)),
       var(name(fuelConsumption), value(FuelConsumption))) :-
    FuelConsumption #= 5 + 2 * Distance.
```

And a sample query:

```prolog
?- causes(var(name(N), value(V)), var(name(fuelConsumption), value(3))).
```

Which yields `N = distance, V = -1`.
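To tie the ith distance to the ith fuel value across a whole sample, the same relation could be applied pairwise to two lists. The following is a minimal sketch: `pair_holds/2` and `causes_all/2` are hypothetical helper names, and `causes/2` is repeated from the answer so that the block runs on its own.

```prolog
:- use_module(library(clpfd)).
:- use_module(library(apply)).   % maplist/3

% causes/2 as defined in the answer.
causes(var(name(distance), value(Distance)),
       var(name(fuelConsumption), value(FuelConsumption))) :-
    FuelConsumption #= 5 + 2 * Distance.

% pair_holds(?D, ?F): the relation holds for one (distance, fuel) pair.
pair_holds(D, F) :-
    causes(var(name(distance), value(D)),
           var(name(fuelConsumption), value(F))).

% causes_all(?Distances, ?Fuels): the i-th distance is related to the
% i-th fuel value, for every i; both lists must have the same length.
causes_all(Distances, Fuels) :-
    maplist(pair_holds, Distances, Fuels).
```

Since `#=` is a constraint, `causes_all/2` should work in both directions, for example `?- causes_all([0, 1, 2], Fuels).` or `?- causes_all(Distances, [5, 7, 9]).`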