I recently read an __article about P-Values__ that led to some discussion with my Itron colleagues, along with some thoughts on the matter.

In the context of regression models, we use P-Values to evaluate the statistical significance of our X-variables, which are the driver variables (e.g. weather, economics, etc.). The NULL hypothesis is that there is NO relationship between the X-variable and the Y-variable. The P-value is the probability of estimating a relationship at least as strong as the one we see in our data if the NULL hypothesis were true. So, if the P-value is 5%, there is only a 5% chance of getting a coefficient this far from zero when there is really no relationship. By convention, we REJECT the NULL hypothesis when the P-value falls below a chosen threshold such as 5%. If there were truly no relationship and we ran the “experiment” a large number of times, we would wrongly reject the NULL hypothesis about 5% of the time. In layman’s terms, a small P-value means the data would be surprising if the two variables were unrelated. It does not mean there is a 95% chance that the relationship is real.
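To make the 5% threshold concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption for illustration: the data are random numbers, the helper name `slope_p_value` is made up, and the P-value uses a normal approximation to the t distribution (reasonable for 100 observations, but a statistics package would use the exact t distribution). Because X and Y are generated independently, the NULL hypothesis is true by construction, so roughly 5% of the simulated regressions should still come out "significant" at the 5% level.

```python
import math
import random


def slope_p_value(x, y):
    """Approximate two-sided p-value for the slope b in y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals and the standard error of the slope
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err
    # Normal approximation to the t distribution (assumption: n is large)
    return math.erfc(abs(t_stat) / math.sqrt(2))


rng = random.Random(0)
n_obs, n_trials = 100, 1000

p_values = []
for _ in range(n_trials):
    x = [rng.gauss(0, 1) for _ in range(n_obs)]
    y = [rng.gauss(0, 1) for _ in range(n_obs)]  # independent of x: the null is true
    p_values.append(slope_p_value(x, y))

false_positive_rate = sum(p < 0.05 for p in p_values) / n_trials
print(f"Share of regressions with p < 0.05 under the null: {false_positive_rate:.3f}")
```

The printed share should land close to 0.05, which is the rate at which a 5% threshold flags a relationship that does not exist.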

One of the common misconceptions about statistical significance is that it implies causality. I can build an overly simple model in which daily energy is a function of a constant and cooling degree days (CDD):

This model generates the following coefficients and P-Values. The constant and the CDD variable are both highly significant. In fact, they are significant at a level well below 1%: the P-values show as 0.00 even when displayed to two decimal places.

What if I turned my equation around? Instead of energy being a function of CDD, what if I built a model of CDD being a function of energy?

This model is awesome too. Both variables are highly significant. Does this mean higher energy usage makes it hot outside? That’s clearly not the case. Rather, higher temperatures change human behavior and the behavior of cooling equipment, both of which lead to more energy consumption.
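The symmetry is not a coincidence. In a regression with one X-variable, the t-statistic on the slope depends only on the correlation between the two series, so it is the same whichever variable sits on the left-hand side. The sketch below illustrates this with made-up data: the temperatures, the 65-degree CDD base, the energy coefficients, and the helper name `slope_t_and_p` are all assumptions for illustration, and the P-value again uses a normal approximation rather than the exact t distribution.

```python
import math
import random


def slope_t_and_p(x, y):
    """Return (slope, t-statistic, approximate two-sided p-value) for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    t_stat = slope / math.sqrt(sse / (n - 2) / sxx)
    # Normal approximation to the t distribution (assumption: n is large)
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, t_stat, p_value


rng = random.Random(42)

# Made-up data: one year of daily average temperatures (deg F)
temps = [rng.uniform(50, 95) for _ in range(365)]
cdd = [max(0.0, t - 65.0) for t in temps]  # assumed 65-degree base
energy = [1000.0 + 30.0 * c + rng.gauss(0, 100) for c in cdd]

# Energy as a function of CDD, then CDD as a function of energy
slope_fwd, t_fwd, p_fwd = slope_t_and_p(cdd, energy)
slope_rev, t_rev, p_rev = slope_t_and_p(energy, cdd)

print(f"energy ~ CDD: slope={slope_fwd:.3f}  t={t_fwd:.2f}  p={p_fwd:.2f}")
print(f"CDD ~ energy: slope={slope_rev:.4f}  t={t_rev:.2f}  p={p_rev:.2f}")
```

The two slopes differ, but the t-statistics (and therefore the P-values) match. The regression output cannot tell you which direction the cause runs; that has to come from what you know about the system.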

Here is another example. When we develop monthly residential average-use models (kWh per customer), we tend to include a variable for the price of electricity. I can build a very simple model in which average use is a function of a constant and the price:

In this case, the coefficient on the price term is positive, which suggests that higher prices lead to higher energy consumption. That is a clear violation of the laws of supply and demand. What kind of madness is this?!

Again, this is an issue of causality. Remember, the delivery charges for electricity are regulated. Utilities know that people use more electricity in the summer. So, they set their rates higher for the summer months. It is not that higher prices lead to higher usage. On the contrary, electricity consumption is well understood to be seasonal, and the higher summer usage leads utilities to set higher rates for those months. As a side note, this is exactly why we tend to use a 12-month moving average of the price in our models, which captures the overall trend in prices rather than the seasonal pattern.
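As a rough illustration of the moving-average idea, the sketch below smooths a made-up monthly price series. The prices, the size of the summer adder, and the helper name `trailing_moving_average` are assumptions for illustration only, and the choice of a trailing window (the current month plus the 11 before it) is also an assumption, since a centered window is another common option.

```python
def trailing_moving_average(values, window=12):
    """Mean of each value and the (window - 1) values before it.

    The first (window - 1) entries are None because the window is incomplete.
    """
    averages = []
    for i in range(len(values)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(values[i + 1 - window : i + 1]) / window)
    return averages


# Made-up monthly prices ($/kWh) for 4 years: slow upward trend plus a summer adder
prices = []
for month_index in range(48):
    month = month_index % 12 + 1  # 1 = January ... 12 = December
    trend = 0.10 + 0.0005 * month_index
    summer_adder = 0.02 if month in (6, 7, 8, 9) else 0.0
    prices.append(trend + summer_adder)

smoothed = trailing_moving_average(prices)

# Compare the within-year swing of the raw and smoothed series in year 2
raw_swing = max(prices[12:24]) - min(prices[12:24])
smoothed_swing = max(smoothed[12:24]) - min(smoothed[12:24])
print(f"Raw price swing in year 2:      {raw_swing:.4f}")
print(f"Smoothed price swing in year 2: {smoothed_swing:.4f}")
```

Every 12-month window contains each calendar month exactly once, so the summer adder contributes the same amount to every average and drops out of the month-to-month movement. What remains is the underlying trend, which is the part of the price signal the model is meant to pick up.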

We can see from these two examples that the P-value is a useful indicator of a statistically significant correlation, but not necessarily an indicator of causality. You should be careful when inferring causality from your models.

Be sure to tune in for our next blog and attend one of our upcoming workshops or brown bags. Stuart McMenamin’s next brown bag is on May 21, focusing on Improving Financial Analysis with AMI. Visit our workshop page to register today at **www.itron.com/forecastingworkshops**.