from IPython.display import Image

# Display the forking paths animation
Image(open("forking_paths.gif", 'rb').read())
Let's first load our necessary dependencies: numpy for creating some random variables, pandas for data wrangling, statsmodels for building some basic OLS models, and the all-important specification_curve library:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from specification_curve import specification_curve as specy
To make sure everything is reproducible, let's fix the number of samples and set a random seed:
n_samples = 300
np.random.seed(1342)
Let's now define some random variables. The first three are random floats drawn uniformly from the half-open interval [0.0, 1.0):
x_1 = np.random.random(size=n_samples)
x_2 = np.random.random(size=n_samples)
x_3 = np.random.random(size=n_samples)
and the next two are binary variables, each taking the value 0 or 1:
x_4 = np.random.randint(2, size=n_samples)
x_5 = np.random.randint(2, size=n_samples)
Let's define our y variable as a simple linear function of these, plus some Gaussian noise:
# note that x_4 appears twice, so its effective coefficient is 0.6 + 0.9 = 1.5
y = (0.5*x_1 + 0.1*x_2 + 0.5*x_3 + x_4*0.6 + x_4*0.9 + x_5*0.4
     + 3*np.random.randn(n_samples))
Let's turn these into a DataFrame, where the first list holds our variables and the second the corresponding column names:
df = pd.DataFrame([x_1, x_2, x_3, x_4, x_5, y],
                  ['x_1', 'x_2', 'x_3', 'x_4', 'x_5', 'y']).T
And then set the latter two variables (x_4 and x_5) as categorical pandas columns:
df['x_4'] = df['x_4'].astype('category')
df['x_5'] = df['x_5'].astype('category')
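As a quick sanity check (my addition, not part of the original walkthrough), we can confirm the dtypes came through as expected:

# x_1 to x_3 and y should be float64; x_4 and x_5 should now be category
print(df.dtypes)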
If we ran a simple OLS model with just x_1, x_2, and x_3, we might conclude that our x_1 variable is statistically significant!
X = df[['x_1', 'x_2', 'x_3']]
# no constant is added here, which is why statsmodels reports uncentered R-squared
ols_reg1 = sm.OLS(df[['y']], X.astype(float)).fit()
ols_reg1.summary()
| Dep. Variable: | y | R-squared (uncentered): | 0.210 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.202 |
| Method: | Least Squares | F-statistic: | 26.30 |
| Date: | Wed, 16 Jun 2021 | Prob (F-statistic): | 4.09e-15 |
| Time: | 16:04:38 | Log-Likelihood: | -749.34 |
| No. Observations: | 300 | AIC: | 1505. |
| Df Residuals: | 297 | BIC: | 1516. |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x_1 | 1.3314 | 0.494 | 2.694 | 0.007 | 0.359 | 2.304 |
| x_2 | 1.6684 | 0.467 | 3.573 | 0.000 | 0.750 | 2.587 |
| x_3 | -0.2838 | 0.491 | -0.578 | 0.564 | -1.250 | 0.683 |

| Omnibus: | 1.342 | Durbin-Watson: | 2.040 |
|---|---|---|---|
| Prob(Omnibus): | 0.511 | Jarque-Bera (JB): | 1.079 |
| Skew: | -0.124 | Prob(JB): | 0.583 |
| Kurtosis: | 3.157 | Cond. No. | 3.19 |
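Rather than eyeballing the summary table, we can pull the p-values straight out of the fitted results (a small aside of mine, using standard statsmodels attributes):

# p-values for each regressor, indexed by column name
print(ols_reg1.pvalues)
# x_1 falls below the conventional 0.05 threshold in this run
print(ols_reg1.pvalues['x_1'] < 0.05)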
But what about when we include x_4 and x_5?
X = df[['x_1', 'x_2', 'x_3', 'x_4', 'x_5']]
# astype(float) converts the categorical columns back to numeric for OLS
ols_reg2 = sm.OLS(df[['y']], X.astype(float)).fit()
ols_reg2.summary()
| Dep. Variable: | y | R-squared (uncentered): | 0.262 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared (uncentered): | 0.249 |
| Method: | Least Squares | F-statistic: | 20.92 |
| Date: | Wed, 16 Jun 2021 | Prob (F-statistic): | 6.83e-18 |
| Time: | 16:04:38 | Log-Likelihood: | -739.16 |
| No. Observations: | 300 | AIC: | 1488. |
| Df Residuals: | 295 | BIC: | 1507. |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| x_1 | 0.5475 | 0.511 | 1.071 | 0.285 | -0.459 | 1.554 |
| x_2 | 1.1525 | 0.473 | 2.438 | 0.015 | 0.222 | 2.083 |
| x_3 | -0.7013 | 0.489 | -1.434 | 0.153 | -1.664 | 0.261 |
| x_4 | 1.3581 | 0.318 | 4.269 | 0.000 | 0.732 | 1.984 |
| x_5 | 0.4174 | 0.321 | 1.299 | 0.195 | -0.215 | 1.050 |

| Omnibus: | 1.054 | Durbin-Watson: | 1.996 |
|---|---|---|---|
| Prob(Omnibus): | 0.590 | Jarque-Bera (JB): | 0.796 |
| Skew: | -0.019 | Prob(JB): | 0.672 |
| Kurtosis: | 3.249 | Cond. No. | 4.27 |
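To make the contrast explicit, here is a quick side-by-side of x_1's estimate under the two specifications (an illustrative aside, built only from the two fitted models above):

# x_1's coefficient and p-value under each specification
comparison = pd.DataFrame({
    'coef on x_1': [ols_reg1.params['x_1'], ols_reg2.params['x_1']],
    'p-value': [ols_reg1.pvalues['x_1'], ols_reg2.pvalues['x_1']]},
    index=['x_1, x_2, x_3 only', 'all five regressors'])
print(comparison)

The "significant" coefficient on x_1 shrinks from about 1.33 to 0.55, and loses significance, once the binary variables enter the model.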
Introducing Specification Curve Analysis! It's a great way to visualise the results from every combination of variables, and it can be expanded to cover other choices too, such as deterministic variables and transformation choices.
# 'y' is the outcome, 'x_1' the variable of interest, and the list gives the
# possible controls; cat_expand tells the package to treat x_4 and x_5 as categorical
sc = specy.SpecificationCurve(df, 'y', 'x_1', ['x_2', 'x_3', 'x_4', 'x_5'],
                              cat_expand=['x_4', 'x_5'])
sc.fit()
sc.plot()
Fit complete
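To build some intuition for what the library automates, here's a minimal hand-rolled version of the same idea (my own sketch, not the library's internals): loop over every subset of the controls, refit the no-constant OLS from above, and collect the coefficient and p-value on x_1.

from itertools import combinations

controls = ['x_2', 'x_3', 'x_4', 'x_5']
specs = []
# every subset of the controls, from none (k=0) to all four (k=4)
for k in range(len(controls) + 1):
    for subset in combinations(controls, k):
        X = df[['x_1'] + list(subset)].astype(float)
        res = sm.OLS(df['y'], X).fit()
        specs.append({'controls': subset,
                      'coef_x_1': res.params['x_1'],
                      'pvalue_x_1': res.pvalues['x_1']})
spec_df = pd.DataFrame(specs).sort_values('coef_x_1')
print(spec_df)

That gives sixteen specifications; the library also handles the categorical expansion for us and plots the sorted estimates alongside the controls included in each, making it obvious how sensitive x_1's estimate is to the choice of specification.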