A Quick Simulated Introduction to the specification_curve Library

Prepared for 'Sustainable Approaches to Biomedical Science: Responsible and Reproducible Research'

21st June 2021

Let's first load our dependencies: numpy for creating some random variables, pandas for data wrangling, statsmodels for building some basic OLS models, and the all-important specification_curve library.
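As a hedged sketch of that step, the snippet below first checks that the four packages can be found and only then imports them. The aliases np, pd and sm are common conventions rather than anything this notebook prescribes, and the install hint assumes the PyPI package names match the import names.

```python
import importlib.util

# Packages used in this walkthrough.
required = ["numpy", "pandas", "statsmodels", "specification_curve"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages, try: pip install " + " ".join(missing))
else:
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import specification_curve
```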

To make sure everything is reproducible, let's fix the number of samples and a random seed.
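A minimal sketch of this step, assuming numpy's Generator interface is used for the random draws. The sample size and seed below are placeholder values chosen for illustration, not the ones from the original session.

```python
import numpy as np

N_SAMPLES = 500  # assumed sample size, not taken from the original notebook
SEED = 2021      # assumed seed; any fixed integer gives repeatable draws

# A seeded Generator produces the same stream of numbers on every run.
rng = np.random.default_rng(SEED)
```

Re-running the cell with the same seed reproduces the same simulated data, which is what makes the later regression results repeatable.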

Let's now define some random variables. The first three are all random floats in the half-open interval [0.0, 1.0).
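One way to draw these three variables, sketched with a seeded numpy Generator. The names x_1 to x_3 follow the text; the sample size and seed are assumptions.

```python
import numpy as np

n_samples = 500                    # assumed sample size
rng = np.random.default_rng(2021)  # assumed seed

# Generator.random draws uniform floats from the half-open interval [0.0, 1.0).
x_1 = rng.random(n_samples)
x_2 = rng.random(n_samples)
x_3 = rng.random(n_samples)
```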

The next two are binary (dichotomous) variables.
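A sketch of the two binary variables, assuming each value is 0 or 1 with equal probability (the text does not say how they are distributed). The sample size and seed are again placeholders.

```python
import numpy as np

n_samples = 500                    # assumed sample size
rng = np.random.default_rng(2021)  # assumed seed

# integers(0, 2) excludes the upper bound, so every draw is 0 or 1.
x_4 = rng.integers(0, 2, size=n_samples)
x_5 = rng.integers(0, 2, size=n_samples)
```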

Let's define our y-variable to be a simple function of these, plus a degree of noise.
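As an illustration, y could be a weighted sum of the five variables plus Gaussian noise. The coefficients and the noise scale below are made up for this sketch; the original values are not given in the text.

```python
import numpy as np

# Data setup repeated so the block runs on its own (all values assumed).
n_samples = 500
rng = np.random.default_rng(2021)
x_1, x_2, x_3 = (rng.random(n_samples) for _ in range(3))
x_4, x_5 = (rng.integers(0, 2, size=n_samples) for _ in range(2))

# Hypothetical coefficients; the noise term is deliberately large
# relative to the signal so that estimates vary across models.
noise = 2 * rng.normal(size=n_samples)
y = 0.8 * x_1 + 0.1 * x_2 + 0.5 * x_3 + 0.6 * x_4 + 0.3 * x_5 + noise
```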

Turn these into a dataframe, where the first list holds our variables and the second holds their names.
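A sketch of that construction: passing the list of arrays with the names as the index gives one row per variable, so the frame is transposed to get one column per variable. The simulated data uses the same assumed values as before.

```python
import numpy as np
import pandas as pd

# Data setup repeated so the block runs on its own (all values assumed).
n_samples = 500
rng = np.random.default_rng(2021)
x_1, x_2, x_3 = (rng.random(n_samples) for _ in range(3))
x_4, x_5 = (rng.integers(0, 2, size=n_samples) for _ in range(2))
y = (0.8 * x_1 + 0.1 * x_2 + 0.5 * x_3 + 0.6 * x_4 + 0.3 * x_5
     + 2 * rng.normal(size=n_samples))

variables = [x_1, x_2, x_3, x_4, x_5, y]
names = ["x_1", "x_2", "x_3", "x_4", "x_5", "y"]

# Each array becomes a row labelled by its name; .T turns rows into columns.
df = pd.DataFrame(variables, index=names).T
print(df.head())
```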

And then set the last two x-variables (x_4 and x_5) as categorical pandas columns.
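The conversion itself can be sketched with DataFrame.astype. A small stand-in frame is used here so the block runs on its own; only the column names x_4 and x_5 come from the text.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with the same column names (values are assumed).
rng = np.random.default_rng(2021)
df = pd.DataFrame({
    "x_1": rng.random(10),
    "x_4": rng.integers(0, 2, size=10),
    "x_5": rng.integers(0, 2, size=10),
})

# Mark the two binary columns as categorical; other columns are untouched.
df = df.astype({"x_4": "category", "x_5": "category"})
print(df.dtypes)
```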

If we fit a simple OLS model of y on x_1, x_2 and x_3, we might conclude that our x_1 variable is statistically significant!
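A sketch of that first model using the statsmodels formula interface, which is one of several ways to fit OLS and may not be the one used originally. Because the data are simulated with assumed values, whether x_1 comes out significant depends on the seed and coefficients chosen.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Data setup repeated so the block runs on its own (all values assumed).
n_samples = 500
rng = np.random.default_rng(2021)
x_1, x_2, x_3 = (rng.random(n_samples) for _ in range(3))
x_4, x_5 = (rng.integers(0, 2, size=n_samples) for _ in range(2))
y = (0.8 * x_1 + 0.1 * x_2 + 0.5 * x_3 + 0.6 * x_4 + 0.3 * x_5
     + 2 * rng.normal(size=n_samples))
names = ["x_1", "x_2", "x_3", "x_4", "x_5", "y"]
df = pd.DataFrame([x_1, x_2, x_3, x_4, x_5, y], index=names).T

# Regress y on the three continuous variables only.
small_model = smf.ols("y ~ x_1 + x_2 + x_3", data=df).fit()
print(small_model.summary())
print("p-value for x_1:", small_model.pvalues["x_1"])
```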

But what about when we include x_4 and x_5?
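The larger model can be sketched the same way. With x_4 and x_5 stored as categorical columns, the formula interface expands each into a dummy variable. As before, the data values are assumptions, so the comparison with the smaller model is illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Data setup repeated so the block runs on its own (all values assumed).
n_samples = 500
rng = np.random.default_rng(2021)
x_1, x_2, x_3 = (rng.random(n_samples) for _ in range(3))
x_4, x_5 = (rng.integers(0, 2, size=n_samples) for _ in range(2))
y = (0.8 * x_1 + 0.1 * x_2 + 0.5 * x_3 + 0.6 * x_4 + 0.3 * x_5
     + 2 * rng.normal(size=n_samples))
names = ["x_1", "x_2", "x_3", "x_4", "x_5", "y"]
df = pd.DataFrame([x_1, x_2, x_3, x_4, x_5, y], index=names).T
df = df.astype({"x_4": "category", "x_5": "category"})

# Same regression, now with the two categorical variables included.
full_model = smf.ols("y ~ x_1 + x_2 + x_3 + x_4 + x_5", data=df).fit()
print(full_model.summary())
print("p-value for x_1:", full_model.pvalues["x_1"])
```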

Introducing Specification Curve Analysis! It is a great way to visualise how an estimate changes across all combinations of variables. It can also be expanded to include things like deterministic variables and transformation choices!
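To show the idea behind the curve, here is a hand-rolled sketch that fits one OLS model per combination of control variables and records the x_1 coefficient each time. It is a stand-in written with itertools and statsmodels, not the specification_curve library's own code. The library automates this and draws the plot for you; its README documents a SpecificationCurve class with fit() and plot() methods, though the import path may differ between releases, so check the version you have installed. All data values are assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Data setup repeated so the block runs on its own (all values assumed).
n_samples = 500
rng = np.random.default_rng(2021)
x_1, x_2, x_3 = (rng.random(n_samples) for _ in range(3))
x_4, x_5 = (rng.integers(0, 2, size=n_samples) for _ in range(2))
y = (0.8 * x_1 + 0.1 * x_2 + 0.5 * x_3 + 0.6 * x_4 + 0.3 * x_5
     + 2 * rng.normal(size=n_samples))
names = ["x_1", "x_2", "x_3", "x_4", "x_5", "y"]
df = pd.DataFrame([x_1, x_2, x_3, x_4, x_5, y], index=names).T
df = df.astype({"x_4": "category", "x_5": "category"})

controls = ["x_2", "x_3", "x_4", "x_5"]
rows = []
# Every subset of the controls (including none) is one specification.
for k in range(len(controls) + 1):
    for subset in combinations(controls, k):
        formula = "y ~ " + " + ".join(("x_1",) + subset)
        fit = smf.ols(formula, data=df).fit()
        ci_low, ci_high = fit.conf_int().loc["x_1"]
        rows.append({
            "controls": ", ".join(subset) or "(none)",
            "coef": fit.params["x_1"],
            "ci_low": ci_low,
            "ci_high": ci_high,
            "p_value": fit.pvalues["x_1"],
        })

# Sorting by coefficient gives the ordering used on a specification curve.
curve = pd.DataFrame(rows).sort_values("coef").reset_index(drop=True)
print(curve)
```

Four optional controls give 2^4 = 16 specifications. Plotting coef with its confidence interval against the row index would give the familiar curve, with the controls column showing which choices sit behind each point.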