20 Statsmodels concepts with Before-and-After Examples
Sure! Let's reformat the statsmodels examples to follow the new style you provided. Here are the examples from statsmodels, now with the before and after real samples included:
1. Linear Regression (OLS) 📈
Boilerplate Code:
import statsmodels.api as sm
X = sm.add_constant(X) # Add constant to X variables
model = sm.OLS(y, X)
results = model.fit()
Use Case: Perform a basic linear regression using OLS (Ordinary Least Squares).
Goal: Analyze the relationship between a dependent variable and independent variables. 🎯
Sample Code:
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
Before Example:
You manually try to figure out how the variables relate without a structured model. 🤔
y = [3, 4, 5, 6]
X = [[1], [2], [3], [4]]
# No real model in place.
After Example:
With OLS, you have a fitted linear model and can interpret relationships. 📈
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.960
Model: OLS Adj. R-squared: 0.940
Method: Least Squares F-statistic: 48.00
==============================================================================
Challenge: 🌟 Fit an OLS model with multiple predictor variables and try interpreting the p-values and coefficients.
2. Logistic Regression 🔄
Boilerplate Code:
import statsmodels.api as sm
model = sm.Logit(y, X)
results = model.fit()
Use Case: Perform logistic regression for binary classification tasks.
Goal: Predict binary outcomes (e.g., success/failure) using logistic regression. 🎯
Sample Code:
model = sm.Logit(y, X)
results = model.fit()
print(results.summary())
Before Example:
You manually assign probabilities without understanding how they relate to the data. 😕
y = [1, 0, 1, 0]
X = [[3], [2], [4], [1]]
After Example:
With Logit, you get coefficients and probabilities for binary outcomes. 🔄
Logit Regression Results
==============================================================================
Dep. Variable: y Pseudo R-sq: 0.624
Model: Logit LL-Null: -23.456
Method: MLE LL-Cur: -9.125
==============================================================================
Challenge: 🌟 Perform logistic regression on a customer churn dataset to predict customer retention.
3. Time Series Analysis (ARIMA) 🕒
Boilerplate Code:
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
Use Case: Model and forecast time series data using ARIMA (AutoRegressive Integrated Moving Average).
Goal: Use ARIMA to make accurate forecasts of future data points. 🎯
Sample Code:
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())
Before Example:
You analyze raw data without considering trends or patterns over time. 🤔
y = [10, 12, 14, 16, 15, 20]
# Manually forecasting next values without capturing seasonality or trends.
After Example:
With ARIMA, you get fitted models and forecasts that capture temporal dependencies. 🕒
ARIMA Model Results
==============================================================================
Dep. Variable: y No. Observations: 6
Model: ARIMA(1, 1, 1) Log Likelihood -10.012
AIC 26.024
BIC 26.946
==============================================================================
Challenge: 🌟 Fit an ARIMA model to predict monthly sales data and compare forecasts with actual results.
4. Autoregressive (AR) Model 🔄
Boilerplate Code:
from statsmodels.tsa.ar_model import AutoReg
model = AutoReg(y, lags=2)
results = model.fit()
Use Case: Model time series data using an autoregressive model, which predicts values based on previous observations.
Goal: Capture lagged relationships in time series data with AR. 🎯
Sample Code:
model = AutoReg(y, lags=2)
results = model.fit()
print(results.summary())
Before Example:
Manually trying to predict values without considering the past observations. 😕
y = [100, 120, 130, 125]
# No lagged model to capture autocorrelation.
After Example:
AR model captures autocorrelations, leading to more accurate forecasts. 🔄
AutoReg Model Results
==============================================================================
Dep. Variable: y No. Observations: 4
Model: AutoReg(2) Log Likelihood -10.234
AIC 24.469
BIC 26.893
==============================================================================
Challenge: 🌟 Fit an AR model to monthly temperature data and forecast future values based on lagged observations.
5. Poisson Regression 🧮
Boilerplate Code:
import statsmodels.api as sm
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
Use Case: Model count data where the dependent variable represents counts (e.g., number of events).
Goal: Use Poisson regression to predict the frequency of events. 🎯
Sample Code:
model = sm.GLM(y, X, family=sm.families.Poisson())
results = model.fit()
print(results.summary())
Before Example:
Predicting counts (e.g., website clicks) without considering the distribution of count data. 🤔
y = [5, 7, 9, 12]
X = [[1], [2], [3], [4]]
After Example:
Poisson regression models count data with non-negative integer outputs. 🧮
Poisson Regression Results
==============================================================================
Dep. Variable: y No. Observations: 4
Model: GLM Df Residuals: 2
Model Family: Poisson Df Model: 1
==============================================================================
Challenge: 🌟 Apply Poisson regression to model the number of daily website visits based on ad spending.
6. ANOVA (Analysis of Variance) 🎲
Boilerplate Code:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)
Use Case: Test for significant differences between group means.
Goal: Use ANOVA to compare the mean of two or more groups. 🎯
Sample Code:
model = ols('y ~ C(group)', data=df).fit()
anova_results = anova_lm(model)
print(anova_results)
Before Example:
You try to compare means of groups manually without statistical testing. 😬
group1 = [3, 4, 5]
group2 = [7, 8, 9]
After Example:
ANOVA provides statistical significance between group means. 🎲
ANOVA Table
=========================================================================
df sum_sq mean_sq F PR(>F)
group 1 102.667 102.667 19.35413 0.003
Challenge: 🌟 Run an ANOVA test on test scores across different teaching methods and interpret the significance of results.
7. Durbin-Watson Test (Autocorrelation) 🔍
Boilerplate Code:
from statsmodels.stats.stattools import durbin_watson
dw_stat = durbin_watson(results.resid)
Use Case: Test for autocorrelation in the residuals of a regression model.
Goal: Ensure residuals are independent and not autocorrelated using the Durbin-Watson Test. 🎯
Sample Code:
dw_stat = durbin_watson(results.resid)
print(f'Durbin-Watson statistic: {dw_stat}')
Before Example:
Residuals exhibit autocorrelation, leading to biased results. 😬
Residuals: [0.1, 0.2, 0.15, 0.3, 0.25] # Showing some autocorrelation.
After Example:
Durbin-Watson test helps ensure independent residuals. 🔍
Durbin-Watson statistic: 2.12 # A value near 2 indicates no autocorrelation.
Challenge: 🌟 Apply the Durbin-Watson test on a regression model’s residuals to detect autocorrelation and assess the model’s assumptions.
8. Ljung-Box Test (Autocorrelation in Residuals) 🔎
Boilerplate Code:
from statsmodels.stats.diagnostic import acorr_ljungbox
ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
Use Case: Test whether residuals of a time series model are autocorrelated.
Goal: Use the Ljung-Box Test to determine if residuals are independent and not autocorrelated over time. 🎯
Sample Code:
ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
print(ljung_box_results)
Before Example:
Residuals in a time series model show patterns over time. 😬
Residuals: [0.3, 0.25, 0.35, 0.3, 0.4] # Residuals follow some trend.
After Example:
The Ljung-Box Test detects whether the residuals are independent. 🔎
Ljung-Box p-value: 0.08 # A p-value close to 1 indicates no autocorrelation.
Challenge: 🌟 Use the Ljung-Box test on ARIMA model residuals to check for autocorrelation and validate model assumptions.
9. Quantile Regression 📊
Boilerplate Code:
import statsmodels.api as sm
model = sm.QuantReg(y, X)
results = model.fit(q=0.5) # Median regression
Use Case: Perform quantile regression to model relationships between variables at different quantiles (e.g., median, 90th percentile).
Goal: Use Quantile Regression to understand how predictors affect different parts of the outcome distribution. 🎯
Sample Code:
model = sm.QuantReg(y, X)
results = model.fit(q=0.5) # Median regression
print(results.summary())
Before Example:
Modeling only the mean of a distribution might miss insights on the tails. 😕
Mean regression results: Coefficient = 2.0 # Captures only central tendency.
After Example:
Quantile regression provides insights into different parts of the distribution (e.g., median, 90th percentile). 📊
Quantile regression (median): Coefficient = 1.5 # Different insights at different quantiles.
Challenge: 🌟 Apply quantile regression on income data to understand disparities at the 10th, 50th, and 90th percentiles.
10. Autoregressive Integrated Moving Average (ARIMA) 🕒
Boilerplate Code:
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
Use Case: Model and forecast time series data with trends, seasonality, and noise using ARIMA.
Goal: Use ARIMA to forecast time series data that accounts for autocorrelation, trends, and seasonality. 🎯
Sample Code:
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())
Before Example:
Manually trying to forecast without a structured model for time series. 😩
Future values based on simple extrapolation, ignoring seasonality.
After Example:
With ARIMA, you model the trend, seasonality, and noise in time series data for accurate forecasts. 🕒
ARIMA Results: AIC = 250.0, BIC = 260.0, Coefficients estimated for AR, I, and MA terms.
Challenge: 🌟 Fit an ARIMA model on monthly sales data and use it to forecast the next 12 months.
11. Pooled OLS (Panel Data) 📊
Boilerplate Code:
from linearmodels import PooledOLS
pooled_model = PooledOLS(y, X).fit()
Use Case: Fit a pooled OLS model to analyze panel data, where observations are repeated over time across entities (e.g., multiple companies, countries).
Goal: Use Pooled OLS when time-invariant variables are present and you want to estimate the same coefficients for all entities. 🎯
Sample Code:
pooled_model = PooledOLS(y, X).fit()
print(pooled_model.summary())
Before Example:
You have panel data but aren’t considering the time dimension, treating the data as cross-sectional. 🤔
Only analyzing data without accounting for the multiple periods or entities.
After Example:
With Pooled OLS, you account for repeated observations, fitting the model to panel data. 📊
PooledOLS Model Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: PooledOLS Log Likelihood -234.567
AIC 492.34
==============================================================================
Challenge: 🌟 Try using Pooled OLS on a dataset with multiple companies' financial data over 5 years and interpret the output.
12. Fixed Effects Model (Panel Data) 🔐
Boilerplate Code:
from linearmodels import PanelOLS
fixed_effects_model = PanelOLS(y, X, entity_effects=True).fit()
Use Case: Control for time-invariant differences across entities (like companies or countries) in panel data.
Goal: Use Fixed Effects to eliminate the influence of unobserved, time-invariant characteristics that differ across entities. 🎯
Sample Code:
fixed_effects_model = PanelOLS(y, X, entity_effects=True).fit()
print(fixed_effects_model.summary())
Before Example:
Ignoring unobserved heterogeneity across entities, leading to biased results. 😕
Same coefficients applied across all entities, ignoring individual differences.
After Example:
Fixed Effects account for these time-invariant characteristics, offering a more refined model. 🔐
Fixed Effects Model Results: Coefficients specific to each entity, R-squared improved.
Challenge: 🌟 Compare Fixed Effects to Pooled OLS on a dataset with company profits over time to see the effect of controlling for entity-specific variables.
13. Random Effects Model (Panel Data) 🎯
Boilerplate Code:
from linearmodels import RandomEffects
random_effects_model = RandomEffects(y, X).fit()
Use Case: Analyze panel data where unobserved effects are random and assumed to be uncorrelated with the independent variables.
Goal: Use Random Effects when you believe that differences across entities have a random component and are uncorrelated with your predictors. 🎯
Sample Code:
random_effects_model = RandomEffects(y, X).fit()
print(random_effects_model.summary())
Before Example:
Fitting a simple pooled OLS model and assuming no random variation across entities. 🤔
Unaccounted for random variations across entities.
After Example:
Random Effects model handles random variations across entities, improving model fit. 🎯
RandomEffects Model Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: RandomEffects Log Likelihood -204.123
AIC 412.245
==============================================================================
Challenge: 🌟 Fit both Random Effects and Fixed Effects models to panel data on customer satisfaction surveys and compare the results.
14. Durbin-Wu-Hausman Test (Fixed vs Random Effects) ⚖️
Boilerplate Code:
from linearmodels.panel import compare
results = compare(fixed_effects_model, random_effects_model)
Use Case: Decide between using Fixed Effects or Random Effects for panel data analysis by testing for endogeneity.
Goal: Apply the Durbin-Wu-Hausman Test to check whether to use fixed or random effects based on whether the entity-specific effects are correlated with your independent variables. 🎯
Sample Code:
results = compare(fixed_effects_model, random_effects_model)
print(results)
Before Example:
Uncertainty whether to apply fixed or random effects to panel data. 😩
Analysis proceeds without checking for endogeneity, leading to model bias.
After Example:
The Hausman Test helps choose the correct model by checking for correlations in random effects. ⚖️
Hausman Test Results: p-value > 0.05: Random effects is preferable.
Challenge: 🌟 Perform the Hausman Test on panel data from multiple industries and interpret the p-value to decide between fixed or random effects.
15. Heteroskedasticity Test (Breusch-Pagan Test) 📊
Boilerplate Code:
from statsmodels.stats.diagnostic import het_breuschpagan
bp_test = het_breuschpagan(results.resid, results.model.exog)
Use Case: Test for heteroskedasticity (non-constant variance) in the residuals of a regression model.
Goal: Use the Breusch-Pagan Test to detect whether the variance of the residuals is constant across observations. 🎯
Sample Code:
bp_test = het_breuschpagan(results.resid, results.model.exog)
print(f'Breusch-Pagan p-value: {bp_test[1]}')
Before Example:
You fit a regression model but don't check for heteroskedasticity, leading to inefficient estimates. 😕
Residuals: [0.3, 0.25, 0.9, 0.8, 0.75] # Possible signs of heteroskedasticity.
After Example:
With the Breusch-Pagan test, you confirm whether or not heteroskedasticity is present. 📊
Breusch-Pagan p-value: 0.002 # p-value indicates heteroskedasticity is present.
Challenge: 🌟 Use the Breusch-Pagan Test on a dataset where heteroskedasticity might be present, such as housing prices or income distribution.
16. Ljung-Box Test (Autocorrelation in Residuals) 🔎
Boilerplate Code:
from statsmodels.stats.diagnostic import acorr_ljungbox
ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
Use Case: Test for autocorrelation in the residuals of a time series model.
Goal: Use the Ljung-Box Test to ensure residuals are independent and not autocorrelated. 🎯
Sample Code:
ljung_box_results = acorr_ljungbox(results.resid, lags=[10])
print(ljung_box_results)
Before Example:
Residuals show autocorrelation, causing bias in time series predictions. 😩
Residuals: [0.4, 0.35, 0.45, 0.5, 0.55] # Residuals show a clear trend over time.
After Example:
The Ljung-Box Test confirms if residuals are independent. 🔎
Ljung-Box p-value: 0.15 # No significant autocorrelation in residuals.
Challenge: 🌟 Apply the Ljung-Box Test on ARIMA model residuals to validate independence assumptions in time series analysis.
17. Seasonal Decomposition of Time Series (STL Decomposition) 📅
Boilerplate Code:
from statsmodels.tsa.seasonal import STL
stl = STL(y, seasonal=13)
result = stl.fit()
Use Case: Decompose a time series into its trend, seasonal, and residual components.
Goal: Use STL Decomposition to understand the underlying patterns in a time series, especially for data with strong seasonal components. 🎯
Sample Code:
stl = STL(y, seasonal=13)
result = stl.fit()
result.plot()
Before Example:
You struggle to separate seasonal patterns from the underlying trend. 😕
Time series plot shows seasonal fluctuation, but hard to identify the pattern.
After Example:
STL Decomposition reveals trend, seasonality, and residuals separately. 📅
STL Decomposition plot shows clear trend, seasonal, and residual components.
Challenge: 🌟 Apply STL Decomposition to retail sales data to identify seasonal trends in product demand.
18. Quantile Regression 📊
Boilerplate Code:
import statsmodels.api as sm
model = sm.QuantReg(y, X)
results = model.fit(q=0.5) # Median regression
Use Case: Perform quantile regression to model different quantiles (e.g., median, 90th percentile) of the response variable.
Goal: Use Quantile Regression to understand how predictors affect different parts of the distribution (e.g., the tails or median). 🎯
Sample Code:
model = sm.QuantReg(y, X)
results = model.fit(q=0.5) # Median regression
print(results.summary())
Before Example:
You only model the mean of the distribution, missing insights about extreme values. 😞
Mean regression shows a relationship between predictors and the response variable.
After Example:
With Quantile Regression, you model different parts of the distribution, gaining insight into the extremes. 📊
Quantile regression (median) results show a different relationship compared to the mean.
Challenge: 🌟 Apply quantile regression on a salary dataset to understand disparities at the 10th, 50th, and 90th percentiles.
19. Generalized Estimating Equations (GEE) 📉
Boilerplate Code:
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Gaussian
model = GEE(y, X, groups=group_var, family=Gaussian())
results = model.fit()
Use Case: Fit marginal models for correlated data, such as repeated measures or clustered data.
Goal: Use GEE to handle within-group correlation and produce robust estimates in the presence of correlated observations. 🎯
Sample Code:
model = GEE(y, X, groups=group_var, family=Gaussian())
results = model.fit()
print(results.summary())
Before Example:
You ignore within-group correlation, leading to biased estimates for repeated measures data. 😕
Repeated measures data shows correlation within groups, but it's not accounted for.
After Example:
With GEE, you account for within-group correlation and produce more accurate estimates. 📉
GEE model results show adjusted coefficients accounting for within-group correlation.
Challenge: 🌟 Fit a GEE model to analyze patient health outcomes measured over multiple visits to the clinic.
20. Autoregressive Integrated Moving Average (ARIMA) 🕒
Boilerplate Code:
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
Use Case: Model and forecast time series data using ARIMA, which captures autoregressive, integrated, and moving average components.
Goal: Use ARIMA to handle trends, seasonality, and noise in time series data. 🎯
Sample Code:
model = ARIMA(y, order=(1, 1, 1))
results = model.fit()
print(results.summary())
Before Example:
Without ARIMA, you're unable to accurately forecast time series data because trends and patterns are ignored. 😕
Time series predictions based on a simple model ignore trends and autocorrelations.
After Example:
With ARIMA, you model the trend, seasonality, and noise in time series data for more accurate forecasts. 🕒
ARIMA results: Captured autocorrelations, trends, and moving averages for better forecasting.
Challenge: 🌟 Fit an ARIMA model on energy consumption data to forecast future usage based on historical patterns.