Generalized Estimating Equations
Generalized Estimating Equations (GEE) estimate generalized linear models for panel, cluster, or repeated-measures data in which observations may be correlated within a cluster but are uncorrelated across clusters. GEE supports the same one-parameter exponential families as Generalized Linear Models (GLM).
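For orientation, the estimating equations from Liang and Zeger (1986) can be sketched as below. The notation is introduced here only for illustration and is not used elsewhere on this page: $K$ is the number of clusters, $y_i$ and $\mu_i(\beta)$ are the response vector and mean vector for cluster $i$, $A_i$ is the diagonal matrix of variance-function values, $R(\alpha)$ is the working correlation matrix, and $\phi$ is the scale parameter (conventions for how $\phi$ enters differ between texts).

```latex
U(\beta) = \sum_{i=1}^{K} D_i^{\top} V_i^{-1} \bigl( y_i - \mu_i(\beta) \bigr) = 0,
\qquad
D_i = \frac{\partial \mu_i}{\partial \beta^{\top}},
\qquad
V_i = \phi \, A_i^{1/2} \, R(\alpha) \, A_i^{1/2}
```

The working correlation $R(\alpha)$ corresponds to the dependence structure chosen for the model, and the variance function behind $A_i$ comes from the distribution family.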
See Module Reference for commands and arguments.
Examples
The following illustrates a Poisson regression with exchangeable correlation within clusters, using data on epileptic seizures. The data set is downloaded by get_rdataset, so the example needs network access.
In [1]: import statsmodels.api as sm
In [2]: import statsmodels.formula.api as smf
In [3]: data = sm.datasets.get_rdataset('epil', package='MASS').data
In [4]: fam = sm.families.Poisson()
In [5]: ind = sm.cov_struct.Exchangeable()
In [6]: mod = smf.gee("y ~ age + trt + base", "subject", data,
...: cov_struct=ind, family=fam)
...:
In [7]: res = mod.fit()
In [8]: print(res.summary())
Several notebook examples of the use of GEE can be found on the Wiki: Wiki notebooks for GEE
References
- KY Liang and S Zeger (1986). “Longitudinal data analysis using generalized linear models”. Biometrika, 73(1), 13-22.
- S Zeger and KY Liang (1986). “Longitudinal data analysis for discrete and continuous outcomes”. Biometrics, 42(1), 121-130.
- A Rotnitzky and NP Jewell (1990). “Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data”. Biometrika, 77, 485-497.
- Xu Guo and Wei Pan (2002). “Small sample performance of the score test in GEE”. http://www.sph.umn.edu/faculty1/wp-content/uploads/2012/11/rr2002-013.pdf
- LA Mancl and TA DeRouen (2001). “A covariance estimator for GEE with improved small-sample properties”. Biometrics, 57(1), 126-134.
Module Reference
Model Class
GEE(endog, exog, groups[, time, family, ...]) | Estimation of marginal regression models using Generalized Estimating Equations (GEE).
Results Classes
GEEResults(model, params, cov_params, scale) | This class summarizes the fit of a marginal regression model using GEE.
GEEMargins(results, args[, kwargs]) | Estimated marginal effects for a regression model fit with GEE.
Dependence Structures
The dependence structures currently implemented are:
CovStruct([cov_nearest_method]) | A base class for correlation and covariance structures of grouped data.
Autoregressive([dist_func]) | A first-order autoregressive working dependence structure.
Exchangeable() | An exchangeable working dependence structure.
GlobalOddsRatio(endog_type) | Estimate the global odds ratio for a GEE with ordinal or nominal data.
Independence([cov_nearest_method]) | An independence working dependence structure.
Nested([cov_nearest_method]) | A nested working dependence structure.
Families
The distribution families are the same as for GLM. Those currently implemented are:
Family(link, variance) | The parent class for one-parameter exponential families.
Binomial([link]) | Binomial exponential family distribution.
Gamma([link]) | Gamma exponential family distribution.
Gaussian([link]) | Gaussian exponential family distribution.
InverseGaussian([link]) | InverseGaussian exponential family.
NegativeBinomial([link, alpha]) | Negative Binomial exponential family.
Poisson([link]) | Poisson exponential family.
Link Functions
The link functions are the same as for GLM; those currently implemented are listed below. Not all link functions are available for every distribution family. The link functions available for a given family can be obtained with
>>> sm.families.family.<familyname>.links
Link | A generic link function for one-parameter exponential family.
CDFLink([dbn]) | Use the CDF of a scipy.stats distribution
CLogLog | The complementary log-log transform
Log | The log transform
Logit | The logit transform
NegativeBinomial([alpha]) | The negative binomial link function
Power([power]) | The power transform
cauchy() | The Cauchy (standard Cauchy CDF) transform
cloglog | The CLogLog transform link function.
identity() | The identity transform
inverse_power() | The inverse transform
inverse_squared() | The inverse squared transform
log | The log transform
logit | The logit transform
nbinom([alpha]) | The negative binomial link function.
probit([dbn]) | The probit (standard normal CDF) transform