/* Program taken from the UCLA Computing website
   http://www.ats.ucla.edu/stat/sas/dae/poissonreg.htm
   and slightly modified by me. */

/* The data constitute attendance data on 316 high school juniors from
   two urban high schools in the file poissonreg.csv. The response
   variable of interest is days absent, daysabs. The variables math and
   langarts give the standardized test scores for math and language
   arts, respectively. The variable male is a binary indicator of
   student gender. */

data poissonreg;
   infile "c:\data\classdat\poisson\poissonreg.csv" delimiter=',' firstobs=2;
   input id school male math langarts daysatt daysabs;
run;

proc means data = poissonreg mean std min max var;
   var daysabs math langarts male;
run;

proc univariate data = poissonreg noprint;
   histogram daysabs / midpoints = 0 to 50 by 1 vscale = count;
run;

proc freq data = poissonreg;
   tables male;
run;

/* Here we assume that the mean and variance of the counts are the same
   (equidispersion) and thus use the Poisson distribution. Of course
   this assumption should be formally tested. */

proc genmod data = poissonreg;
   model daysabs = male math langarts / dist=poisson;
run;

/* Here we apply the negative binomial distribution. We would use this
   one in the overdispersion case, where the variance of the counts is
   greater than the mean of the counts. */

proc genmod data = poissonreg;
   model daysabs = male math langarts / dist=negbin;
run;

/* If there are more zeros than would be expected under either a Poisson
   model or a negative binomial model, we could use a zero-inflated
   regression model, which older releases of Proc Genmod did not
   support. More recent releases accept dist=zip together with a
   zeromodel statement. */

/* Just to be on the safe side, let's rerun proc genmod with the Poisson
   distribution and the "repeated" statement in order to obtain robust
   standard errors for the Poisson regression coefficients.
*/

proc genmod data = poissonreg;
   class id;
   model daysabs = male math langarts / dist=poisson;
   repeated subject=id / type=cs;
run;

/* The robust standard errors attempt to adjust for heterogeneity in the
   model. Using them changes the standard errors by a fairly large
   amount, and the robust values should be more appropriate. The z-tests
   still yield similar significance results, but give more realistic
   p-values. */

/* The variable math was borderline significant without the "repeated"
   statement and is clearly not significant with it. Since math is not
   significant in the model with robust standard errors, we will rerun
   the model dropping that variable. */

proc genmod data = poissonreg;
   class id;
   model daysabs = male langarts / dist=poisson;
   repeated subject=id / type=cs;
run;

/* The model fits the data significantly better than the null model,
   i.e. the intercept-only model. To show that this is the case, we can
   run the null model and compare it with the current model using a
   chi-squared test on the difference of the log likelihoods. */

proc genmod data = poissonreg;
   class id;
   model daysabs = / type3 dist=poisson;
   repeated subject=id / type=cs;
run;
quit;

/* The log likelihood is 1480.3813 for the full model and 1394.6299 for
   the null model. The chi-squared value is
   2*(1480.3813 - 1394.6299) = 171.5028. Since we have two predictor
   variables in the full model, the chi-squared test has 2 degrees of
   freedom. This yields a p-value < 0.0001. */

/* Finally, we will use the estimate statement to get the predicted
   number of days absent for the male and female groups when langarts
   is held at its mean.
*/

proc genmod data = poissonreg;
   class id;
   model daysabs = male langarts / dist=poisson;
   repeated subject=id / type=cs;
   estimate "male"   langarts 50.0637938 male 1 intercept 1 / exp;
   estimate "female" langarts 50.0637938 male 0 intercept 1 / exp;
run;

/* The Poisson regression model predicting days absent from school from
   language arts score and gender was statistically significant, with
   likelihood ratio chi-square = 171.503, df = 2, yielding a
   p-value < 0.0001. The predictors langarts and male were each
   statistically significant. For these data, the expected change in
   log count for a one-unit increase in language arts was -0.0146. Male
   students had an expected log count 0.41 less than female students. */
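
/* A minimal sketch, not part of the original UCLA program: the
   likelihood ratio test reported above can be reproduced in a data
   step. The two log likelihoods are the values quoted in the comments;
   the data set name lrtest and its variable names are hypothetical and
   chosen only for illustration. */

data lrtest;
   ll_full = 1480.3813;                 /* full model: male, langarts  */
   ll_null = 1394.6299;                 /* intercept-only model        */
   df      = 2;                         /* two predictors were dropped */
   chisq   = 2 * (ll_full - ll_null);   /* likelihood ratio statistic  */
   pvalue  = 1 - probchi(chisq, df);    /* upper-tail chi-squared area */
run;

proc print data = lrtest noobs;
run;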