Intro to GLM: Calculating Pure Premium with SAS
Data Description
These data were compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance, summarized in Hallin and Ingenbleek (1983) and Andrews and Herzberg (1985). The data are cross-sectional, describing third-party automobile insurance claims for the year 1977.
The outcomes of interest are the number of claims (the basis for claim frequency) and the sum of payments (the basis for claim severity), in Swedish kronor. Outcomes are based on 5 categories of distance driven by a vehicle, broken down by 7 geographic zones, 7 categories of recent driver claims experience and 9 types of automobile. Even though there are 2,205 potential distance, zone, experience and type combinations (5 x 7 x 7 x 9 = 2,205), only n = 2,182 were realized in the 1977 data set.
File Name: SwedishMotorInsurance
Number of obs: 2182
Number of variables: 7
| Variable | Description |
|---|---|
| Kilometres | Distance driven by a vehicle, grouped into five categories |
| Zone | Geographic zone of a vehicle, grouped into 7 categories |
| Bonus | Driver claim experience, grouped into 7 categories |
| Make | The type of a vehicle |
| Insured | The number of policyholder years. A “policyholder year” is the fraction of the year that the policyholder has a contract with the issuing company. |
| Claims | Number of claims |
| Payment | Sum of payments |
Source: Hallin and Ingenbleek (1983) and Andrews and Herzberg (1985).
Example of the first five observations:
| Kilometres | Zone | Bonus | Make | Insured | Claims | Payment |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 455.13 | 108 | 392491 |
| 1 | 1 | 2 | 1 | 69.17 | 19 | 46221 |
| 1 | 1 | 3 | 1 | 72.88 | 13 | 15694 |
| 1 | 1 | 4 | 4 | 1292.39 | 124 | 422201 |
| 1 | 1 | 5 | 1 | 191.01 | 40 | 119373 |
Reference: Frees, Edward W. Regression Modeling with Actuarial and Financial Applications. Cambridge University Press, 2009. Data set: Swedish Automobile Claims.
Input Data
PROC IMPORT Step:
DATAFILE=REFFILE specifies the path and filename of the CSV file to be imported.
DBMS=CSV specifies that the file to be imported is in CSV format.
OUT=glm.swe specifies the name of the dataset created by the import.
GETNAMES=YES specifies that the first row of the CSV file contains the variable names.
RUN; executes the PROC IMPORT step.
PROC CONTENTS is used to view the variable names, types, lengths, etc., of the imported dataset “glm.swe”.
/* Specify REFFILE as the path to the CSV file to be imported */
FILENAME REFFILE '/[file path]/SwedishMotorInsurance.csv';

PROC IMPORT DATAFILE=REFFILE
  DBMS=CSV
  OUT=glm.swe;
  GETNAMES=YES;
RUN;

PROC CONTENTS DATA=glm.swe; RUN;
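After the import, it can help to confirm that the data match the description above. The following is a minimal sketch, assuming the import created glm.swe with the seven variables listed in the table. It prints the first five rows for comparison with the sample table and counts the rows at each level of the four rating factors.

```sas
/* Print the first five rows to compare with the sample table */
proc print data=glm.swe(obs=5);
run;

/* Count rows per level of each rating factor */
proc freq data=glm.swe;
  tables Kilometres Zone Bonus Make;
run;
```

The PROC FREQ output is also a convenient place to see which levels hold the most rows when choosing a reference level later.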
Fitting Claim Frequency and Severity Separately
Pure premium = Claim frequency * Claim severity
Claim frequency = Number of claims / Earned exposure
Claim severity = Total payments / Number of claims
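Because the number of claims cancels, the three definitions above imply that pure premium is total payments per unit of exposure. As a worked illustration, using the first observation of the sample table and taking Insured as the earned exposure:

```latex
\[
\text{Pure premium}
  = \frac{\text{Claims}}{\text{Insured}} \times \frac{\text{Payment}}{\text{Claims}}
  = \frac{\text{Payment}}{\text{Insured}}
\]
\[
\text{Frequency} = \frac{108}{455.13} \approx 0.2373, \qquad
\text{Severity} = \frac{392491}{108} \approx 3634.18, \qquad
\text{Pure premium} = \frac{392491}{455.13} \approx 862.37
\]
```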
Create new variables frequency and severity as target variables for the GLM model.
data glm.swe;
  set glm.swe;
  severity = Payment / Claims;   /* average payment per claim; missing when Claims = 0 */
  frequency = Claims / Insured;  /* claims per policyholder year */
run;
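Rows with zero claims have no defined severity: SAS sets the result of a division by zero to missing and writes a note to the log. A quick check of the new variables, sketched here on the assumption that the data step above has already run, shows how many rows are affected.

```sas
/* Summarize the new target variables; NMISS counts rows where
   severity is missing because Claims = 0 */
proc means data=glm.swe n nmiss min mean max;
  var frequency severity;
run;
```

Rows with a missing response are left out of the severity model, so the NMISS count indicates how many cells do not contribute to that fit.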
Claim Frequency Poisson Model
data=glm.swe; specifies the dataset.
class kilometres(ref="2") zone(ref="1") bonus make; specifies the categorical variables. By default, SAS uses the last level in sorted order (for example, Z for character values) as the base level (reference level). A common choice is instead to use the level with the most observations as the base. If the default base is not appropriate, use ref= to set it explicitly.
model frequency = kilometres zone bonus make specifies frequency as the dependent variable, and kilometres, zone, bonus, and make as independent variables.
/dist=poisson link=log; uses the Poisson distribution and log link function.
ods output parameterestimates=glm.frequency; writes the parameter estimates to the specified dataset. In SAS, ODS stands for “Output Delivery System”, which controls the format, location, and style of output. ODS can also route results to destinations such as HTML, PDF, RTF, and Excel.
weight insured; uses the insured variable as the weight, so that cells with more policyholder years have more influence on the fit.
run; executes the proc genmod step.
proc genmod data=glm.swe;
  class kilometres(ref="2") zone(ref="1") bonus make;
  model frequency = kilometres zone bonus make / dist=poisson link=log;
  ods output parameterestimates=glm.frequency;
  weight insured;
run;
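A common alternative way to write the same frequency model is to use the claim count as the response and the log of exposure as an offset. With a log link, the score equations of this form match those of the exposure-weighted model on frequency, so the coefficient estimates should agree. The sketch below assumes Insured is strictly positive in every row; the dataset name work.swe_offset and the variable log_insured are hypothetical names introduced for illustration.

```sas
/* Assumption: Insured > 0 in every row, so log(Insured) is defined */
data work.swe_offset;
  set glm.swe;
  log_insured = log(Insured);   /* hypothetical offset variable */
run;

/* Poisson model on claim counts with log exposure as the offset */
proc genmod data=work.swe_offset;
  class kilometres(ref="2") zone(ref="1") bonus make;
  model claims = kilometres zone bonus make
        / dist=poisson link=log offset=log_insured;
run;
```

This form keeps the response an integer count, which some readers find easier to reason about than a Poisson model on a non-integer rate.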
R makes things simple
Specify reference level:
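swe[c("Kilometres", "Zone", "Bonus", "Make")] <- lapply(swe[c("Kilometres", "Zone", "Bonus", "Make")], factor)
swe$Kilometres <- relevel(swe$Kilometres, ref = "2")
Fit the model. The rating variables must be factors before relevel() can be used. R uses the first level as the reference by default, while SAS uses the last, so to reproduce the SAS estimates Bonus and Make also need their last levels as the reference (for example, relevel(swe$Bonus, ref = "7"), assuming the levels are labelled 1 to 7). R warns about non-integer responses under the Poisson family, but the model is still fitted.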
freq <- glm(frequency ~ Kilometres + Zone + Bonus + Make,
            family = poisson, data = swe, weights = Insured)
Claim Severity Gamma Model
Use the same statements as the previous model, but replace the target variable with severity, and specify the distribution as gamma.
weight claims; uses the claims variable as the weight, because each severity value is an average over that many claims.
proc genmod data=glm.swe;
  class kilometres(ref="2") zone(ref="1") bonus make;
  model severity = kilometres zone bonus make / dist=gamma link=log;
  ods output parameterestimates=glm.severity;
  weight claims;
run;
R makes things simple
The canonical link for the Gamma distribution is the inverse function, but here we use the log link function.
The log link keeps the fitted severities positive and makes the effects multiplicative, which matches the frequency model.
With the inverse link, the linear predictor models the reciprocal of the mean severity. Because severities here are in the thousands of kronor, the parameter estimates would be very small in magnitude and harder to interpret.
sev <- glm(severity ~ Kilometres + Zone + Bonus + Make,
           family = Gamma(link = "log"), data = swe, weights = Claims)
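Since both models use a log link, their fitted values combine multiplicatively into a pure premium. The notation below is an assumption introduced for illustration: x is the vector of indicator variables for a rating cell, and β_F and β_S are the coefficient vectors of the frequency and severity models.

```latex
\[
\widehat{\text{Pure premium}}(x)
  = \underbrace{\exp\!\left(x^{\top}\hat\beta_F\right)}_{\text{fitted frequency}}
    \times
    \underbrace{\exp\!\left(x^{\top}\hat\beta_S\right)}_{\text{fitted severity}}
  = \exp\!\left(x^{\top}\left(\hat\beta_F + \hat\beta_S\right)\right)
\]
```

In other words, adding the two sets of coefficients level by level and exponentiating gives the pure premium relativities, provided both models use the same reference levels.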