Intro to GLM: Calculating Pure Premium with SAS

2023-12-21

Data Description

These data were compiled by the Swedish Committee on the Analysis of Risk Premium in Motor Insurance, summarized in Hallin and Ingenbleek (1983) and Andrews and Herzberg (1985). The data are cross-sectional, describing third-party automobile insurance claims for the year 1977.

The outcomes of interest are the number of claims (from which claim frequency is derived) and the sum of payments (from which claim severity is derived), in Swedish kronor. Outcomes are based on 5 categories of distance driven by a vehicle, broken down by 7 geographic zones, 7 categories of recent driver claims experience and 9 types of automobile. Even though there are 2,205 potential distance, zone, experience and type combinations (5 x 7 x 7 x 9 = 2,205), only n = 2,182 were realized in the 1977 data set.
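The count of potential combinations can be checked by enumerating the full grid of rating cells. This is a minimal R sketch; it assumes the levels are simply coded 1 through k for each variable, and the names grid, n_potential and n_missing are illustrative only.

```r
# Enumerate every potential rating cell; levels are assumed to be coded 1..k
grid <- expand.grid(Kilometres = 1:5, Zone = 1:7, Bonus = 1:7, Make = 1:9)

n_potential <- nrow(grid)          # 5 * 7 * 7 * 9 potential cells
n_missing   <- n_potential - 2182  # potential cells with no row in the data
```

Under that assumption, 23 of the potential combinations have no observation in the 1977 data.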

File Name: SwedishMotorInsurance

Number of obs: 2182

Number of variables: 7

Variable	Description
Kilometres	Distance driven by a vehicle, grouped into five categories
Zone	Geographic zone of a vehicle, grouped into 7 categories
Bonus	Driver claim experience, grouped into 7 categories
Make	The type of a vehicle
Insured	The number of policyholder years. A “policyholder year” is the fraction of the year that the policyholder has a contract with the issuing company.
Claims	Number of claims
Payment	Sum of payments

Source: Hallin and Ingenbleek (1983) and Andrews and Herzberg (1985).

Example of the first five observations:

Kilometres	Zone	Bonus	Make	Insured	Claims	Payment
1	1	1	1	455.13	108	392491
1	1	2	1	69.17	19	46221
1	1	3	1	72.88	13	15694
1	1	4	1	1292.39	124	422201
1	1	5	1	191.01	40	119373

Reference: Frees, Edward W. Regression Modeling with Actuarial and Financial Applications. Cambridge University Press, 2009. Data set: Swedish Automobile Claims.

Input Data

PROC IMPORT Step:

DATAFILE=REFFILE specifies the path and filename of the CSV file to be imported.
DBMS=CSV specifies that the file to be imported is in CSV format.
OUT=glm.swe specifies the name of the dataset to be created after import.
GETNAMES=YES specifies that the first row of the imported CSV file contains the variable names.
RUN; executes the PROC IMPORT step.

PROC CONTENTS is used to view the variable names, types, lengths, etc., of the imported dataset “glm.swe”.

/* Specify REFFILE as the path to the CSV file to be imported */
/* The two-level name glm.swe requires the libref glm to have been assigned with a LIBNAME statement */
FILENAME REFFILE '/[file path]/SwedishMotorInsurance.csv';

PROC IMPORT DATAFILE=REFFILE
    DBMS=CSV
    OUT=glm.swe;
    GETNAMES=YES;
RUN;

PROC CONTENTS DATA=glm.swe; RUN;

Fitting Claim Frequency and Severity Separately

Pure premium = Claim frequency * Claim severity

Claim frequency = Number of claims / Earned exposure

Claim severity = Total payments / Number of claims
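As a worked example of these three formulas, the sketch below applies them in R to the first observation of the data set (Insured = 455.13, Claims = 108, Payment = 392491), taking the Insured variable as the earned exposure. The variable names are illustrative only.

```r
# First observation of the data set
insured <- 455.13   # policyholder years (earned exposure)
claims  <- 108      # number of claims
payment <- 392491   # sum of payments

frequency    <- claims / insured      # claims per policyholder year
severity     <- payment / claims      # average payment per claim
pure_premium <- frequency * severity  # expected payment per policyholder year

# The number of claims cancels, so pure premium is also payments per unit of exposure
all.equal(pure_premium, payment / insured)
```

The last line illustrates why the product is a sensible premium basis: it is simply the total payment divided by the exposure.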

Create new variables frequency and severity as target variables for the GLM model.

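data glm.swe;
    set glm.swe;
    /* Severity is undefined for cells with no claims; leave it missing */
    if Claims > 0 then severity = Payment / Claims;
    /* Claims per policyholder year */
    frequency = Claims / Insured;
run;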
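Severity is undefined for any cell with zero claims, because the denominator is zero; in SAS the division yields a missing value. The R sketch below shows the same derivation with an explicit guard. The first row is the first observation of the data set; the second row is a hypothetical cell, invented here purely to illustrate the zero-claim case.

```r
# Row 1: first observation of the data set
# Row 2: hypothetical cell with exposure but no claims (illustration only)
cells <- data.frame(
  Insured = c(455.13, 10.00),
  Claims  = c(108, 0),
  Payment = c(392491, 0)
)

cells$frequency <- cells$Claims / cells$Insured

# Leave severity missing where there are no claims (0 / 0 would give NaN)
cells$severity <- ifelse(cells$Claims > 0,
                         cells$Payment / cells$Claims,
                         NA_real_)
```

With R's default na.action, glm() drops rows whose response is missing, so such cells would not enter a severity model but would still contribute to a frequency model.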

Claim Frequency Poisson Model

data=glm.swe; specifies the dataset.

class kilometres(ref="2") zone(ref="1") bonus make; declares the categorical variables. By default, SAS takes the last level in sorted order (the largest number, or Z for letters) as the reference (base) level. A common choice is instead the level with the most observations; when the default is not appropriate, use ref= to set the base explicitly.

model frequency = kilometres zone bonus make specifies frequency as the dependent variable, and kilometres, zone, bonus, and make as independent variables.

/dist=poisson link=log; uses the Poisson distribution and log link function.

ods output parameterestimates=glm.frequency; saves the parameter estimates table to the specified dataset. ODS stands for “Output Delivery System”, the SAS facility that controls the format, destination, and style of procedure output. It can route results to formats such as HTML, PDF, RTF, and Excel, or, as here, to a SAS dataset.

weight insured; weights each cell by its exposure (the insured variable, in policyholder years), so that cells with more exposure have more influence on the fit.

run; executes the proc genmod step.

proc genmod data=glm.swe;
    class kilometres(ref="2") zone(ref="1") bonus make;
    model frequency = kilometres zone bonus make / dist=poisson link=log;
    ods output parameterestimates=glm.frequency;
    weight insured;
run;

R makes things simple

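Convert the rating variables to factors and specify the reference level. R uses the first level of a factor as the reference by default, whereas SAS uses the last, so Bonus and Make must also be releveled to their last level if the coefficients are to match the SAS output.

swe[1:4] <- lapply(swe[1:4], factor)  # Kilometres, Zone, Bonus, Make
swe$Kilometres <- relevel(swe$Kilometres, ref = "2")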
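Because SAS defaults to the last level and R to the first, a variable left at its default has a different base in each system. The sketch below shows the difference on a stand-alone factor; it assumes Bonus is coded 1 through 7, and the names bonus and bonus_sas are illustrative only.

```r
# A stand-alone factor with the seven Bonus levels (assumed coding 1..7)
bonus <- factor(1:7)

levels(bonus)[1]        # R's default reference level: the first level

# Move the last level to the front to mimic the SAS default
bonus_sas <- relevel(bonus, ref = "7")
levels(bonus_sas)[1]    # reference level after releveling
```

The fitted values of a model do not depend on the choice of base; only the coefficients change, since each one is measured relative to the reference level.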

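Fit the model. The estimates should be consistent with the SAS output, provided the factor reference levels match those used in SAS. R may warn about non-integer values because frequency is a rate rather than a count; the warning does not change the coefficient estimates.

freq <- glm(frequency ~ Kilometres + Zone + Bonus + Make,
    family = poisson, data = swe, weights = Insured)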
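Modeling the claim rate with exposure weights is one of two common formulations. With a Poisson distribution and log link it gives the same coefficients as modeling the claim count with the log of exposure as an offset. The sketch below checks this on the first five observations of the data set, using an intercept-only model for brevity; the names swe5, fit_w and fit_o are illustrative only.

```r
# First five observations of the data set (Insured and Claims columns)
swe5 <- data.frame(
  Insured = c(455.13, 69.17, 72.88, 1292.39, 191.01),
  Claims  = c(108, 19, 13, 124, 40)
)
swe5$frequency <- swe5$Claims / swe5$Insured

# (a) Rate as response, exposure as weight; the non-integer warning is suppressed
fit_w <- suppressWarnings(
  glm(frequency ~ 1, family = poisson, data = swe5, weights = Insured)
)

# (b) Count as response, log exposure as offset
fit_o <- glm(Claims ~ 1 + offset(log(Insured)), family = poisson, data = swe5)

# The two coefficient vectors should agree up to numerical tolerance
all.equal(coef(fit_w), coef(fit_o), tolerance = 1e-6)
```

The offset form keeps the response an integer count, which avoids the warning and keeps likelihood-based statistics such as AIC on the usual Poisson scale.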

Claim Severity Gamma Model

Use the same statements as the previous model, but replace the target variable with severity, and specify the distribution as gamma.

weight claims; weights each cell by its number of claims, because each severity value is an average over that many claims.

proc genmod data=glm.swe;
    class kilometres(ref="2") zone(ref="1") bonus make;
    model severity = kilometres zone bonus make / dist=gamma link=log;
    ods output parameterestimates=glm.severity;
    weight claims;
run;

R makes things simple

The canonical link for the Gamma distribution is the inverse function, but here we use the log link function.

The log link keeps the fitted mean positive and makes the effects of the predictors multiplicative, which suits count data and claim severities.

The inverse link models the reciprocal of the mean as a linear function of the predictors, which can be natural for some responses such as measured time intervals. If the inverse link were chosen here, the parameters would be on the scale of 1/severity and therefore very small in magnitude.
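The difference in scale can be seen by fitting the same Gamma model under both links. The sketch below uses the first five observations of the data set and an intercept-only model, so each coefficient is a transformation of the claims-weighted mean severity; the names swe5, sev_log and sev_inv are illustrative only.

```r
# First five observations of the data set (Claims and Payment columns)
swe5 <- data.frame(
  Claims  = c(108, 19, 13, 124, 40),
  Payment = c(392491, 46221, 15694, 422201, 119373)
)
swe5$severity <- swe5$Payment / swe5$Claims

sev_log <- glm(severity ~ 1, family = Gamma(link = "log"),
               data = swe5, weights = Claims)
sev_inv <- glm(severity ~ 1, family = Gamma(link = "inverse"),
               data = swe5, weights = Claims)

coef(sev_log)  # log of the fitted mean severity
coef(sev_inv)  # reciprocal of the fitted mean severity
```

Both fits imply the same mean severity here; only the scale on which the coefficient is expressed differs.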

sev <- glm(severity ~ Kilometres + Zone + Bonus + Make,
    family = Gamma(link = "log"), data = swe, weights = Claims)
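Since both models use a log link, their coefficients add on the log scale, so the fitted frequency and severity multiply into a pure premium. The sketch below shows this combination on the first five observations of the data set with intercept-only models; it is an illustration under those simplifying assumptions, not the full rating model, and the names swe5, freq5 and sev5 are illustrative only.

```r
# First five observations of the data set
swe5 <- data.frame(
  Insured = c(455.13, 69.17, 72.88, 1292.39, 191.01),
  Claims  = c(108, 19, 13, 124, 40),
  Payment = c(392491, 46221, 15694, 422201, 119373)
)
swe5$frequency <- swe5$Claims / swe5$Insured
swe5$severity  <- swe5$Payment / swe5$Claims

# Intercept-only frequency and severity models, weighted as in the text
freq5 <- suppressWarnings(
  glm(frequency ~ 1, family = poisson, data = swe5, weights = Insured)
)
sev5 <- glm(severity ~ 1, family = Gamma(link = "log"),
            data = swe5, weights = Claims)

# Log links: adding coefficients is the same as multiplying fitted means
pure_premium <- exp(coef(freq5) + coef(sev5))

# For intercept-only models this reduces to total payments over total exposure
all.equal(unname(pure_premium), sum(swe5$Payment) / sum(swe5$Insured),
          tolerance = 1e-6)
```

With the full models, the same idea applies cell by cell: multiplying predict(freq, type = "response") by predict(sev, type = "response") gives a fitted pure premium for each combination of rating factors.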