- Mastering Scientific Computing with R
- Paul Gerrard, Radia M. Johnson
Fitting distributions
Now that we have seen how to plot and obtain statistical information from probability distributions, we will show you how to use R to fit your data to theoretical distributions. There are several methods you can use to test whether your sample distribution fits a theoretical distribution. For example, you may want to see whether your sample fits a normal distribution using a quantile-quantile plot (Q-Q plot). In R, you can use the qqnorm() function to create a Q-Q plot to evaluate the data. R also has a more generic version of this function, qqplot(), which creates Q-Q plots for any theoretical distribution. To illustrate the use of these functions, let's create Q-Q plots to test whether the gene expression values of probeA follow a normal or gamma distribution.
First, let's set the plot settings to display two figures in the same layout:
> par(mfrow=c(1,2))
Use qqnorm() to compare the data to a normal distribution:
> qqnorm(probeA)
Add the theoretical line for a normal distribution using the default first and third quartiles:
> qqline(probeA, distribution = qnorm, probs = c(0.25, 0.75))
To examine the fit of our data to a gamma distribution, let's estimate the shape and rate parameters using the fitdistr() function from the MASS package:
> require("MASS")
> fitdistr(probeA, 'gamma')
      shape        rate
  341.75868    71.58950
 ( 74.52444) ( 15.62234)
Let's store the gamma parameters in an object for future use:
> gamma.par <- fitdistr(probeA, 'gamma')
To know what is stored in our gamma.par object, we can use the str() function we mentioned earlier in Chapter 1, Programming with R. Let's have a look at the following output:
> str(gamma.par)
List of 5
 $ estimate: Named num [1:2] 341.8 71.6
  ..- attr(*, "names")= chr [1:2] "shape" "rate"
 $ sd      : Named num [1:2] 74.5 15.6
  ..- attr(*, "names")= chr [1:2] "shape" "rate"
 $ vcov    : num [1:2, 1:2] 5554 1163 1163 244
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "shape" "rate"
  .. ..$ : chr [1:2] "shape" "rate"
 $ loglik  : num -2.37
 $ n       : int 42
 - attr(*, "class")= chr "fitdistr"
By inspecting the structure of the gamma.par object, we now know how to access the shape and rate values. We can index the named numeric vector gamma.par$estimate either by name or by position, as follows:
> gamma.par$estimate['shape'] # or gamma.par$estimate[1]
   shape
341.7587
> s <- gamma.par$estimate['shape']
> r <- gamma.par$estimate['rate']
Now, we can calculate the theoretical quantiles for a gamma distribution with the estimated values for the shape and rate, as shown:
> theoretical.probs <- seq(1:length(probeA))/(length(probeA)+1)
> theoretical.quantiles <- qgamma(theoretical.probs, shape=s, rate=r)
> plot(theoretical.quantiles, sort(probeA), xlab="Theoretical Quantiles",
    ylab="Sample Quantiles", main="Gamma QQ-plot")
Now, let's add a line through the first and third quartiles using the qqline() function. Since we are not using the default qnorm function for the distribution argument, we need to define our own quantile function with qgamma() and the appropriate shape and rate settings, as follows:
> qF <- function(p) qgamma(p, shape=s, rate=r)
> qqline(y=sort(probeA), distribution=qF)
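As a side note, the generic qqplot() function mentioned earlier can build an equivalent Gamma Q-Q plot in a single call. The following is a self-contained sketch, not the book's exact data: probeA is simulated here as a stand-in, and s and r simply reuse the fitted values from above. ppoints() generates plotting probabilities much like seq(...)/(n + 1), though with a slightly different convention:

```r
# Stand-in data and parameters so this sketch runs on its own; in the text,
# probeA, s, and r come from the earlier steps.
set.seed(1)
probeA <- rgamma(42, shape = 341.76, rate = 71.59)
s <- 341.76
r <- 71.59

# Theoretical gamma quantiles at evenly spread probabilities
theoretical <- qgamma(ppoints(length(probeA)), shape = s, rate = r)

# qqplot() sorts the sample and plots it against the theoretical quantiles
qqplot(theoretical, probeA,
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       main = "Gamma QQ-plot")
```

This reproduces the manual construction above in one step; the manual version is still useful when you want full control over the probability sequence.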
The resulting Normal Q-Q plot and Gamma Q-Q plot are shown in the following diagram:

Before going any further, let's return our plot parameter settings to their defaults using the par() function:
> par(mfrow=c(1,1))
Instead of Q-Q plots, you may choose to use a well-defined statistical test for fitting distributions, such as the Kolmogorov-Smirnov, Anderson-Darling, or chi-square test. For example, let's use the Kolmogorov-Smirnov method to test whether our probeA data fits a gamma distribution with a shape of three and a rate of two using the ks.test() function, as follows:
> ks.test(probeA, "pgamma", 3, 2)

        One-sample Kolmogorov-Smirnov test

data:  probeA
D = 0.9901, p-value = 2.22e-16
alternative hypothesis: two-sided
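The tiny p-value here largely reflects how far the arbitrary parameters (shape 3, rate 2) are from the data. A minimal sketch of testing against the fitted parameters instead is shown below; probeA is simulated as a stand-in since the book's data is not reproduced here. Note that ks.test() p-values are only approximate when the parameters were estimated from the same data:

```r
library(MASS)

# Stand-in data for illustration; in the text, probeA comes from earlier steps.
set.seed(1)
probeA <- rgamma(42, shape = 341.76, rate = 71.59)

# Fit the gamma parameters, then test against the fitted distribution
fit <- fitdistr(probeA, "gamma")
ks.test(probeA, "pgamma",
        shape = fit$estimate["shape"], rate = fit$estimate["rate"])
```

With the fitted parameters, the test statistic D is small and the p-value large, consistent with the gamma fit.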
Alternatively, we can use the Anderson-Darling test to check whether our probeA dataset fits a normal distribution using the ad.test() function available in the nortest package:
> require("nortest")
> ad.test(probeA)

        Anderson-Darling normality test

data:  probeA
A = 0.378, p-value = 0.3923
Higher order moments of a distribution
Higher order moments of a sample distribution are often used to estimate shape parameters, such as skewness and kurtosis, or to measure the deviation of a sample distribution from a normal distribution. For example, we might want to know whether the skew we observed in our probeA distribution has a long drawn-out tail to the left (negative skewing) or to the right (positive skewing) compared to a normal curve. You can quantify the degree of skewing using the skewness() function available in the fBasics package and test whether the degree of skewing you calculate from your dataset is significantly different from the theoretical skew of a normal distribution, which is zero.
Let's have a look at the following code:
> require("fBasics")
> skewness(probeA)
[1] -0.1468461
attr(,"method")
[1] "moment"
Now, we can determine whether the -0.147 skew is significantly different from zero by performing a t-test. The approximate standard error of a skew is the square root of 6 divided by the total number of samples, that is, sqrt(6/n):
> abs(skewness(probeA))/sqrt(6/length(probeA))
[1] 0.3885183
attr(,"method")
[1] "moment"
Now, we just need to calculate the probability of obtaining a t-value of 0.389 by chance when the skew value is truly zero:
> 1 - pt(0.389, 41)
[1] 0.3496446
Since the p-value is greater than 0.05, we cannot reject the null hypothesis, and we conclude that the skew is not significantly different from zero. Thus, the skew is not significantly different from that of a normal curve.
Instead of calculating the skew, we can also evaluate the kurtosis value to get an idea of the peaked or flat-topped nature of the distribution. A normal curve has an excess kurtosis value of zero, and the approximate standard error of kurtosis is the square root of 24 divided by the total number of samples, that is, sqrt(24/n). To obtain the kurtosis value in R, we can use the kurtosis() function, which is also available in the fBasics package:
> kurtosis(probeA)
[1] -0.5670091
attr(,"method")
[1] "excess"
> abs(kurtosis(probeA))/sqrt(24/length(probeA))
[1] 0.7500826
attr(,"method")
[1] "excess"
> 1 - pt(0.750, 41)
[1] 0.2287686
Since the p-value is greater than 0.05, we cannot reject the null hypothesis, and we conclude that the kurtosis of our sample distribution is not significantly different from that of a normal distribution.
Other statistical tests to fit distributions
R also has many other functions available to help fit your sample distribution to other types of distributions. For convenience, you can find a list of R functions available for fitting distributions in the following table. You can also consult Vito Ricci's document on fitting distributions at http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf for more information on how to use these functions.

You may also find the propagate package particularly useful because it allows you to fit your data to many distributions at once and tells you which one is the most appropriate. For example, we can use the fitDistr() function to fit your data to a variety of distributions. The fits are sorted by ascending Akaike information criterion (AIC) value, where the preferred model is the one with the minimum AIC. The AIC takes into account the trade-off between the goodness of fit of the model and its complexity. Let's have a look at the following example:
> install.packages("propagate")
> library("propagate")
The fitDistr() function allows you to use a vector of observations or an object generated using the propagate() function. In our first example, let's use the rnorm() function to simulate 10,000 numerical observations. Let's have a look at the following example:
> set.seed(275) # so you get the same results
> observations <- rnorm(10000, 5)
> distTested <- fitDistr(observations)
Fitting Normal distribution...Done.
Fitting Skewed-normal distribution...Done.
Fitting Generalized normal distribution...........10.........20.......Done.
Fitting Log-normal distribution...Done.
Fitting Scaled/shifted t- distribution...Done.
Fitting Logistic distribution...Done.
Fitting Uniform distribution...Done.
Fitting Triangular distribution...Done.
Fitting Trapezoidal distribution...Done.
Fitting Curvilinear Trapezoidal distribution...Done.
Fitting Generalized Trapezoidal distribution...Done.
Fitting Gamma distribution...Done.
Fitting Cauchy distribution...Done.
Fitting Laplace distribution...Done.
Fitting Gumbel distribution...Done.
Fitting Johnson SU distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting Johnson SB distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting 3P Weibull distribution...........10.........20.......Done.
Fitting 4P Beta distribution...Done.
Fitting Arcsine distribution...Done.
Fitting von Mises distribution...Done.
As you can see, the fitDistr() function automatically plots the best distribution based on the AIC value. To disable this feature, you can set plot = FALSE. The plot is shown in the following graph:

You can see the AIC values for the other distributions by accessing the aic data frame of the distTested object. Let's have a look at the following example:
> distTested$aic
              Distribution        AIC
2            Skewed-normal -1056.7637
16              Johnson SU -1055.2898
3       Generalized normal -1055.1193
17              Johnson SB -1053.0655
19                 4P Beta -1051.5507
11 Generalized Trapezoidal -1049.0981
1                   Normal -1047.0616
5        Scaled/shifted t- -1046.9436
18              3P Weibull -1005.1263
12                   Gamma  -998.3562
6                 Logistic  -984.4944
21               von Mises  -947.3471
8               Triangular  -941.3394
9              Trapezoidal  -940.3361
4               Log-normal  -929.9566
15                  Gumbel  -809.9676
14                 Laplace  -733.8516
13                  Cauchy  -650.7236
10 Curvilinear Trapezoidal  -581.6016
7                  Uniform  -391.3259
20                 Arcsine  -261.7417
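To make the AIC ranking concrete, here is a minimal sketch of how AIC can be computed by hand as 2k - 2·logLik (with k the number of estimated parameters) and how models are compared by their difference from the smallest AIC. It uses fitdistr() from MASS rather than propagate, so it only mimics the idea on two candidate distributions:

```r
library(MASS)

# Same simulated data as above
set.seed(275)
observations <- rnorm(10000, 5)

# Fit two candidate distributions and compute AIC = 2k - 2 * log-likelihood
fits <- lapply(c("normal", "logistic"),
               function(d) fitdistr(observations, d))
aic <- sapply(fits, function(f) 2 * length(f$estimate) - 2 * f$loglik)
names(aic) <- c("normal", "logistic")

# Delta AIC: the preferred model has a delta of 0
aic - min(aic)
```

As expected for normally distributed data, the normal fit has the smaller AIC and a delta of zero.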
Alternatively, you can use the fitDistr() function to fit the results obtained from the propagate() function. Consider the following example. Suppose we want to propagate uncertainty through the expression x^(3y) - 1, where x has a mean of 6 with a standard deviation of 0.1, and y has a mean of 2 with a standard deviation of 0.1.
First, we enter the equation in R using the expression() function and store the mean and standard deviation for the x and y variables in separate numeric vectors, as follows:
> EXPR <- expression(x^(3 * y) - 1)
> x <- c(6, 0.1)
> y <- c(2, 0.1)
Then, we create a matrix that stores the means in the first row and the standard deviations in the second row for each variable, as follows:
> DF <- cbind(x, y)
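Because each input vector already holds (mean, sd), cbind() yields exactly the two-row layout that propagate() expects. A quick self-contained check:

```r
# Each column is one variable: first row holds the means (6, 2),
# second row holds the standard deviations (0.1, 0.1).
x <- c(6, 0.1)
y <- c(2, 0.1)
DF <- cbind(x, y)
DF
```

Printing DF confirms that the means sit in row 1 and the standard deviations in row 2, one column per variable.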
Next, we use the propagate() function to generate the kernel density of the Monte Carlo simulation results to be used in the fitDistr() function, as follows:
> RES <- propagate(expr = EXPR, data = DF, type = "stat", do.sim = TRUE, verbose = TRUE)
Now, we can fit our data to the 21 distributions with the fitDistr() function, as shown in the following code:
> testedDistrEXPR <- fitDistr(RES)
Fitting Normal distribution...Done.
Fitting Skewed-normal distribution...Done.
Fitting Generalized normal distribution...........10.........20.......Done.
Fitting Log-normal distribution...Done.
Fitting Scaled/shifted t- distribution...Done.
Fitting Logistic distribution...Done.
Fitting Uniform distribution...Done.
Fitting Triangular distribution...Done.
Fitting Trapezoidal distribution...Done.
Fitting Curvilinear Trapezoidal distribution...Done.
Fitting Generalized Trapezoidal distribution...Done.
Fitting Gamma distribution...Done.
Fitting Cauchy distribution...Done.
Fitting Laplace distribution...Done.
Fitting Gumbel distribution...Done.
Fitting Johnson SU distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting Johnson SB distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting 3P Weibull distribution...........10.........20.......Done.
Fitting 4P Beta distribution...Done.
Fitting Arcsine distribution...Done.
Fitting von Mises distribution...Done.
The plot generated for the best distribution fitted is shown in the following graph:

The AIC values for all the distributions tested are as follows:
> testedDistrEXPR$aic
              Distribution       AIC
3       Generalized normal -6682.103
16              Johnson SU -6680.099
4               Log-normal -6670.469
2            Skewed-normal -5847.043
18              3P Weibull -5691.254
12                   Gamma -5667.074
15                  Gumbel -5611.167
9              Trapezoidal -5541.594
5        Scaled/shifted t- -5284.617
6                 Logistic -5274.231
1                   Normal -5256.894
14                 Laplace -5220.417
13                  Cauchy -5216.931
11 Generalized Trapezoidal -5105.354
20                 Arcsine -4674.931
7                  Uniform -4661.630
8               Triangular -4661.388
10 Curvilinear Trapezoidal -4603.581
21               von Mises -4603.451
17              Johnson SB -4602.456
19                 4P Beta -4599.451