- Mastering Scientific Computing with R
- Paul Gerrard, Radia M. Johnson
Fitting distributions
Now that we have seen how to plot and obtain statistical information from probability distributions, we will show you how to use R to fit your data to theoretical distributions. There are several methods you can use to test whether your sample distribution fits a theoretical distribution. For example, you may want to see whether your sample fits a normal distribution using a quantile-quantile plot (Q-Q plot). In R, you can use the qqnorm() function to create a Q-Q plot to evaluate the data. R also has a more generic version of this function, qqplot(), which creates Q-Q plots for any theoretical distribution. To illustrate the use of these functions, let's create Q-Q plots to test whether the gene expression values of probeA follow a normal or gamma distribution.
First, let's set the plot settings to display two figures in the same layout:
> par(mfrow=c(1,2))
Use qqnorm() to compare the data to a normal distribution:
> qqnorm(probeA)
Add the theoretical line for a normal distribution using the default first and third quartiles:
> qqline(probeA, distribution = qnorm, probs = c(0.25, 0.75))
To examine the fit of our data to a gamma distribution, let's estimate the shape and rate parameters using the fitdistr() function from the MASS package:
> require("MASS")
> fitdistr(probeA, 'gamma')
      shape        rate
  341.75868    71.58950
 ( 74.52444) ( 15.62234)
Let's store the gamma parameters in an object for future use:
> gamma.par <- fitdistr(probeA, 'gamma')
To know what is stored in our gamma.par object, we can use the str() function we mentioned earlier in Chapter 1, Programming with R. Let's have a look at the following output:
> str(gamma.par)
List of 5
 $ estimate: Named num [1:2] 341.8 71.6
  ..- attr(*, "names")= chr [1:2] "shape" "rate"
 $ sd      : Named num [1:2] 74.5 15.6
  ..- attr(*, "names")= chr [1:2] "shape" "rate"
 $ vcov    : num [1:2, 1:2] 5554 1163 1163 244
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "shape" "rate"
  .. ..$ : chr [1:2] "shape" "rate"
 $ loglik  : num -2.37
 $ n       : int 42
 - attr(*, "class")= chr "fitdistr"
By inspecting the structure of the gamma.par object, we now know how to access the shape and rate values. We can index the named numeric vector gamma.par$estimate either by name or by position, as follows:
> gamma.par$estimate['shape'] # or gamma.par$estimate[1]
   shape
341.7587
> s <- gamma.par$estimate['shape']
> r <- gamma.par$estimate['rate']
Now, we can calculate the theoretical quantiles for a gamma distribution with the estimated values for the shape and rate, as shown:
> theoretical.probs <- seq(1:length(probeA))/(length(probeA)+1)
> theoretical.quantiles <- qgamma(theoretical.probs, shape=s, rate=r)
> plot(theoretical.quantiles, sort(probeA), xlab="Theoretical Quantiles",
    ylab="Sample Quantiles", main="Gamma QQ-plot")
Now, let's add a line through the first and third quartiles using the qqline() function. Since we are not using the default qnorm function for the distribution argument, we need to define our own quantile function with qgamma() and the appropriate shape and rate settings, as follows:
> qF <- function(p) qgamma(p, shape=s, rate=r)
> qqline(y=sort(probeA), distribution=qF)
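As a side note, the generic qqplot() function mentioned earlier can build an equivalent Gamma Q-Q plot in a single call. The following is a self-contained sketch, not the book's exact data: probeA is simulated here as a stand-in, and s and r simply reuse the fitted values from above. ppoints() generates plotting probabilities much like seq(...)/(n + 1), though with a slightly different convention:

```r
# Stand-in data and parameters so this sketch runs on its own; in the text,
# probeA, s, and r come from the earlier steps.
set.seed(1)
probeA <- rgamma(42, shape = 341.76, rate = 71.59)
s <- 341.76
r <- 71.59

# Theoretical gamma quantiles at evenly spread probabilities
theoretical <- qgamma(ppoints(length(probeA)), shape = s, rate = r)

# qqplot() sorts the sample and plots it against the theoretical quantiles
qqplot(theoretical, probeA,
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       main = "Gamma QQ-plot")
```

This reproduces the manual construction above in one step; the manual version is still useful when you want full control over the probability sequence.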
The resulting Normal Q-Q plot and Gamma Q-Q plot are shown in the following diagram:

Before going any further, let's return our plot parameter settings to their defaults using the par() function:
> par(mfrow=c(1,1))
Instead of Q-Q plots, you may choose to use a well-defined statistical test for fitting distributions, such as the Kolmogorov-Smirnov, Anderson-Darling, or chi-square test. For example, let's use the Kolmogorov-Smirnov method to test whether our probeA data fits a gamma distribution with a shape of three and a rate of two using the ks.test() function, as follows:
> ks.test(probeA, "pgamma", 3, 2)

        One-sample Kolmogorov-Smirnov test

data:  probeA
D = 0.9901, p-value = 2.22e-16
alternative hypothesis: two-sided
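The tiny p-value here largely reflects how far the arbitrary parameters (shape 3, rate 2) are from the data. A minimal sketch of testing against the fitted parameters instead is shown below; probeA is simulated as a stand-in since the book's data is not reproduced here. Note that ks.test() p-values are only approximate when the parameters were estimated from the same data:

```r
library(MASS)

# Stand-in data for illustration; in the text, probeA comes from earlier steps.
set.seed(1)
probeA <- rgamma(42, shape = 341.76, rate = 71.59)

# Fit the gamma parameters, then test against the fitted distribution
fit <- fitdistr(probeA, "gamma")
ks.test(probeA, "pgamma",
        shape = fit$estimate["shape"], rate = fit$estimate["rate"])
```

With the fitted parameters, the test statistic D is small and the p-value large, consistent with the gamma fit.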
Alternatively, we can use the Anderson-Darling test to check whether our probeA dataset fits a normal distribution using the ad.test() function available in the nortest package:
> require("nortest")
> ad.test(probeA)

        Anderson-Darling normality test

data:  probeA
A = 0.378, p-value = 0.3923
Higher order moments of a distribution
Higher order moments of a sample distribution are often used to estimate shape parameters, such as skewness and kurtosis, or to measure the deviation of a sample distribution from a normal distribution. For example, we might want to know whether the skew we observed in our probeA distribution has a long drawn-out tail to the left (negative skewing) or to the right (positive skewing) compared to a normal curve. You can quantify the degree of skewing using the skewness() function available in the fBasics package and test whether the degree of skewing you calculate from your dataset is significantly different from the theoretical skew of a normal distribution, which is zero.
Let's have a look at the following code:
> require("fBasics")
> skewness(probeA)
[1] -0.1468461
attr(,"method")
[1] "moment"
Now, we can determine whether the -0.147 skew is significantly different from zero by performing a t-test. The approximate standard error of a skew is the square root of 6 divided by the total number of samples, that is, sqrt(6/n):
> abs(skewness(probeA))/sqrt(6/length(probeA))
[1] 0.3885183
attr(,"method")
[1] "moment"
Now, we just need to calculate the probability of obtaining a t-value of 0.389 by chance when the skew value is truly zero:
> 1 - pt(0.389, 41)
[1] 0.3496446
Since the p-value is greater than 0.05, we cannot reject the null hypothesis, and we conclude that the skew is not significantly different from zero. Thus, the skew is not significantly different from that of a normal curve.
Instead of calculating the skew, we can also evaluate the kurtosis value to get an idea of the peaked or flat-topped nature of the distribution. A normal curve has an excess kurtosis value of zero, and the approximate standard error of kurtosis is the square root of 24 divided by the total number of samples, that is, sqrt(24/n). To obtain the kurtosis value in R, we can use the kurtosis() function, which is also available in the fBasics package:
> kurtosis(probeA)
[1] -0.5670091
attr(,"method")
[1] "excess"
> abs(kurtosis(probeA))/sqrt(24/length(probeA))
[1] 0.7500826
attr(,"method")
[1] "excess"
> 1 - pt(0.750, 41)
[1] 0.2287686
Since the p-value is greater than 0.05, we cannot reject the null hypothesis, and we conclude that the kurtosis of our sample distribution is not significantly different from that of a normal distribution.
Other statistical tests to fit distributions
R also has many other functions available to help fit your sample distribution to other types of distributions. For convenience, you can find a list of R functions available for fitting distributions in the following table. You can also consult Vito Ricci's document on fitting distributions at http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf for more information on how to use these functions.

You may also find the propagate package particularly useful because it allows you to fit your data to many distributions at once and tells you which one is the most appropriate. For example, we can use the fitDistr() function to fit your data to a variety of distributions. The fits are sorted by ascending Akaike information criterion (AIC) value, where the preferred model is the one with the minimum AIC. The AIC takes into account the trade-off between the goodness of fit of the model and its complexity. Let's have a look at the following example:
> install.packages("propagate")
> library("propagate")
The fitDistr() function allows you to use a vector of observations or an object generated using the propagate() function. In our first example, let's use the rnorm() function to simulate 10,000 numerical observations. Let's have a look at the following example:
> set.seed(275) # so you get the same results
> observations <- rnorm(10000, 5)
> distTested <- fitDistr(observations)
Fitting Normal distribution...Done.
Fitting Skewed-normal distribution...Done.
Fitting Generalized normal distribution...........10.........20.......Done.
Fitting Log-normal distribution...Done.
Fitting Scaled/shifted t- distribution...Done.
Fitting Logistic distribution...Done.
Fitting Uniform distribution...Done.
Fitting Triangular distribution...Done.
Fitting Trapezoidal distribution...Done.
Fitting Curvilinear Trapezoidal distribution...Done.
Fitting Generalized Trapezoidal distribution...Done.
Fitting Gamma distribution...Done.
Fitting Cauchy distribution...Done.
Fitting Laplace distribution...Done.
Fitting Gumbel distribution...Done.
Fitting Johnson SU distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting Johnson SB distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting 3P Weibull distribution...........10.........20.......Done.
Fitting 4P Beta distribution...Done.
Fitting Arcsine distribution...Done.
Fitting von Mises distribution...Done.
As you can see, the fitDistr() function automatically plots the best distribution based on the AIC value. To disable this feature, you can set plot = FALSE. The plot is shown in the following graph:

You can see the AIC values for the other distributions by accessing the aic data frame of the distTested object. Let's have a look at the following example:
> distTested$aic
              Distribution        AIC
2            Skewed-normal -1056.7637
16              Johnson SU -1055.2898
3       Generalized normal -1055.1193
17              Johnson SB -1053.0655
19                 4P Beta -1051.5507
11 Generalized Trapezoidal -1049.0981
1                   Normal -1047.0616
5        Scaled/shifted t- -1046.9436
18              3P Weibull -1005.1263
12                   Gamma  -998.3562
6                 Logistic  -984.4944
21               von Mises  -947.3471
8               Triangular  -941.3394
9              Trapezoidal  -940.3361
4               Log-normal  -929.9566
15                  Gumbel  -809.9676
14                 Laplace  -733.8516
13                  Cauchy  -650.7236
10 Curvilinear Trapezoidal  -581.6016
7                  Uniform  -391.3259
20                 Arcsine  -261.7417
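To make the AIC ranking concrete, here is a minimal sketch of how AIC can be computed by hand as 2k - 2·logLik (with k the number of estimated parameters) and how models are compared by their difference from the smallest AIC. It uses fitdistr() from MASS rather than propagate, so it only mimics the idea on two candidate distributions:

```r
library(MASS)

# Same simulated data as above
set.seed(275)
observations <- rnorm(10000, 5)

# Fit two candidate distributions and compute AIC = 2k - 2 * log-likelihood
fits <- lapply(c("normal", "logistic"),
               function(d) fitdistr(observations, d))
aic <- sapply(fits, function(f) 2 * length(f$estimate) - 2 * f$loglik)
names(aic) <- c("normal", "logistic")

# Delta AIC: the preferred model has a delta of 0
aic - min(aic)
```

As expected for normally distributed data, the normal fit has the smaller AIC and a delta of zero.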
Alternatively, you can use the fitDistr() function to fit the results obtained from the propagate() function. Consider the following example. Suppose we want to propagate uncertainty through the expression x^(3y) - 1, where x has a mean of 6 with a standard deviation of 0.1, and y has a mean of 2 with a standard deviation of 0.1.
First, we enter the equation in R using the expression() function and store the mean and standard deviation for the x and y variables in separate numeric vectors, as follows:
> EXPR <- expression(x^(3 * y) - 1)
> x <- c(6, 0.1)
> y <- c(2, 0.1)
Then, we create a matrix that stores the means in the first row and the standard deviations in the second row for each variable, as follows:
> DF <- cbind(x, y)
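Because each input vector already holds (mean, sd), cbind() yields exactly the two-row layout that propagate() expects. A quick self-contained check:

```r
# Each column is one variable: first row holds the means (6, 2),
# second row holds the standard deviations (0.1, 0.1).
x <- c(6, 0.1)
y <- c(2, 0.1)
DF <- cbind(x, y)
DF
```

Printing DF confirms that the means sit in row 1 and the standard deviations in row 2, one column per variable.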
Next, we use the propagate() function to generate the kernel density of the Monte Carlo simulation results to be used in the fitDistr() function, as follows:
> RES <- propagate(expr = EXPR, data = DF, type = "stat", do.sim = TRUE, verbose = TRUE)
Now, we can fit our data to the 21 distributions with the fitDistr() function, as shown in the following code:
> testedDistrEXPR <- fitDistr(RES)
Fitting Normal distribution...Done.
Fitting Skewed-normal distribution...Done.
Fitting Generalized normal distribution...........10.........20.......Done.
Fitting Log-normal distribution...Done.
Fitting Scaled/shifted t- distribution...Done.
Fitting Logistic distribution...Done.
Fitting Uniform distribution...Done.
Fitting Triangular distribution...Done.
Fitting Trapezoidal distribution...Done.
Fitting Curvilinear Trapezoidal distribution...Done.
Fitting Generalized Trapezoidal distribution...Done.
Fitting Gamma distribution...Done.
Fitting Cauchy distribution...Done.
Fitting Laplace distribution...Done.
Fitting Gumbel distribution...Done.
Fitting Johnson SU distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting Johnson SB distribution...........10.........20.........30.........40.........50.........60.........70.........80.Done.
Fitting 3P Weibull distribution...........10.........20.......Done.
Fitting 4P Beta distribution...Done.
Fitting Arcsine distribution...Done.
Fitting von Mises distribution...Done.
The plot generated for the best distribution fitted is shown in the following graph:

The AIC values for all the distributions tested are as follows:
> testedDistrEXPR$aic
              Distribution       AIC
3       Generalized normal -6682.103
16              Johnson SU -6680.099
4               Log-normal -6670.469
2            Skewed-normal -5847.043
18              3P Weibull -5691.254
12                   Gamma -5667.074
15                  Gumbel -5611.167
9              Trapezoidal -5541.594
5        Scaled/shifted t- -5284.617
6                 Logistic -5274.231
1                   Normal -5256.894
14                 Laplace -5220.417
13                  Cauchy -5216.931
11 Generalized Trapezoidal -5105.354
20                 Arcsine -4674.931
7                  Uniform -4661.630
8               Triangular -4661.388
10 Curvilinear Trapezoidal -4603.581
21               von Mises -4603.451
17              Johnson SB -4602.456
19                 4P Beta -4599.451