Monte Carlo simulations are most commonly used to understand the properties of a particular statistic such as the mean, or an estimator like maximum likelihood (ML) regression methods.
The principle is straightforward. Create a data set with a known correlation or covariance structure. Then add in some random error, and estimate your statistic or model.
Replicate this process 1,000 or 10,000 times – collecting the relevant information from each trial – and you’ll have a nice sampling distribution with which to evaluate the properties of your model or statistic.
The replication can be accomplished easily enough with a
In this article, you’ll find out how to accomplish the other part of the task: creating a data set with a known correlation structure.
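As a rough illustration of the replication step described above, the sketch below wraps one trial in a program and repeats it with Stata's -simulate- prefix. The program name mcdemo, the outcome equation (y = 1 + 2*a + error), and the 1,000 replications are assumptions made for this example only. The -corr2data- step it relies on is the one the rest of this article explains.

```stata
// One Monte Carlo trial: build the data, add random error, estimate.
// mcdemo and the outcome equation are hypothetical, for illustration only.
capture program drop mcdemo
program define mcdemo, rclass
    drop _all
    matrix m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
    set obs 100
    corr2data a b c, corr(m)
    generate y = 1 + 2*a + rnormal()   // known slope of 2 plus random error
    regress y a b c
    return scalar b_a = _b[a]          // keep the estimate from this trial
end

// Repeat the trial 1,000 times and collect the coefficient on a
simulate b_a = r(b_a), reps(1000) seed(12345): mcdemo
summarize b_a
```

After -simulate- finishes, the data set in memory holds one row per trial, so -summarize- (or a histogram) describes the sampling distribution of the estimate.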
Create the Correlation Matrix
The first thing you need to do to create your data set is decide what you want the correlation or covariance matrix to look like.
I’m going to create a correlation matrix here, since that seems to be easier for most people to think about. Creating a covariance matrix uses the same -matrix- command, but requires a different command option later in the procedure.
Let’s make the following square correlation matrix:
    1   .3  .4
    .3  1   0
    .4  0   1
To do this, you need to use the -matrix input- command as such:
    clear
    matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
    matrix list m
In this command, we name the matrix m for later reference. Also notice that column elements are separated by commas (e.g. 1, .3, .4), and rows are separated by backslashes (\). You can leave out the -input- subcommand if you want and just use -matrix m-, but for programs that use a lot of matrix commands, I prefer to keep the input statement.
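To make the two forms concrete, here is a small sketch that builds the same matrix with and without -input- and then runs two sanity checks on it. The names m2, X, and v are hypothetical, and the checks (symmetry and eigenvalues) are an optional addition rather than part of the procedure in this article.

```stata
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)

// Same matrix without the input subcommand (m2 is a hypothetical name)
matrix m2 = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
display mreldif(m, m2)        // 0 when the two matrices are identical

// A usable correlation matrix is symmetric with no negative eigenvalues
display issymmetric(m)        // 1 if m is symmetric
matrix symeigen X v = m       // v holds the eigenvalues, largest first
matrix list v
```

If the last entry of v were negative, the matrix could not be the correlation matrix of any real data, so it is worth checking before going further.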
The -matrix list- command will display the contents of m afterward to verify the result, seen in the following figure:
Convert the Correlation Matrix to Data
Now that you have created a correlation matrix, you need to convert it into usable data points. I do that with the following code:
    set seed 12345
    set obs 100
    corr2data a b c, corr(m)
The first line (-set seed-) sets my standard random number seed so you can replicate the results shown here. The next line (-set obs-) defines a data set with 100 observations. The number of observations needs to be defined before we convert the correlation matrix so Stata will know how many data points to create.
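The last line above (-corr2data-) is the critical command in this process. This command tells Stata to make three random normal variates, named a, b, and c. The -corr()- option tells Stata to define these variables using the correlation structure in matrix m. Note that -corr2data- constructs the variables so that their sample correlations match m exactly (up to storage precision). If you instead need a random sample drawn from a population with this correlation structure, see Stata's -drawnorm- command.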
If you want to use a covariance matrix instead of a correlation matrix, creating the matrix uses the same steps. The only difference is that you need to use the -cov()- option instead of -corr()-.
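A minimal sketch of the covariance version follows. The matrix v, its values (variances of 4 and 9, covariance of 1.2), and the variable names x and y are made up for this example.

```stata
clear
// Hypothetical 2x2 covariance matrix: variances 4 and 9, covariance 1.2
matrix input v = (4, 1.2\ 1.2, 9)
set seed 12345
set obs 100
corr2data x y, cov(v)

// Report covariances instead of correlations to check the result
corr x y, covariance
```

The -covariance- option of -corr- displays the variance-covariance matrix, which makes it easy to compare against v.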
Once you convert the matrix m into a data set using -corr2data-, you can verify the results with the following commands:
    list in 1/10, sep(0)
    corr a b c
Your output should now list the first ten observations, followed by a correlation table whose entries match the matrix m.
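If you would rather check the match programmatically than read it off the table, one possible approach is to compare the matrix that -corr- stores in r(C) against m. This is only a sketch; the matrix name R is hypothetical.

```stata
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
set seed 12345
set obs 100
corr2data a b c, corr(m)

// -corr- leaves the estimated correlation matrix in r(C)
quietly corr a b c
matrix R = r(C)               // R is a hypothetical name
display mreldif(R, m)         // close to 0 when the structure was reproduced
```

A value near zero means the generated variables carry the correlation structure you asked for.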
Do I Need to Use a Square Matrix?
It can take some time to enter the commands to make a square matrix, since you are entering nearly all of the values in twice. It would certainly be nicer if you could simply provide an upper or lower triangular matrix. -corr2data- allows for these options with the -cstorage()- option. Here’s how to do it.
Begin by creating a new matrix object:
    clear
    matrix input n = (1, .3, 1, .4, 0, 1)
    matrix list n
Notice that the matrix n has the same correlation structure as m, but no longer has the row separators. Instead, we simply define the lower triangle using a row vector.
Now you can convert the row vector into data, and tell Stata that the vector represents a lower triangular matrix, using the following commands:
    set seed 12345
    set obs 100
    corr2data d e f, corr(n) cstorage(lower)
Again, we set the random number seed to get the same results as before, and define 100 observations. The -corr2data- command instructs Stata to create three variables, d, e, and f, using the correlation matrix n. However, now we include an additional option: -cstorage(lower)-. This option lets Stata know that the row vector n represents a lower triangular matrix with the following form:
    1
    .3  1
    .4  0   1
If for some reason you have a penchant for upper triangular matrices, you can use those with the -cstorage(upper)- option in -corr2data-. However, be aware that you would need to create the initial row vector with the following matrix command:
    matrix input o = (1, .3, .4, 1, 0, 1)
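As a quick check that the lower and upper forms describe the same structure, the sketch below generates one set of variables from each vector and compares the resulting correlation matrices. The matrix names Rl and Ru are hypothetical.

```stata
clear
matrix input n = (1, .3, 1, .4, 0, 1)      // lower triangle, row by row
matrix input o = (1, .3, .4, 1, 0, 1)      // upper triangle, row by row
set seed 12345
set obs 100
corr2data d e f, corr(n) cstorage(lower)
corr2data g h i, corr(o) cstorage(upper)   // adds to the existing data set

// Compare the two sample correlation matrices
quietly corr d e f
matrix Rl = r(C)
quietly corr g h i
matrix Ru = r(C)
display mreldif(Rl, Ru)                    // close to 0 if the forms agree
```

Note that the second -corr2data- call simply adds three more variables to the 100 observations already in memory.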
Gimme the Code!
Once you’ve got your random variates with a known correlation structure created, you can convert their formats to match your needs.
To make it easier for you to experiment with, the complete code for creating data using a square, lower triangle, or upper triangle matrix is given below.
    // Example 1
    // Create data set with known correlation structure
    // Use a square matrix
    clear
    matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
    matrix list m
    set seed 12345
    set obs 100
    corr2data a b c, corr(m)
    // show data and verify correlation
    list in 1/10, sep(0)
    corr a b c

    // Example 2
    // Create data set with known correlation structure
    // Use a lower triangle matrix...note the '\' are removed
    clear
    matrix input n = (1, .3, 1, .4, 0, 1)
    matrix list n
    set seed 12345
    set obs 100
    corr2data d e f, corr(n) cstorage(lower)
    // show data and verify correlation
    list in 1/10, sep(0)
    corr d e f

    // Example 3
    // Create data set with known correlation structure
    // Use an upper triangle matrix...note the '\' are removed
    clear
    matrix input o = (1, .3, .4, 1, 0, 1)
    matrix list o
    set seed 12345
    set obs 100
    corr2data g h i, corr(o) cstorage(upper)
    // show data and verify correlation
    list in 1/10, sep(0)
    corr g h i