Simulating Data with a Known Correlation Structure in Stata

Monte Carlo simulations are most commonly used to understand the properties of a particular statistic, such as the mean, or of an estimator, such as maximum likelihood (ML) regression.

The principle is straightforward. Create a data set with a known correlation or covariance structure. Then add in some random error, and estimate your statistic or model.

Replicate this process 1,000 or 10,000 times – collecting the relevant information from each trial – and you’ll have a nice sampling distribution with which to evaluate the properties of your model or statistic.

The replication can be accomplished easily enough with a -forvalues- loop.
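As a rough sketch of what such a loop might look like, the code below runs 1,000 trials and collects one regression coefficient from each. It assumes the correlation matrix m and the -corr2data- call explained later in this article. The outcome y, its true slope of 0.5, and the names sim, results, and b_a are hypothetical choices made only for illustration. Results are collected with -postfile-, which is one of several ways to store them.

```stata
// Sketch only: y, the 0.5 slope, and the names sim/results/b_a are illustrative
clear all
set seed 12345
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)

tempname sim
tempfile results
postfile `sim' rep b_a using `results'

forvalues i = 1/1000 {
    quietly {
        clear
        set obs 100
        corr2data a b c, corr(m)            // predictors with known correlations
        generate y = 1 + 0.5*a + rnormal()  // add random error
        regress y a b c                     // estimate the model
    }
    post `sim' (`i') (_b[a])                // keep the coefficient on a
}
postclose `sim'

// The collected coefficients form the sampling distribution
use `results', clear
summarize b_a
```

The mean of b_a should sit close to the true value of 0.5, and its standard deviation estimates the standard error of the coefficient.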

In this article, you’ll find out how to accomplish the other part of the task: creating a data set with a known correlation structure.

Create the Correlation Matrix

The first thing you need to do to create your data set is decide what you want the correlation or covariance matrix to look like.

I’m going to create a correlation matrix here, since that seems to be easier for most people to think about. Creating a covariance matrix uses the same -matrix- command here, but will require a different command option later on during the procedure.

Let’s make the following square correlation matrix:

1	.3	.4

.3	1	0

.4	0	1

To do this, you need to use the -matrix input- command as such:

clear
matrix input m = (1, .3, .4\  .3, 1, 0\ .4, 0, 1)
matrix list m

In this command, we name the matrix m for later reference. Also notice that column elements are separated by commas (e.g. 1, .3, .4), and rows are separated by backslashes (\). You can leave out the -input- subcommand if you want and just type -matrix m = (...)-, but for programs that use a lot of matrix commands, I prefer to keep the input statement.
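Before going further, it can be worth checking that the matrix you typed is a usable correlation matrix, meaning it is symmetric and has no negative eigenvalues. The following is a minimal sketch of such a check using the -issymmetric()- function and -matrix symeigen-. The names X and v are arbitrary and chosen here only for illustration.

```stata
// Sketch: sanity-check a hand-entered correlation matrix
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)

// 1 if m equals its transpose, 0 otherwise
display "symmetric? " issymmetric(m)

// Eigenvectors go into X, eigenvalues (largest first) into the row vector v
matrix symeigen X v = m
matrix list v

// The last element of v is the smallest eigenvalue; it should not be negative
display "smallest eigenvalue = " v[1, colsof(v)]
```

For this particular matrix the eigenvalues work out to 1.5, 1, and 0.5, so the matrix is positive definite.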

The -matrix list- command will display the contents of m afterward to verify the result, seen in the following figure:

[Figure: Stata correlation matrix]

Convert the Correlation Matrix to Data

Now that you have created a correlation matrix, you need to convert it into usable data points. I do that with the following code:

set seed 12345
set obs 100
corr2data a b c, corr(m)

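The first line (-set seed-) sets the random number seed so that any random numbers you generate afterward can be replicated. Note that, according to the Stata documentation, -corr2data- uses its own generator, controlled by its -seed()- option (default 0), so its output is reproducible either way. The next line (-set obs-) defines a data set with 100 observations. The number of observations needs to be defined before we convert the correlation matrix so Stata will know how many data points to create.

The last line above (-corr2data-) is the critical command in this process. This command tells Stata to create three new variables, named a, b, and c. The -corr()- option tells Stata to build these variables so that their sample correlations match the matrix m exactly. If you instead want a random sample drawn from a population with that correlation structure, Stata’s -drawnorm- command does that.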

If you want to use a covariance matrix instead of a correlation matrix, creating the matrix uses the same steps. The only difference is that you need to use the -cov()- option instead of -corr()-.
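As a brief illustration of the covariance route, the sketch below uses a hypothetical 2 x 2 covariance matrix v (variances 4 and 9, covariance 1.8, which corresponds to a correlation of .3). The names v, x, and y are made up for this example. Run it separately from the main example, since it begins with -clear-.

```stata
// Sketch: same procedure, but starting from a covariance matrix
clear
matrix input v = (4, 1.8\ 1.8, 9)
matrix list v
set obs 100
corr2data x y, cov(v)

// Display the sample covariance matrix to verify the result
correlate x y, covariance
```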

Once you convert the matrix m into a data set using -corr2data-, you can verify the results with the following commands:

list in 1/10, sep(0)
corr a b c

Your output should look like this now:

[Figure: Stata data list and correlation matrix]
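For comparison, here is a minimal sketch of the same task using -drawnorm-, which draws a random sample from a multivariate normal population with the given correlation structure. The sample correlations will therefore be close to, but not exactly equal to, the values in m. The variable names x1, x2, and x3 are arbitrary. Run it separately from the main example, since it begins with -clear-.

```stata
// Sketch: random draws from a population with correlation matrix m
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
set seed 12345
set obs 100
drawnorm x1 x2 x3, corr(m)

// Sample correlations vary around the population values in m
correlate x1 x2 x3
```

Which command fits better depends on the goal: -corr2data- reproduces the target matrix exactly, while -drawnorm- gives a fresh sample on every call.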

Do I Need to Use a Square Matrix?

It can take some time to enter the commands to make a square matrix, since you are entering nearly all of the values twice. It would certainly be nicer if you could simply provide an upper or lower triangular matrix.

Fortunately, -corr2data- supports both through its -cstorage()- option. Here’s how to do it.

Begin by creating a new matrix object:

clear
matrix input n = (1, .3, 1, .4, 0, 1)
matrix list n

Notice that n encodes the same correlation structure as m, but no longer has the row separators. Instead, we simply list the elements of the lower triangle, row by row, in a single row vector.

Now you can convert the row vector into data, and tell Stata that the vector represents a lower triangular matrix using the following command:

set seed 12345
set obs 100
corr2data d e f, corr(n) cstorage(lower)

Again, we set the random number seed and define 100 observations, just as before. The -corr2data- command instructs Stata to create three variables, d, e, and f, using the correlation matrix n.

However, now we include an additional option: -cstorage(lower)-. This option lets Stata know that the row vector n represents a lower triangular matrix with the following form:

1	

.3	1	

.4	0	1

If for some reason you have a penchant for upper triangular matrices, you can use those with the -cstorage(upper)- option in -corr2data-. However, be aware that you would need to create the initial row vector with the following matrix command:

matrix input o = (1, .3, .4, 1, 0, 1)
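To see that the three storage formats describe the same structure, the sketch below builds all three sets of variables in one data set and prints their correlation matrices. It reuses the matrix and variable names from this article and relies on -corr2data- adding new variables to the data set already in memory when the number of observations is unchanged.

```stata
// Sketch: square, lower, and upper storage side by side
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
matrix input n = (1, .3, 1, .4, 0, 1)
matrix input o = (1, .3, .4, 1, 0, 1)
set obs 100

corr2data a b c, corr(m)
corr2data d e f, corr(n) cstorage(lower)
corr2data g h i, corr(o) cstorage(upper)

// Each correlation matrix should match the target values
correlate a b c
correlate d e f
correlate g h i
```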

Gimme the Code!

Once you have created your variables with a known correlation structure, you can transform them to match your needs.

To make it easier for you to experiment, the complete code for creating data using a square, lower triangle, or upper triangle matrix is given below.
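One common adjustment is rescaling. The sketch below turns variable a into a hypothetical score with mean 50 and standard deviation 10. The name score and those target values are assumptions made for illustration. Because this is a positive linear transformation, the correlations with b and c are unchanged.

```stata
// Sketch: rescale a generated variable without disturbing its correlations
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
set obs 100
corr2data a b c, corr(m)

// By default the generated variables have mean 0 and standard deviation 1
generate score = 50 + 10*a

// Correlations with b and c are the same as they were for a
correlate score b c
summarize score
```

If you know the target means and standard deviations in advance, -corr2data- also accepts -means()- and -sds()- options that set them when the variables are created.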

Happy coding!

// Example 1
// Create data set with known correlation structure
// Use a square matrix
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
matrix list m
set seed 12345
set obs 100
corr2data a b c, corr(m)

// show data and verify correlation
list in 1/10, sep(0)
corr a b c

// Example 2
// Create data set with known correlation structure
// Use a lower triangle matrix...note the '\' are removed
clear
matrix input n = (1, .3, 1, .4, 0, 1)
matrix list n
set seed 12345
set obs 100
corr2data d e f, corr(n) cstorage(lower)

// show data and verify correlation
list in 1/10, sep(0)
corr d e f

// Example 3
// Create data set with known correlation structure
// Use an upper triangle matrix...note the '\' are removed
clear
matrix input o = (1, .3, .4, 1, 0, 1)
matrix list o
set seed 12345
set obs 100
corr2data g h i, corr(o) cstorage(upper)

// show data and verify correlation
list in 1/10, sep(0)
corr g h i
