If your favorite isn’t on the list, I’m sorry…I can only do so much.
One thing to keep in mind about these examples is that most software packages use floating point arithmetic (FPA).
I won’t get into exactly what this means in this post. Just know that FPA will generally result in some rounding errors with highly precise numbers (i.e. lots of decimal places). However, below 16 decimal places, you can be reasonably assured that these packages return the same values.
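To see what FPA rounding looks like in practice, here is a quick sketch (shown in Stata, one of the packages covered below; the exact digits displayed may vary by package and version, so I won't claim a specific result):

```stata
* The decimal 0.1 has no exact binary representation, so displaying it
* at full double precision exposes a tiny rounding error in the last digits.
display %20.18f 0.1
```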
=PI()
Note this function does not have any arguments. The value returned is accurate to 14 decimal places.
>pi
This returns pi to 6 decimal places. If you need more precision, you can get up to 15 decimal places with the following code (the integer 3 is the 16th digit):
> options(digits = 16)
> pi
The digits option can go as high as 22, but the default R algorithm is only accurate up to 15 decimal places (see http://www.joyofpi.com/pi.html).
For greater precision, I recommend using the Rmpfr package. I set it to 256-bit precision and achieved accuracy up to 75 decimal places.
. di c(pi)
or
. di _pi
As with R, the default precision is 6 decimal places. If you need to increase the precision, you can format the constant for up to 16 decimal places.
. di %19.0g _pi
I know less about the nuances of representing Pi in SAS. But my research in the SAS documentation suggests that pi can be stored with precision above 16 decimal places.
The basic code is:
data _null_;
  pi = constant('pi');
  put pi=;
run;
This may be the worst package to use for representing pi, as IBM still has not included pi as a system constant in the program. Instead, we get to make use of our knowledge of trigonometry (did you just cringe? I did.)…
If you dig back far enough in your memory, you might recall that the tangent of (pi/4) =1. Using the inverse tangent function (the arctangent), you can create a variable to represent pi:
compute pi = 4*ARTAN(1).
Hope you find this interesting and useful…Happy Pi Day!
Monte Carlo simulations are most commonly used to understand the properties of a particular statistic, such as the mean, or of an estimator, such as maximum likelihood (ML) regression methods.
The principle is straightforward. Create a data set with a known correlation or covariance structure. Then add in some random error, and estimate your statistic or model.
Replicate this process 1,000 or 10,000 times – collecting the relevant information from each trial – and you’ll have a nice sampling distribution with which to evaluate the properties of your model or statistic.
The replication can be accomplished easily enough with a -forvalues- loop.
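To give a flavor of where this article's piece fits into the larger simulation, here is a hedged sketch of what such a loop might look like. The matrix m is the correlation matrix built below, while the results file name, the coefficient values, and the data-generating equation are all placeholders of my own:

```stata
* Sketch of a Monte Carlo loop using -postfile- to collect results.
* Assumes the 3x3 correlation matrix m (defined below) already exists.
tempname sim
postfile `sim' b_a using mcresults, replace
forvalues i = 1/1000 {
    quietly {
        corr2data a b c, n(100) corr(m) clear     // known correlation structure
        gen y = 2 + .5*a + .3*b - .2*c + rnormal() // add random error
        regress y a b c
    }
    post `sim' (_b[a])                             // keep the slope on a
}
postclose `sim'
use mcresults, clear
summarize b_a                                      // sampling distribution of the slope
```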
In this article, you’ll find out how to accomplish the other part of the task: creating a data set with a known correlation structure.
The first thing you need to do to create your data set is decide what you want the correlation or covariance matrix to look like.
I’m going to create a correlation matrix here, since that seems to be easier for most people to think about. Creating a covariance matrix uses the same -matrix- command, but requires an extra option later on during the procedure.
Let’s make the following square correlation matrix:
1   .3  .4
.3  1   0
.4  0   1
To do this, you need to use the -matrix input- command, as such:
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
matrix list m
In this command, we name the matrix m for later reference. Also notice that column elements are separated by commas (e.g. 1, .3, .4), and rows are separated by backslashes (\). You can leave out the -input- keyword if you want and just use -matrix m-, but for programs that use a lot of matrix commands, I prefer to keep the input statement. The -matrix list- command will display the contents of m afterward to verify the result, seen in the following figure:
Now that you have a correlation matrix, you need to convert it into usable data points. I do that with the following code:
set seed 12345
set obs 100
corr2data a b c, corr(m)
The first line (-set seed-) sets my standard random number seed so you can replicate the results shown here. The next line (-set obs-) defines a data set with 100 observations. The number of observations needs to be defined before we convert the correlation matrix so Stata will know how many data points to create.
The last line above (-corr2data-) is the critical command in this process. This command tells Stata to make three random normal variates, named a, b, and c. The -corr()- option tells Stata to define these variables using the correlation structure in matrix m.
If you want to use a covariance matrix instead of a correlation matrix, creating the matrix uses the same steps. The only difference is that you need to use the -cov()- option instead of -corr()-.
Once you convert the matrix m into a data set using -corr2data-, you can verify the results with the following commands:
list in 1/10, sep(0)
corr a b c
Your output should look like this now:
It can take some time to enter the commands to make a square matrix, since you are entering nearly all of the values twice. It would certainly be nicer if you could simply provide an upper or lower triangular matrix.
Fortunately, -corr2data- allows for these alternatives with the -cstorage()- option. Here’s how to do it.
Begin by creating a new matrix object:
clear
matrix input n = (1, .3, 1, .4, 0, 1)
matrix list n
Notice that the matrix n has the same correlation structure as m, but no longer has the row separators. Instead, we simply define the lower triangle using a row vector.
Now you can convert the row vector into data, and tell Stata that the vector represents a lower triangular matrix, using the following command:
set seed 12345
set obs 100
corr2data d e f, corr(n) cstorage(lower)
Again, we set the random number seed to get identical results as before, and define 100 observations. The -corr2data- command instructs Stata to create three variables, d, e, and f, using the correlation matrix n.
However, now we include an additional option: -cstorage(lower)-. This option lets Stata know that the row vector n represents a lower triangular matrix with the following form:
1
.3  1
.4  0   1
If for some reason you have a penchant for upper triangular matrices, you can use those with the -cstorage(upper)- option in -corr2data-. However, be aware that you would need to create the initial row vector with the following matrix command:
matrix input o = (1, .3, .4, 1, 0, 1)
Once you’ve got your random variates with a known correlation structure created, you can convert their formats to match your needs.
To make it easier for you to experiment with, the complete code for creating data using a square, lower triangle, or upper triangle matrix is given below.
Happy coding!
// Example 1
// Create data set with known correlation structure
// Use a square matrix
clear
matrix input m = (1, .3, .4\ .3, 1, 0\ .4, 0, 1)
matrix list m
set seed 12345
set obs 100
corr2data a b c, corr(m)
// show data and verify correlation
list in 1/10, sep(0)
corr a b c

// Example 2
// Create data set with known correlation structure
// Use a lower triangle matrix...note the '\' are removed
clear
matrix input n = (1, .3, 1, .4, 0, 1)
matrix list n
set seed 12345
set obs 100
corr2data d e f, corr(n) cstorage(lower)
// show data and verify correlation
list in 1/10, sep(0)
corr d e f

// Example 3
// Create data set with known correlation structure
// Use an upper triangle matrix...note the '\' are removed
clear
matrix input o = (1, .3, .4, 1, 0, 1)
matrix list o
set seed 12345
set obs 100
corr2data g h i, corr(o) cstorage(upper)
// show data and verify correlation
list in 1/10, sep(0)
corr g h i
It doesn’t take long for new analysts to learn that copying and pasting code really speeds up the time needed to complete any job. This seems to be especially true when you need to create groups of new variables, or when performing the same transformation to a set of fields.
The reality is that copying and pasting code in these instances is actually the long way of accomplishing a task. Sure, the code will be easy to read. But you could complete the same tasks in a fraction of the time.
Fortunately, Stata has a set of built-in tools to make this process easier.
This article will show you how to use the -forvalues- command in Stata in order to automate repetitive tasks. Learning how to use this tool will help make your data analysis code cleaner, shorter, and faster to write.
Early on in their education, every programmer learns about loops. Loops tell a computer to perform a task or set of tasks repetitively, according to a specific set of criteria.
Usually we want to automate a task to be performed across a set of variables, perform the same commands using different numeric values in each iteration, or repeat code with each item from a given list.
This is what computers do best. In fact, some programmers would say that if you write the same piece of code more than once in a program, you’re wasting your time.
A good example of how loops are useful comes from working with decennial census data. A frequent task that analysts need to perform is the estimation of data values for intercensal years (those that fall between census collection points). Perhaps the simplest method for accomplishing this task is to use linear interpolation between the decennial census values. Calculate the average annual change in the data value using the decennial data points. Then generate nine new variables, adding the change value to each successive field.
The simple code to interpolate data between variables x1990 and x2000 might look something like this:
gen xdelta = (x2000 - x1990) / 10
gen x1991 = x1990 + xdelta
gen x1992 = x1991 + xdelta
gen x1993 = x1992 + xdelta
gen x1994 = x1993 + xdelta
gen x1995 = x1994 + xdelta
gen x1996 = x1995 + xdelta
gen x1997 = x1996 + xdelta
gen x1998 = x1997 + xdelta
gen x1999 = x1998 + xdelta
First of all, this code works. It makes sense, it’s easy to read, and it does the job we set out to do. For some analysts, this is enough and there’s no need to get fancy.
But what if you had to do this 30 or 40 times…or 100…or 500? Are your eyeballs spinning yet?
With a loop, this procedure can be accomplished with only three lines of code:
forvalues y = 1991(1)1999 {
    gen x`y' = x1990 + (`y' - 1990)*((x2000 - x1990) / 10)
}
Let’s dig in…
-forvalues-
In the example above, I use Stata’s -forvalues- command to create nine new variables. Each variable represents the next step in a linear progression from the x1990 value to the x2000 value.
The -forvalues- command consists of two pieces of code that work together:
1. The portion that controls where the loop begins, and how long the program should loop for.
2. The commands that you want to have repeated during each segment of the loop.
Conceptually, the command looks like this:
forvalues “loop control” {
    repeated command
}
The loop control begins by specifying the name of a local macro used to refer back to the values you are looping through. In this example, I use y as the name of the local macro.
The next section of the loop control specifies the starting value for y, how much to increment y by with each loop, and an ending value. So, I start the loop with y = 1991. With each successive run through the loop, Stata will increase that value by 1. And the loop will end at 1999.
The repeated command tells Stata what to do with the values in the loop control section. In the code above, Stata creates nine new variables (x1991 to x1999) using the -gen x`y'- command. Here `y' is used to refer to the local macro defined in the loop control.
As the -gen- command creates each of the new variables, they are set equal to the value of x1990, plus some number of years (`y' - 1990), times the average annual change in the x variable ((x2000 - x1990)/10).
There are a few simple rules you need to follow when using the -forvalues- command:
1. The open brace ({) must be on the same line as the -forvalues- command.
2. The first command to be executed within -forvalues- must be on a new line.
3. The close brace (}) must also be on a line of its own.
4. -forvalues- looks for numeric values in the local macro of the loop control. If you want to loop over strings (i.e. text values), you’ll need to use -foreach- instead.
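For comparison, here is a minimal -foreach- sketch that loops over a list of names rather than numbers. The variable names are hypothetical, just to show the shape of the command:

```stata
* -foreach- iterates over the items in a list, including strings
foreach v in income educ age {
    summarize `v'    // runs once for each variable name in the list
}
```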
Here is some example code, with the output, so you can try it for yourself. Begin by creating a small fake data set to work with. Make sure you include the -set seed 12345- command so you get the same results I show below.
clear
set seed 12345
set obs 10
gen x1 = rnormal()
gen x5 = abs(x1) + rnormal()
gen delta = (x5 - x1)/4
list
forvalues v = 2(1)4 {
    gen x`v' = x1 + (`v' - 1)*((x5 - x1)/4)
}
list delta x1 x2-x4 x5
Your results after the first -list- command should look like the figure below:
The -forvalues- loop simply generates three new variables (x2-x4) that represent the interpolated values between x1 and x5. If all went well, your results should look like the figure below. Notice that the values from x1 to x5 change by the value of delta with each step.
I hope this post takes some of the mystery out of the -forvalues- command. In upcoming posts, I’ll show you how to use the -foreach- and -while- commands to create loops for different scenarios.
If you have any questions, feel free to ask them in the comments below. And don’t forget to subscribe to this blog via email to get the follow-up posts and new content as I post it!
Happy coding!
You are a code-writing machine.
That 3-day project you started this morning might actually be completed by the end of the day.
As your fingers fly across the keyboard, you think you can hear Stata singing your praise softly in the background.
Then IT happens…
Your program stops working right. The data begin looking like something from one of Lord Voldemort’s nightmares.
Your finely-tuned debugging skills kick in, and you track down the problem. That -collapse- command you issued a while back did something rather odd. It replaced all of the missing values in your data set with zeros!
But that’s not at all what you wanted! You wanted those to be missing values, not zeros.
Yep, we’ve all been there. Even the most seasoned Stata users get bit by this quirk every once in a while.
In this article, I show three ways Stata can treat missing values when using the -collapse- command and the sum() function.
How Do I Get Stata to Treat Missing Values The Way I Want?
Like any program, Stata certainly has its quirks. One of those quirks shows up when using the -collapse- command and the sum() function.
The basic issue is in the way sum() treats missing data; namely, missing values evaluate to zero.
This can be a serious problem if zeros and missing values are substantively different in your data. For example, a missing value might occur due to survey nonresponse by a respondent. But, this does not always mean it is acceptable to treat the missing data as a zero.
If you’ve never encountered this quandary before, then count yourself lucky. Most of us will run into this kind of scenario eventually.
Fortunately, there are a finite number of ways to deal with the problem:
If you are willing to treat missing values as zeros, then using the standard -collapse- command and sum() function is fine.
Entering the following syntax in Stata demonstrates this.
clear
input id x1 x2 x3
1 0 1 .
1 . 0 2
2 1 . 1
2 0 . 1
3 . 1 0
4 1 . .
4 1 . 1
4 1 . .
5 . . .
5 . . .
end
collapse (sum) x1 x2 x3, by(id)
list
The results from your -list- command should look like the figure below.
The results show that for id [1], the missing value (.) for x1 in the second row of input data has been treated as a zero, producing a summed value of zero. The same thing has happened with the missing input value of x3 in the first row: the summed value is 2, reflecting only the value in the second row of input data.
So how can you avoid this behavior? The next two sections will cover two additional scenarios.
It is entirely possible to get around Stata’s natural treatment of missing values in the -collapse sum()- function. All you need is a little programming skill.
To start, we’ll use the same input data set we used above (all code down to the -end- command). The critical code is written in the following code block:
bysort id: egen seq = seq()
foreach v in x1 x2 x3 {
    bysort id: egen c`v' = count(`v') if(`v'==.)
    replace c`v' = 1 if(c`v'==0)
    bysort id: egen c2`v' = count(c`v')
    bysort id: egen sum`v' = sum(`v')
    bysort id: egen mn`v' = mean(`v')
    replace sum`v' = -99 if(mn`v' == .)
    replace sum`v' = . if(seq > 1)
    replace sum`v' = -99 if (c2`v'~=0)
    drop c`v' c2`v' mn`v' `v'
    rename sum`v' `v'
}
drop seq
The code above performs 12 functions in sequence for each variable that we need aggregated (in this case x1, x2, and x3).
1. bysort id: egen seq = seq(): This line creates a variable (seq) that counts cases within each value of the id variable.
2. foreach v in x1 x2 x3 { }: This command encompasses the majority of the code block. For each variable x1, x2, and x3, Stata will execute the next set of commands.
3. bysort id: egen c`v' = count(`v') if(`v'==.): Create a variable (e.g. cx1) that marks the cases with missing values in the original variable (e.g. x1). Stata codes these values as zeros, because among these all-missing cases the count of nonmissing values is 0.
4. replace c`v' = 1 if(c`v'==0): Replace the zeros in our new variable (i.e. cx1) with values of 1.
5. bysort id: egen c2`v' = count(c`v'): Now we create another new variable (e.g. c2x1), which is the sum of the 1 values we created in the previous two steps, within each level of id. So, c2x1 is 1 for both cases with id = 1, since only one of these cases is a missing value. This variable will ultimately help us determine which cases should be given missing data values for the final -collapse- function.
6. bysort id: egen sum`v' = sum(`v'): Here, we simply create a variable (e.g. sumx1) that contains the standard -egen sum()- of the original variable. For example, the value of sumx1 is 0 for cases with id = 1 (the nonmissing value there is 0, and the missing value is ignored). For cases with id = 4, the value of sumx1 is 3. These variables will end up being substituted for the original data in the -collapse- command (see the -rename- command below).
7. bysort id: egen mn`v' = mean(`v'): This line creates a variable (e.g. mnx1) that contains the average of the values in x1, within each value of id. If one of the original variables (e.g. x1) is completely missing for all cases of a particular value of id, then the mean value (e.g. mnx1) will also be missing. This variable serves to identify these cases in the data.
8. replace sum`v' = -99 if(mn`v' == .): Here we start specifying the values that will be considered missing. As mentioned above, if the mean of a set of cases is missing (mnx1 = .), then we want the sum in the collapsed data to also be missing. This command accomplishes that task.
9. replace sum`v' = . if(seq > 1): Now, just to make sure we don’t accidentally double count our sums in the data, this line of code replaces the sum`v' values of any case with a seq value greater than 1 with missing (.). Thus, only one case per value of id will have a value. The rest will be missing, and will evaluate as zeros in the -collapse sum()- function.
10. replace sum`v' = -99 if (c2`v'~=0): Finally, since we want the presence of any missing value in the original data to produce a missing value in the collapsed data, we need to identify values of id that have missing values in their constituent cases. The c2`v' variables will tell us if there are any missing values. For example, c2x1 is 0 for id = 2, since there are no missing values in x1. But c2x1 is 2 for id = 5, since both cases are missing in x1.
11. The -drop- and -rename- commands are used to clean up some of the unneeded variables, and to replace the original variables with our new variables that identify the missing values with -99.
12. Once the -foreach- command is completed (see the closing curly brace), we drop the seq variable since it is no longer needed.
Now, if you stop here and enter a -list- command, your data should look like this:
The only thing left to do at this point is to run a standard -collapse- command and replace the negative numbers with missing values. Note: you should use missing-data codes (e.g. -99 here) that are appropriate for your data and will allow you to identify missing data easily.
collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
    replace `v' = . if(`v' < 0)
}
If all went well, you should get a final data set in which any missing value in the original data would trigger a missing value in the collapsed data:
Now that you’ve seen the nuts and bolts of the code for customizing the -collapse- command in Stata, I’ll show you how to perform a hybrid aggregation.
We’ll use the same starting data set, and I won’t go into line-by-line detail on the code. But the following code block treats missing data in the following ways.
First, if all of the cases for a value of id are missing, then the collapsed value will be missing. Otherwise, if at least one case has non-missing data, then any missing values will be treated as zeros to preserve the non-missing data.
bysort id: egen seq = seq()
foreach v in x1 x2 x3 {
    bysort id: egen sum`v' = sum(`v')
    bysort id: egen mn`v' = mean(`v')
    replace sum`v' = -99 if(mn`v' == .)
    replace sum`v' = . if(seq > 1)
    drop mn`v' `v'
    rename sum`v' `v'
}
drop seq
collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
    replace `v' = . if(`v' < 0)
}
If all went well, your final dataset should look like this figure:
So there you have three ways to use the -collapse- command with the sum() function and missing data. While Stata’s programmers have done a great job with the software, there are times when the standard behavior of the program is not what you want. In those cases, these pieces of code might come in handy.
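As an aside, if all you need is the hybrid behavior (missing only when every case in a group is missing), a shorter route may be available in recent Stata versions: -egen-’s total() function accepts a missing option that returns missing when all values in the group are missing. This is a hedged sketch rather than a tested drop-in, so check -help egen- for your version:

```stata
* Hybrid missing handling via -egen total(), missing- (sketch):
* t`v' is . only when every case of `v' within id is missing.
foreach v in x1 x2 x3 {
    bysort id: egen t`v' = total(`v'), missing
}
* Keep one row per id, carrying the group totals forward
collapse (first) tx1 tx2 tx3, by(id)
rename (tx1 tx2 tx3) (x1 x2 x3)   // grouped rename requires Stata 12+
```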
If you want to know more about how Stata is programmed to handle missing values – and more importantly, why – then I recommend two of Nick Cox’s Statlist postings on the subject:
http://www.stata.com/statalist/archive/2010-02/msg00430.html
http://www.stata.com/statalist/archive/2005-09/msg00944.html
Happy Coding!
When it comes to data analysis, if you’re anything like me you probably work across several different platforms. Depending on your analytical needs you might get basic descriptives from Excel, but use programs like Stata and R for more complex routines.
One of the frustrations that go with this form of data science is the need to transfer data from one program to another.
It’s straightforward to export data in .csv format, and then import the data in a different program. But you may lose some important formatting such as variable and value labels in the data set.
Programs such as Stat Transfer make it easy to convert data from one program format to another. But as with the .csv export, it takes valuable time to convert and transfer the data. And you end up with multiple copies of the same data set cluttering up your machine.
Wouldn’t it be way easier if you could just call one data analysis program from inside another? As a Stata user, I’ve often wished I could perform a quick analysis in R without having to go through all of this effort.
In this article, I’ll show you a method for writing your R code, running R, feeding it data, returning R output in a text file, and returning any changes in your dataset to Stata…all while working in Stata’s native environment. I’m doing this on a PC, so Mac users will need to forgive me.
I like to create the simplest path name I can, so for testing purposes I create a Stata folder right on the C:/ drive.
clear
set more off
log close _all
cd "c:/Stata/"
Once that’s done, I create a fake data set with a known correlation structure using Stata’s -matrix- and -corr2data- commands.
set obs 100
matrix c = (1,-.5,0 \ -.5,1,.4 \ 0,.4,1)
corr2data x y z, corr(c)
Now save your test data set in the Stata folder you created above:
save "testout.dta"
file close _all
So much for the foreplay…now let’s have fun! Using Stata’s -file- command, we create a new file to hold the R code we want to run. This new file will be called test.R, but in Stata we’ll refer to it by the alias rcode.
file open rcode using test.R, write replace
Next, we tell Stata that we want to write something to our new file. That something is a list of R commands that will set a new working directory (c:/Stata), read in our dataset, run analyses, and return the augmented data.
Since we want to write a text file for R to run, we’ll need to enclose the commands in `” and “’ quotes (notice the combination of single and double quotes). The quote combo is necessary because we’re including quotes inside the text of the R program.
Also notice that we need to end each line of the text-writing process with _newline, except the last line. This tells Stata to create a new line in the text file. Finally, we finish writing the R program text file by using the -file close- command.
file write rcode ///
    `"setwd("c:/Stata/")"' _newline ///
    `"library(foreign)"' _newline ///
    `"data<-data.frame(read.dta("testout.dta"))"' _newline ///
    `"attach(data)"' _newline ///
    `"x2<-x*2"' _newline ///
    `"data2<-cbind(data,x2)"' _newline ///
    `"write.dta(data2,"testin.dta")"'
file close rcode
Stata can invoke an operating system window (i.e. a command prompt) using the -shell- command, or alternatively the ! command. All you need to do is provide Stata with the complete path and filename of the program you want to run. Adding the code CMD BATCH tells Windows to run R in batch mode. Finally, we run the R script by telling R to execute the contents of test.R.
shell "C:\Program Files\R\R-2.15.1\bin\x64\R.exe" CMD BATCH test.R
Now we can read the output file from R back into Stata and summarize the changes to the dataset. I also clean up the directory by removing unneeded files using the -rm- command.
use testin.dta, clear
summarize
rm testout.dta
rm test.R
rm .RData
I leave the test.Rout file so we can see the log from R, including output and the run-time log.
Now you’ve got your original data (plus an extra variable) back in Stata, and you have a log of the R results from your script. If you want to use this example as a template to start calling R from Stata for your own analyses, I’m including the complete script and comments in the code box below (note: I added some -quietly- commands to keep your Stata log window a bit cleaner).
Let me know what you think about this, and happy coding!!
*// Set Working Directory
clear
set more off
log close _all
cd "c:/Stata/"

*// Create Data
set obs 100
matrix c = (1,-.5,0 \ -.5,1,.4 \ 0,.4,1)
corr2data x y z, corr(c)

*// Save the data for R
quietly: save "testout.dta"
quietly: file close _all

*// Write R Code
*// dependencies: foreign
quietly: file open rcode using test.R, write replace
quietly: file write rcode ///
    `"setwd("c:/Stata/")"' _newline ///
    `"library(foreign)"' _newline ///
    `"data<-data.frame(read.dta("testout.dta"))"' _newline ///
    `"attach(data)"' _newline ///
    `"x2<-x*2"' _newline ///
    `"data2<-cbind(data,x2)"' _newline ///
    `"write.dta(data2,"testin.dta")"'
quietly: file close rcode

*// Run R
quietly: shell "C:\Program Files\R\R-2.15.1\bin\x64\R.exe" CMD BATCH test.R

*// Read Revised Data Back to Stata
quietly: use testin.dta, clear
summarize

*// Clean up
rm testout.dta
rm test.R
rm .RData
From time to time, I’ll be including code snippets of various programming languages in my posts. And I thought you might find it interesting to know how these are being created.
There are a few different methods for creating and highlighting code snippets. We can refer to them as the <code> method, the <pre> method, and the plugin method.
And in fact, I used some special codes to write the <code> and <pre> in the previous sentence…but we’ll get to that in just a minute. Let’s start with the main methods for introducing code snippets.
The <code> method
Sometimes you just want to reformat text in a paragraph so that readers know you are referring to programming code. The easiest way to do that is to enclose the commands in <code> and </code> tags:
For example, in this sentence I offset the programming code command phrase by wrapping it in the appropriate tags.
Here is what I wrote as the underlying sentence:
For example, in this sentence I offset the <code>programming code command</code> phrase by wrapping it in the appropriate tags.
WordPress knows that the HTML tags (denoted by the < and >) indicate to change the text between them into a monospaced type that is typical of programming texts.
If you write a block of text using the <code> method, you will get monospaced type with a blank background:
set obs 500
gen x = rnormal()
gen y = 5 - .3*x + rnormal()
regress y x
The <pre> method
The <pre> method also converts your words into monospaced type, but it also add a nice grey background to highlight the code block:
set obs 500
gen x = rnormal()
gen y = 5 - .3*x + rnormal()
regress y x
What I actually wrote above is this:
<pre> set obs 500
gen x = rnormal()
gen y = 5 - .3*x + rnormal()
regress y x</pre>
No HTML coding inside <code> & <pre> tags
One drawback of the <code> and <pre> methods is that neither works with HTML tags inside the code block. Contrary to what other sources may say, the presence of HTML tags enclosed in < and > will cause your browser to follow the HTML commands. The tags themselves won’t appear.
For example, if I wrote the following code:
<pre>Here is the code to create a heading:
<h1>Level 1 Heading</h1></pre>
You would actually see:
Here is the code to create a heading:
In order to show the <h1> and </h1>, we need to be able to show the < and > characters. This is done by typing &lt; and &gt; wherever you want < and >.
For example, to get the first code block in this section of the post I actually typed:
<code><pre>Here is the code to create a heading:
&lt;h1&gt;Level 1 Heading&lt;/h1&gt;</pre></code>
And, as you might have guessed, the code block you just read is wrapped in <pre> tags that you don’t see. I’ve been making extensive use of this little trick in this post.
The plugin method
The <code> and <pre> methods are relatively simple to implement once you get used to them. And they provide a clean and simple way to display programming code in a post.
But I like things a little fancier. I love developer software, and especially IDEs (Integrated Development Environments) that give line numbers and color-coded highlighting for the language you are using.
The easiest way to do this in WordPress is like this:
[sourcecode language="r"]
y <- duncan.model(prestige ~ income + education)
summary(duncan.model)
[/sourcecode]
Here I am writing code for the R statistical programming language. Hence, the option language="r" tells WordPress how to highlight the text. The result looks like this:
y <- duncan.model(prestige ~ income + education)
summary(duncan.model)
Not only do you get line numbers, alternating shading, and highlighted code, if you run your cursor over the code block you will see several options to view the source code, copy to clipboard, or print the code. You might also note that the tilde is properly represented in the fancy code block, but looks much like a ‘-‘ sign in the basic formatting.
The [sourcecode] function can highlight several different languages. Unfortunately, if you are a Stata user, code highlighting isn’t available yet. But you can still get the other options by specifying language="text":
[sourcecode language="text"]
set obs 100
gen x = rnormal()
gen y = 5 - .3*x + rnormal()
regress y x
[/sourcecode]
produces the following code block:
set obs 100
gen x = rnormal()
gen y = 5 - .3*x + rnormal()
regress y x
This feature is available on WordPress.com blogs, and is based on Alex Gorbatchev’s program called SyntaxHighlighter. If you use WordPress on a different host, you can get the program as a plugin.
Overall, these several methods for introducing code snippets should cover most of your needs. For more information on adding code to posts, see http://en.support.wordpress.com/code/posting-source-code/. And for more information on the SyntaxHighlighter plugin, see http://codex.wordpress.org/Writing_Code_in_Your_Posts.
]]>Unfortunately, the answers to these questions seem to present a quandary that was eloquently summed up by a comment I read on another blog whose name I’ve forgotten (perhaps it was Chandoo):
You need experience to get a job as an analyst. But the only way to get experience is to work in a job as an analyst.
Employers today are asking for more from all of their employees. And data analysts are no exception. In fact, the pressure to produce more with less is pushing many employers to merge business functions across smaller workforces.
For the data scientist, and more importantly the aspiring marketing researcher or business intelligence analyst, there is a three-headed monster to contend with. Each head represents a different role that you will need to fulfill in your career.
This post will help flesh out what this three-headed monster looks like and how each head behaves. Most importantly, you’ll learn what you need to know in order to slay this monster one head at a time, and become a rock star data analyst!
Data scientists in the business world need a set of skills not unlike those found in other professional research fields. This triumvirate consists of the following types of knowledge:

- Substantive knowledge of the field you are studying
- Data analysis and research methodology
- Computer programming
Each of these types of knowledge represents one facet of information that the commercial data analyst will rely on regularly to perform at their best. So, let’s look at each in more detail.
There are some who will argue that any competent analyst should be able to use data to answer questions regardless of the substantive topic area.
To a degree, this is true. For example, I can study crime rates just as effectively as corporate customer satisfaction scores.
But I have the advantage of having backgrounds in both business and criminology.
Substantive knowledge about the field of study helps place your results in context, and allows a frame of reference for what is normal and what is unexpected.
Whether you are working on a six sigma project, consumer loyalty and satisfaction metrics, or optimizing your company sales funnel, knowing the relevant parameters and constraints on the process is important.
To become a rock star analyst, you don’t need deep substantive training. A solid background in the fundamentals will get you going. After that, you’ll simply get better as you learn more.
It almost goes without saying that a rock star data analyst should have solid skills in data analysis. But just which skills are necessary?
In today’s digital world, strong quantitative statistical skills seem most important. These skills fall under the various headings of descriptive and inferential statistics, econometrics, and frequentist and Bayesian analytical perspectives.
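As a toy illustration of the descriptive side of that toolkit (my own sketch with made-up data, not an example from the original post), here is how you might summarize a small sample of customer satisfaction ratings in Python, the language recommended later in this post:

```python
import statistics

# Hypothetical 1-5 customer satisfaction ratings (made-up data)
ratings = [4, 5, 3, 4, 5, 2, 4, 5, 3, 4]

mean = statistics.fmean(ratings)      # arithmetic mean
median = statistics.median(ratings)   # middle value of the sorted sample
sd = statistics.stdev(ratings)        # sample standard deviation

print(f"mean={mean}, median={median}, sd={sd:.2f}")
```

Knowing what these summaries do (and do not) tell you about a distribution is exactly the kind of fundamental the rock star analyst has down cold before reaching for fancier inferential tools.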
But what many outside the realm of data science don’t understand is that good analysts should also have knowledge of research methodologies such as experimental and quasi-experimental design, measurement principles, survey design, and secondary data analysis capabilities.
I also argue that strong candidates for data science roles should have at least a fundamental understanding of qualitative data analysis. Observing a social context directly, interviewing relevant stakeholders, and running focus groups provide a richness of information that cannot be captured in any database or survey.
The rock star analyst should know how to delve into that information and make sense of the patterns. Ultimately, this ability will inform quantitative data analysis efforts, and vice versa.
There is no denying that the growing world of e-commerce and digital content relies on a bedrock of programming code.
But not all code is created equal.
There are interpreted languages such as PHP, along with markup languages like HTML and CSS: the bedrock on which most web content is created. These are read and processed by other programs (e.g. web browsers rendering HTML), and do not need to be converted to machine language before running.
In contrast, there are compiled languages such as C++ and Visual Basic. These are translated to machine code before they run, which allows greater flexibility in creating complex processes and makes them ideal for writing everyday programs (e.g. that browser you’re using is probably written in C++).
Then there are statistical programming languages that fall somewhere in between.
On the interpreted side, there are the proprietary languages used by major packages like SAS, SPSS, and Stata. On the other hand, there are programs like Excel that can execute VBA code to perform a multitude of analytics tasks.
Then there are pure statistical computing languages like S and R, which, much like Python, are interpreted languages that lean on compiled code under the hood for performance.
Now don’t get nervous…the rock star analyst doesn’t need to have a degree in computer science. However, learning something about basic programming structures will be incredibly useful for efficient data management and analysis. Regardless of your platform of choice, you will at least need to learn the code for that program (a topic I’ll be exploring extensively in this blog).
I also recommend that you take the time to learn another language aside from the one used by your preferred statistical package. Python is probably the most widely useful language today. But it can be a little difficult for the novice programmer. If that’s you, maybe start with something more fun such as basic HTML and CSS for web programming.
Slaying the Three-Headed Monster
All of this might sound a little overwhelming if you’re new to the data analysis profession. So, let me break it down into a summary of next steps.
First, begin learning as much as you can about statistics and research methodologies. There are a number of great websites that offer such training for free (e.g. Coursera and Khan Academy are good places to begin your search).
Next, choose a data analysis platform to learn on. R is more difficult to learn, but is open-source (i.e. free) and gaining widespread popularity (see www.r-project.org for the latest version). It would be a good idea to talk to your employer, or others in the data analysis field, to see what they recommend. I primarily use Excel, Stata, and R these days.
Finally, begin learning the basics of computer programming. It doesn’t really matter which language you use, since virtually every language makes use of the same basic tools such as if-else statements, loops, arrays, and variables (it’s okay if you don’t know these terms right now…you’ll learn about them quickly once you start). I recommend finding a good book, or web-based course, on Python since it can be used for everything from video-game programming to scientific computing.
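To make those terms concrete, here is a minimal sketch of my own (not from the original post, with made-up data) showing the basic building blocks just mentioned: variables, an array (a Python list), a loop, and an if-else statement.

```python
# Minimal illustration of basic programming structures (hypothetical data):
# variables, an array (a Python list), a loop, and an if-else statement.
scores = [72, 88, 95, 61, 79]   # a variable holding an array of test scores

passing = []                    # another variable: an empty list
for s in scores:                # a loop over each element of the array
    if s >= 70:                 # an if-else statement
        passing.append(s)       # keep scores at or above the passing mark
    else:
        print(s, "is below the passing mark")

print("passing scores:", passing)
```

Once these four ideas click, most introductory material in any language, from Stata do-files to Python scripts, becomes far easier to read.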
One step at a time…you’ll get there. But the most important thing is to start by taking the first step. Remember, the fastest path between two points can still only be travelled one step at a time.
]]>