How to Preserve Missing Values with Stata’s Collapse Command

You are a code-writing machine.

That 3-day project you started this morning might actually be completed by the end of the day.

As your fingers fly across the keyboard, you think you can hear Stata singing your praise softly in the background.

Then IT happens…

Your programs stops working right. The data begin looking like something from one of Lord Voldemort’s nightmares.

Your finely-tuned debugging skills kick in, and you track down the problem. That -collapse- command you issued a while back did something rather odd. It replaced all of the missing values in your data set with zeros!

But that’s not at all what you wanted! You wanted those to be missing values, not zeros.

Yep, we’ve all been there. Even the most seasoned Stata users get bit by this quirk every once in a while.

In this article, I show three ways Stata can treat missing values when using the -collapse- command and the sum() function.

How Do I Get Stata to Treat Missing Values The Way I Want?

Like any program, Stata certainly has its quirks. One of those quirks shows up when using the -collapse- command and the sum() function.

The basic issue is in the way sum() treats missing data; namely, the missing values evaluate to zero.

This can be a serious problem if zeros and missing values are substantively different in your data. For example, a missing value might occur due to survey nonresponse by a respondent. But, this does not always mean it is acceptable to treat the missing data as a zero.

If you’ve never encountered this quandary before, then count yourself lucky. Most of us will run into this kind of scenario eventually.

Fortunately, there are a finite number of ways to deal with the problem:

  1. Treat the missings as zeros: what the standard collapse command does.
  1. Evaluate collapse sums with any missing values as missing.
  1. Hybrid approach: differentiate between missing and other values, such as zeros.

Treat Missing Values as Zeros

If you are willing to treat missing values as zeros, then using the standard -collapse- command and sum() function is fine.

Entering the following syntax in Stata demonstrates this.

input id x1 x2 x3
1 0 1 .
1 . 0 2
2 1 . 1
2 0 . 1
3 . 1 0
4 1 . .
4 1 . 1
4 1 . .
5 . . .
5 . . .
collapse (sum) x1 x2 x3, by(id)

The results from your -list- command should look like the figure below.

Stata Collapse CommandThe results show that for id [1], the missing value (.) for x1 in the second row of input data has been treated as a zero, producing a summed value of zero. The same thing has happened with the missing input value of x3 in the first row: the summed value is 2, reflecting only the value in the second row of input data.

So how can you avoid this behavior? The next two sections will cover two additional scenarios.

Evaluating Collapse Sums with Any Missing Values as Missing

It is entirely possible to get around Stata’s natural treatment of missing values in the -collapse sum()- function. All you need is a little programming skill.

To start, we’ll use the same input data set we used above (all code down to the -end- command). The critical code is written in the following code block:

bysort id: egen seq = seq()
foreach v in x1 x2 x3{
    bysort id: egen c`v' = count(`v') if(`v'==.)
    replace c`v' = 1 if(c`v'==0)
    bysort id: egen c2`v' = count(c`v')
    bysort id: egen sum`v' = sum(`v')
    bysort id: egen mn`v' = mean(`v')
    replace sum`v' = -99 if(mn`v' == .)
    replace sum`v' = . if(seq > 1)
    replace sum`v' = -99 if (c2`v'~=0)
    drop c`v' c2`v' mn`v' `v'
    rename sum`v' `v'
drop seq

The code above performs 12 functions in sequence for each variable that we need aggregated (in this case x1,x2,and x3).

  1. bysort id: egen seq = seq() : This line creates a variable (seq) that counts cases within each value of the id variable.
  2. foreach v in x1 x2 x3 { } : This command encompasses the majority of the code block. For each variable x1,x2, and x3, Stata will execute the next set of codes.
  3. bysort id: egen c`v' = count(`v') if(`v'==.) : Create a variable (e.g. cx1) that marks the cases with missing values in the original variable (e.g. x1). Stata will code these values as zeros because it’s default behavior is to evaluate the sum of missing values as 0.
  4. replace c`v' = 1 if(c`v'==0): replace the zeros in our new variable (i.e. cx1) with values of 1.
  5. bysort id: egen c2`v' = count(c`v'): Now we create another new variable (e.g. c2x1), which is the sum of the [1] values we created in the previous two steps, within each level of id. So, c2x1 is a 1 for both cases with id = 1 since only one of these cases is a missing value. This variable will ultimately help us determine which cases should be given missing data values for the final -collapse- function.
  6. bysort id: egen sum`v' = sum(`v'): Here, we simply create a variable (e.g. sumx1) that contains the standard -egen sum()- of the original variable. For example, the value of sumx1 is 0 for cases with id = 1 since the missing value in the second case is treated as missing. For cases with id = 4, the value of sumx1 is 3. These variables will end up being substituted for the original data in the -collapse- command (see the -rename- command below).
  7. bysort id: egen mn`v' = mean(`v') : this line creates a variable (e.g. mnx1) that contains the average of the values in x1, within each value of id. If one of the original variables (e.g. x1) is completely missing for all cases of a particular value of id, then the mean value (e.g. mnx1) will also be missing. This variable serves to identify these cases in the data.
  8. replace sum`v' = -99 if(mn`v' == .) : Here we start specifying the values that will be considered missing. As mentioned above, if the mean of a set of cases is missing (mnx1 = .), then we want the sum in the collapsed data to also be missing. This command will accomplish that task.
  9. replace sum`v' = . if(seq > 1) : Now, just to make sure we don’t accidentally double count our sums in the data, this line of code replaces all values of sum`v' variables that have seq values greater than [1] with missing (.) values. Thus, only one case per value of id will have a value. The rest will be missing, and will evaluate as zeros in the -collapse sum()- function.
  10. replace sum`v' = -99 if (c2`v'~=0) : Finally, since we want the presence of any missing value in the original data to produce a missing value in the collapsed data, we need to identify values of id that have missing values in their constituent cases. The c2`v' variables will tell us if there are any missing values. For example, c2x1 is [0] for id = 2 since there are no missing values in x1. But c2x1 is [2] for id = 5 since both cases are missing in x1.
  11. The -drop- and -rename- commands are used to clean up some of the unneeded variables, and replace the original variables with our new variables that identify the missing values with -99.
  12. Now that the -foreach- command is completed (see closing curly brace), we drop the seq variables since it is no longer needed.

Now, if you stop here and enter a -list- command, your data should look like this:Stata Collapse with Missing Values

The only thing left to do at this point, is run a standard -collapse- command and replace the negative numbers with missing values. Note: you should use missing data values (e.g. -99 here) that are appropriate for your data and will allow you to identify missing data easily.

collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
    replace `v' = . if(`v' < 0)

If all went well, you should get a final data set in which any missing value in the original data would trigger a missing value in the collapsed data:

Stata Collapse with Missing Values

The Hybrid Approach: Some Missing, Some Zeros

Now that you’ve seen the nuts and bolts of the code for customizing the -collapse- command in Stata, I’ll show you how to perform a hybrid aggregation.

We’ll use the same starting data set, and I won’t go into line-by-line detail on the code. But the following code block treats missing data in the following ways.

First, if all of the cases for a value of id are missing, then the collapsed value will be missing. Otherwise, if at least one cases has non-missing data, then any other missing values will be treated as zeros to preserve the non-missing data.

bysort id: egen seq = seq(
foreach v in x1 x2 x3{
    bysort id: egen sum`v' = sum(`v')
    bysort id: egen mn`v' = mean(`v')
    replace sum`v' = -99 if(mn`v' == .)
    replace sum`v' = . if(seq > 1)
    drop mn`v' `v'
    rename sum`v' `v'
drop seq
collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
    replace `v' = . if(`v' < 0)

If all went well, your final dataset should look like this figure:

Stata Collapse with Missing Values


So there you have three ways to use the -collapse- command with the sum() function and missing data. While Stata programmers have done a great job with the software, there are times that the standard functional behavior of the program is not what you want. In those cases, these pieces of code might come in handy.

If you want to know more about how Stata is programmed to handle missing values – and more importantly, why – then I recommend two of Nick Cox’s Statlist postings on the subject:

Happy Coding!


3 thoughts on “How to Preserve Missing Values with Stata’s Collapse Command

  1. Thanks for this – you might mention that the second and third approaches work only if the sum is positive. If the sum is negative, line 3 in the second part of the second approach and line 13 in the hybrid approach will turn it into a missing value. An alternative might be to pick an even more unlikely place-holder sum than -99, such as -99.123456789 and then revise line13:
    replace `v’ = . if(`v’ == -99.123456789)

    • CK, thanks for pointing out this issue. I really appreciate your input. You are correct, my current program works only if the sum of the non-missing values is positive. Your revision definitely fixes this problem. In lieu of choosing a highly specific decimal value, I might recommend choosing a missing data value that is clearly an implausible out-of-bounds value, such as -999999999.

      Unfortunately, there is no value that is guaranteed to work in all substantive situations. For example, a missing data value of -999999999 might not be viable for an economist studying changes in GDP across industrialized nation states. But it might be perfectly plausible for a sociologist studying city-level crime counts. I recommend researchers carefully consider the implications of using any specific missing data value given their substantive areas and plausible data values. My intention in this post was simply to illustrate the logic of the program, and how to overcome Stata’s logical – but somtimes undesireable – aggregation behavior.

Agree? Disagree? Tell Me What You Think

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s