# How to Preserve Missing Values with Stata’s Collapse Command

You are a code-writing machine.

That 3-day project you started this morning might actually be completed by the end of the day.

As your fingers fly across the keyboard, you think you can hear Stata singing your praise softly in the background.

Then IT happens…

Your programs stops working right. The data begin looking like something from one of Lord Voldemort’s nightmares.

Your finely-tuned debugging skills kick in, and you track down the problem. That `-collapse-` command you issued a while back did something rather odd. It replaced all of the missing values in your data set with zeros!

But that’s not at all what you wanted! You wanted those to be missing values, not zeros.

Yep, we’ve all been there. Even the most seasoned Stata users get bit by this quirk every once in a while.

In this article, I show three ways Stata can treat missing values when using the `-collapse-` command and the `sum()` function.

# How Do I Get Stata to Treat Missing Values The Way I Want?

Like any program, Stata certainly has its quirks. One of those quirks shows up when using the `-collapse-` command and the `sum()` function.

The basic issue is in the way `sum()` treats missing data; namely, the missing values evaluate to zero.

This can be a serious problem if zeros and missing values are substantively different in your data. For example, a missing value might occur due to survey nonresponse by a respondent. But, this does not always mean it is acceptable to treat the missing data as a zero.

If you’ve never encountered this quandary before, then count yourself lucky. Most of us will run into this kind of scenario eventually.

Fortunately, there are a finite number of ways to deal with the problem:

1. Treat the missings as zeros: what the standard collapse command does.
1. Evaluate collapse sums with any missing values as missing.
1. Hybrid approach: differentiate between missing and other values, such as zeros.

## Treat Missing Values as Zeros

If you are willing to treat missing values as zeros, then using the standard `-collapse-` command and `sum()` function is fine.

Entering the following syntax in Stata demonstrates this.

```clear
input id x1 x2 x3
1 0 1 .
1 . 0 2
2 1 . 1
2 0 . 1
3 . 1 0
4 1 . .
4 1 . 1
4 1 . .
5 . . .
5 . . .
end
collapse (sum) x1 x2 x3, by(id)
list
```

The results from your `-list-` command should look like the figure below.

The results show that for id [1], the missing value (.) for x1 in the second row of input data has been treated as a zero, producing a summed value of zero. The same thing has happened with the missing input value of x3 in the first row: the summed value is 2, reflecting only the value in the second row of input data.

So how can you avoid this behavior? The next two sections will cover two additional scenarios.

## Evaluating Collapse Sums with Any Missing Values as Missing

It is entirely possible to get around Stata’s natural treatment of missing values in the `-collapse sum()-` function. All you need is a little programming skill.

To start, we’ll use the same input data set we used above (all code down to the `-end-` command). The critical code is written in the following code block:

```bysort id: egen seq = seq()
foreach v in x1 x2 x3{
bysort id: egen c`v' = count(`v') if(`v'==.)
replace c`v' = 1 if(c`v'==0)
bysort id: egen c2`v' = count(c`v')
bysort id: egen sum`v' = sum(`v')
bysort id: egen mn`v' = mean(`v')
replace sum`v' = -99 if(mn`v' == .)
replace sum`v' = . if(seq > 1)
replace sum`v' = -99 if (c2`v'~=0)
drop c`v' c2`v' mn`v' `v'
rename sum`v' `v'
}
drop seq
```

The code above performs 12 functions in sequence for each variable that we need aggregated (in this case `x1`,`x2`,and `x3`).

1. `bysort id: egen seq = seq()` : This line creates a variable (`seq`) that counts cases within each value of the `id` variable.
2. `foreach v in x1 x2 x3 { }` : This command encompasses the majority of the code block. For each variable `x1`,`x2`, and `x3`, Stata will execute the next set of codes.
3. `bysort id: egen c`v' = count(`v') if(`v'==.)` : Create a variable (e.g. cx1) that marks the cases with missing values in the original variable (e.g. x1). Stata will code these values as zeros because it’s default behavior is to evaluate the sum of missing values as 0.
4. `replace c`v' = 1 if(c`v'==0)`: replace the zeros in our new variable (i.e. cx1) with values of 1.
5. `bysort id: egen c2`v' = count(c`v')`: Now we create another new variable (e.g. c2x1), which is the sum of the [1] values we created in the previous two steps, within each level of `id`. So, `c2x1` is a `1` for both cases with `id = 1` since only one of these cases is a missing value. This variable will ultimately help us determine which cases should be given missing data values for the final `-collapse-` function.
6. `bysort id: egen sum`v' = sum(`v')`: Here, we simply create a variable (e.g. sumx1) that contains the standard `-egen sum()-` of the original variable. For example, the value of `sumx1` is 0 for cases with `id = 1` since the missing value in the second case is treated as missing. For cases with `id = 4`, the value of `sumx1` is 3. These variables will end up being substituted for the original data in the `-collapse-` command (see the `-rename-` command below).
7. `bysort id: egen mn`v' = mean(`v')` : this line creates a variable (e.g. mnx1) that contains the average of the values in `x1`, within each value of `id`. If one of the original variables (e.g. `x1`) is completely missing for all cases of a particular value of `id`, then the mean value (e.g. mnx1) will also be missing. This variable serves to identify these cases in the data.
8. `replace sum`v' = -99 if(mn`v' == .)` : Here we start specifying the values that will be considered missing. As mentioned above, if the mean of a set of cases is missing (`mnx1 = .`), then we want the sum in the collapsed data to also be missing. This command will accomplish that task.
9. `replace sum`v' = . if(seq > 1)` : Now, just to make sure we don’t accidentally double count our sums in the data, this line of code replaces all values of `sum`v'` variables that have `seq` values greater than [1] with missing (.) values. Thus, only one case per value of `id` will have a value. The rest will be missing, and will evaluate as zeros in the `-collapse sum()-` function.
10. `replace sum`v' = -99 if (c2`v'~=0)` : Finally, since we want the presence of any missing value in the original data to produce a missing value in the collapsed data, we need to identify values of `id` that have missing values in their constituent cases. The `c2`v'` variables will tell us if there are any missing values. For example, `c2x1` is [0] for `id = 2` since there are no missing values in `x1`. But `c2x1` is [2] for `id = 5` since both cases are missing in `x1`.
11. The `-drop-` and `-rename-` commands are used to clean up some of the unneeded variables, and replace the original variables with our new variables that identify the missing values with `-99`.
12. Now that the `-foreach-` command is completed (see closing curly brace), we drop the `seq` variables since it is no longer needed.

Now, if you stop here and enter a `-list-` command, your data should look like this:

The only thing left to do at this point, is run a standard `-collapse-` command and replace the negative numbers with missing values. Note: you should use missing data values (e.g. -99 here) that are appropriate for your data and will allow you to identify missing data easily.

```collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
replace `v' = . if(`v' < 0)
}
```

If all went well, you should get a final data set in which any missing value in the original data would trigger a missing value in the collapsed data:

## The Hybrid Approach: Some Missing, Some Zeros

Now that you’ve seen the nuts and bolts of the code for customizing the `-collapse-` command in Stata, I’ll show you how to perform a hybrid aggregation.

We’ll use the same starting data set, and I won’t go into line-by-line detail on the code. But the following code block treats missing data in the following ways.

First, if all of the cases for a value of `id` are missing, then the collapsed value will be missing. Otherwise, if at least one cases has non-missing data, then any other missing values will be treated as zeros to preserve the non-missing data.

```bysort id: egen seq = seq(
foreach v in x1 x2 x3{
bysort id: egen sum`v' = sum(`v')
bysort id: egen mn`v' = mean(`v')
replace sum`v' = -99 if(mn`v' == .)
replace sum`v' = . if(seq > 1)
drop mn`v' `v'
rename sum`v' `v'
}
drop seq
collapse (sum) x1 x2 x3, by(id)
foreach v in x1 x2 x3 {
replace `v' = . if(`v' < 0)
}
```

If all went well, your final dataset should look like this figure:

## Conclusion

So there you have three ways to use the `-collapse-` command with the `sum()` function and missing data. While Stata programmers have done a great job with the software, there are times that the standard functional behavior of the program is not what you want. In those cases, these pieces of code might come in handy.

If you want to know more about how Stata is programmed to handle missing values – and more importantly, why – then I recommend two of Nick Cox’s Statlist postings on the subject:

http://www.stata.com/statalist/archive/2010-02/msg00430.html

http://www.stata.com/statalist/archive/2005-09/msg00944.html

Happy Coding!