Massachusetts Covid Breakdown by Age Part I: Methodology

Since late August, I’ve wanted to perform an analysis of Covid cases, hospitalizations, and deaths by age cohort. Unfortunately, the reporting of this information by the state is (1) not transparent, (2) internally inconsistent, and (3) sometimes clearly incorrect. I’ve spent time over the last month attempting to compile data from the weekly public health reports to which this information has been relegated. That work has been frustrating, to put it mildly.

The remainder of this post details the issues with the age cohort data provided by the state, and what I’ve done to calculate estimates of these important coronavirus statistics from the data provided. Parts of this post are technical in nature, so skip to the following posts for the bottom line.

Through August 11, the state provided a daily summary of cumulative cases, hospitalizations, and deaths by eight different age cohorts (and one group for unknown age) on its dashboard. Since then, all information by age cohort has been included only in the weekly public health report. In addition, the state dropped its cumulative reporting, and now provides age-based summaries for the prior two weeks only. This makes weekly tracking difficult, as one week rolls off and a new week is added to the summary in each report.

Fortunately, the state continues to provide a daily breakdown of cumulative cases, hospitalization, and death counts by race/ethnicity. Through August 11, the race/ethnicity total counts matched the total counts in the age cohort report as well as the aggregate totals for cases and deaths shown in the dashboard (confirmed and suspected).  After August 11, the state dropped the reporting of suspected cases and deaths from the daily dashboard as well. 

As an aside, the data aggregators Covid Tracking and Worldometers began to use the race/ethnicity report to tabulate cases and deaths in Massachusetts, as it was the only source of confirmed and suspected cases available on a daily basis. (The state added probable cases and deaths back to the daily dashboard report in early September, but these data are no longer on the front page.)

The race/ethnicity totals match the case and death totals reported by the state each day, but the two-week totals in the weekly public health report by age cohort do not match the figures for the equivalent period in the race/ethnicity report. Table 1 shows these discrepancies starting August 8.

Table 1: Massachusetts Reporting of Total Cases, Hospitalizations and Deaths
Comparison of Weekly Public Health Reports to Daily Race/Ethnicity Report
August 8 to October 3, 2020
                  From Daily Reports            |  From Weekly Reports
Two Weeks Ending  Cases  Hospitalized  Deaths   |  Cases  Hospitalized  Deaths
8-Aug              5443           231     211   |   3912           116      14
15-Aug             5159           240     200   |   4856           107     180
22-Aug             4649           212     200   |   4728            82     200
------------------------------------------------------------------------------
29-Aug             4476           186     208   |   4398            78     200
5-Sep              4830           196     220   |   4716            91     190
12-Sep             4570           174     187   |   4785            81     176
------------------------------------------------------------------------------
19-Sep             4985           190     179   |   5126            97     184
26-Sep             5510           211     195   |   5947           124     202
3-Oct              7122           208     212   |   7672           133     223

Table 1 clearly shows the mismatch between the totals from the two reporting sources. In particular, the weekly report's death total of 14 for August 8 (this is not a typo) and the new hospitalization totals for the entire period stand out as particularly inconsistent. Hospitalizations appear to be significantly underreported in the weekly report, both in comparison to the race/ethnicity report and to the new hospitalizations reported independently by hospitals (not shown here).
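For readers who want to quantify the mismatch, here is a minimal Python sketch that computes the ratio of the weekly-report total to the daily-report total for each row of Table 1. The figures are copied from the table; the names `TABLE_1` and `weekly_to_daily_ratios` are my own labels for illustration and do not come from the state's reports.

```python
# Rows of Table 1:
# (two weeks ending, daily cases, daily hosp., daily deaths,
#  weekly cases, weekly hosp., weekly deaths)
TABLE_1 = [
    ("8-Aug", 5443, 231, 211, 3912, 116, 14),
    ("15-Aug", 5159, 240, 200, 4856, 107, 180),
    ("22-Aug", 4649, 212, 200, 4728, 82, 200),
    ("29-Aug", 4476, 186, 208, 4398, 78, 200),
    ("5-Sep", 4830, 196, 220, 4716, 91, 190),
    ("12-Sep", 4570, 174, 187, 4785, 81, 176),
    ("19-Sep", 4985, 190, 179, 5126, 97, 184),
    ("26-Sep", 5510, 211, 195, 5947, 124, 202),
    ("3-Oct", 7122, 208, 212, 7672, 133, 223),
]


def weekly_to_daily_ratios(rows):
    """Return {period: (cases, hosp, deaths)} as weekly total / daily total."""
    ratios = {}
    for period, d_cases, d_hosp, d_deaths, w_cases, w_hosp, w_deaths in rows:
        ratios[period] = (
            w_cases / d_cases,
            w_hosp / d_hosp,
            w_deaths / d_deaths,
        )
    return ratios


ratios = weekly_to_daily_ratios(TABLE_1)
for period, (cases, hosp, deaths) in ratios.items():
    print(f"{period:>7}  cases {cases:.2f}  hosp {hosp:.2f}  deaths {deaths:.2f}")
```

A ratio of 1.00 would mean the two sources agree. With the Table 1 figures, the hospitalization ratio stays well below 1 in every period, and the August 8 death ratio is far below every other row.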

Calculating accurate estimates is complicated by another factor: on September 2, the state changed its definition of probable cases and eliminated 8,050 cases, 26 deaths, and roughly 100 hospitalizations from the historical count. Fortunately, the state did provide a back history for the changes in cases and deaths, so these figures can be adjusted accordingly. The state did not provide a back history for the change in hospitalizations, so the figure of 100 is an estimate. And while the state did provide a back history for total cases and deaths, it did not provide revised figures by age cohort.

This data definition change is why Table 1 is broken into three three-week periods. The first period, through August 22nd, is before the state made the change, so the figures shown for those dates are the actual numbers as reported, not the adjusted numbers, in order to show equivalent totals for comparisons between the two sources. (In my estimates later, I do adjust all figures downward.)

The second period is a “transition period” that reflects these definition changes. The August 29th figures from the weekly report (released on September 2nd) were already adjusted, but the daily reports were not. The following two weeks, through September 12th, contain data both before and after the definition change. Therefore, the weekly data is as reported, and the daily data has been adjusted through September 1 to reflect the case definition changes.

The figures for the final three-week period are as reported, because the case definitions for both reports are aligned once again. (In this final period, the weekly reported figures for cases and deaths are always higher than the daily reported figures. It almost appears that the state is erroneously using a 15-day total, rather than a 14-day total.)
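The 15-day hunch can be given a rough plausibility check. If daily counts were flat across the window, a 15-day total would exceed a 14-day total by a factor of 15/14, or about 1.071. The sketch below compares that factor with the weekly-to-daily ratios for the final three rows of Table 1. The flat-count assumption is mine and is only approximate, so this is a sanity check rather than proof; `FINAL_PERIOD` and `excess_ratios` are illustrative names.

```python
# Final three rows of Table 1, where both reports share a case definition:
# (two weeks ending, daily cases, daily deaths, weekly cases, weekly deaths)
FINAL_PERIOD = [
    ("19-Sep", 4985, 179, 5126, 184),
    ("26-Sep", 5510, 195, 5947, 202),
    ("3-Oct", 7122, 212, 7672, 223),
]

# Expected weekly/daily ratio if the weekly report summed 15 days instead
# of 14 AND daily counts were constant (an assumption, not a known fact).
EXPECTED_15_DAY_RATIO = 15 / 14


def excess_ratios(rows):
    """Return [(period, case ratio, death ratio)] as weekly / daily."""
    return [
        (period, w_cases / d_cases, w_deaths / d_deaths)
        for period, d_cases, d_deaths, w_cases, w_deaths in rows
    ]


final_ratios = excess_ratios(FINAL_PERIOD)
print(f"expected if 15 days were summed: {EXPECTED_15_DAY_RATIO:.3f}")
for period, case_ratio, death_ratio in final_ratios:
    print(f"{period:>7}  cases {case_ratio:.3f}  deaths {death_ratio:.3f}")
```

Every ratio comes out above 1, and the case ratios for the last two periods land close to 15/14. The death ratios and the September 19 case ratio are smaller, so the check is suggestive at best.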

To reflect all of this, I used the following approach to estimate cases, hospitalizations, and deaths by age cohort.  First, all the data prior to September 2nd has been adjusted to reflect the definition change for probable cases.  Second, prior to and including August 8th, I derived weekly figures by simply summing daily figures.  (This means that I do not have to rely on the August 8th weekly public health report, as the 14 deaths reported there are clearly wrong).  Finally, starting August 15th, I used the following approach:

(1) For each two-week period, calculate the total number of cases, hospitalizations, and deaths over that period from the race/ethnicity report.

(2) Scale the age cohort figures in the weekly report for each statistic so that the totals calculated match that for the same period from the race/ethnicity report. For example, suppose there are 200 total deaths over a two-week period from the race/ethnicity report, but 160 deaths reported for that same period in the weekly age cohort report. Furthermore, suppose there are 20 reported deaths for people aged 60 through 79. This means that I calculate 25 deaths for that age cohort for that period (200 / 160 * 20).

(3) Subtract off the figures calculated for each age cohort for the prior week for each statistic to derive an estimate for the current week.  Because I have actual daily data from the race/ethnicity report for the week ending August 8th, I have a starting point for the August 15th calculation.
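Steps (2) and (3) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the spreadsheet or code actually used for these posts: the function names are hypothetical, and apart from the 200 / 160 / 20 example from step (2), the cohort labels and counts are made up to show the mechanics.

```python
def scale_to_total(cohort_counts, target_total):
    """Step 2: rescale weekly-report cohort counts so they sum to target_total."""
    reported_total = sum(cohort_counts.values())
    if reported_total == 0:
        raise ValueError("cannot scale cohorts that sum to zero")
    factor = target_total / reported_total
    return {cohort: count * factor for cohort, count in cohort_counts.items()}


def current_week(two_week_scaled, prior_week):
    """Step 3: subtract the prior week's estimate from the two-week figure."""
    return {
        cohort: two_week_scaled[cohort] - prior_week.get(cohort, 0.0)
        for cohort in two_week_scaled
    }


# Two-week deaths by cohort from the weekly report. Only the 60-79 figure
# and the 160 total come from the example in step (2); the other two
# cohorts are hypothetical filler chosen so the counts sum to 160.
weekly_report = {"0-59": 10, "60-79": 20, "80+": 130}

# Step 1 result: two-week total from the race/ethnicity report.
race_ethnicity_total = 200

scaled = scale_to_total(weekly_report, race_ethnicity_total)

# Hypothetical prior-week estimates, for illustration only.
prior_week = {"0-59": 6.0, "60-79": 12.0, "80+": 80.0}
this_week = current_week(scaled, prior_week)

print(scaled)
print(this_week)
```

Here the 60 through 79 cohort scales to 25 deaths (200 / 160 * 20), and the scaled cohorts sum to the race/ethnicity total. Nothing in `current_week` stops a result from going negative when the prior week's estimate exceeds the scaled two-week figure.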

This approach ensures two things.  First, the percentages by age cohort for each statistic are preserved for each two-week period.  Second, total cases, hospitalizations, and deaths match the totals reported by the state for each two-week period. 

The third step, the subtraction, seems to lead to more volatile weekly changes than one might expect, and is probably the weakest part of the approach. This is particularly true for hospitalizations, for which the weekly data is most suspect. In fact, for the August 15th calculation, which blends together daily age data with weekly data, a naive implementation leads to negative hospitalizations and deaths for the 80-plus group for that week. Quite simply, I fudged some numbers there to make the results seem more reasonable.

The next several posts will use the estimates calculated this way to analyze information about cases, hospitalizations, and deaths by age cohorts.

