
NYT Open

How we design and build digital products at The New York Times.

Tracking Covid-19 From Hundreds of Sources, One Extracted Record at a Time

How The New York Times maintains a database of United States coronavirus cases and deaths pulled from a patchwork of regional health authorities.

The NYT Open Team
Jun 17, 2021


Illustration by Dalbert Vilarino

By Josh Williams and Tiff Fehr

As of this morning, The New York Times has made more than 9.98 million programmatic requests for Covid-19 data from websites around the world. The data we’re collecting are daily snapshots of the virus’s ebb and flow, including for every U.S. state and thousands of U.S. counties, cities and ZIP codes.

You may have seen slices of this data in the daily maps and graphics we publish at The Times, providing cumulative totals and 14-day trends that help readers see their local risk levels and outbreaks. Combined, these pages are the most-viewed collection in the history of nytimes.com. They’re a key component of the package of Covid reporting that won The Times the 2021 Pulitzer Prize for Public Service.

Internally, the depth and breadth of the data has been an invaluable reporting tool, helping us tell stories from the very first U.S. cases through the devastation of the winter wave through the first glimmers of good news as vaccines began to roll out.

The Times’s coronavirus tracking project was one of several efforts that helped fill the gap in the public’s understanding of the pandemic left by the lack of a coordinated governmental response. Another of these was Johns Hopkins University’s Coronavirus Resource Center, which collected both domestic and international case data. The Covid Tracking Project at The Atlantic marshalled an army of volunteers to collect U.S. state data, in addition to testing, demographics and healthcare facility data, each of which included thorny methodological challenges. These public data projects provided an essential and complementary set of sources for in-depth coverage.

At The Times, our coronavirus tracking work began with a single spreadsheet.

In late January 2020, Monica Davey, an editor on the National desk, asked Mitch Smith, a correspondent based in Chicago, to start gathering information in a Google spreadsheet about every individual U.S. case of Covid-19. One row per case, meticulously reported based on public announcements and entered by hand to tell the story of a person infected with the virus, with details like age, location, gender and condition.

We knew little then about the virus’s ability to spread, so our early case-by-case tracking followed a loose model established in the newsroom during other virus outbreaks. Covid-19 case counts in our early stories now seem unimaginably low: A March 1 story about the first confirmed case in Manhattan, for example, reported that “more than 80 people in the United States have been confirmed through laboratory testing.” Just five days later, another story put the total number of confirmed cases over 300.

By mid-March, the virus’s explosive growth proved too much for our workflow. The spreadsheet grew so large it became unresponsive, and reporters did not have enough time to manually report and enter data from the ever-growing list of U.S. states and counties we needed to track.

Building the Core Covid-19 Application

Just as offices (including ours) were shuttered and stay-at-home orders went into effect, many domestic health departments began rolling out Covid-19 reporting efforts and websites to inform their constituents of local spread. The federal government faced early challenges in providing a single, reliable federal data set.

The Times’s database of Covid-19 cases and deaths is sourced from the websites of hundreds of state and county health authorities, using a combination of manual and automated tasks. Credit: Guilbert Gates/The New York Times

The available local data were all over the map, literally and figuratively. Formatting and methodology varied widely from place to place, from rudimentary PDFs to information-rich dashboards. Even the question of what constituted a “case” wasn’t uniform: some places reported confirmed cases, while others reported suspected ones; still others reported both or made no distinction.
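
One way to keep such differing definitions visible in the data is to store which kind of count a source reported alongside the number itself. The following is a minimal sketch under that assumption, not a description of The Times’s schema; the function `summarizeCases` and the fields `confirmed`, `suspected` and `cases` are hypothetical names chosen for illustration.

```javascript
// Hypothetical record shape: a source may report confirmed cases,
// suspected cases, both, or a single undifferentiated "cases" figure.
function summarizeCases(report) {
  const hasConfirmed = Number.isInteger(report.confirmed);
  const hasSuspected = Number.isInteger(report.suspected);

  if (hasConfirmed && hasSuspected) {
    return {
      total: report.confirmed + report.suspected,
      definition: "confirmed+suspected",
    };
  }
  if (hasConfirmed) {
    return { total: report.confirmed, definition: "confirmed" };
  }
  if (hasSuspected) {
    return { total: report.suspected, definition: "suspected" };
  }
  if (Number.isInteger(report.cases)) {
    // The source made no distinction, so record that fact explicitly.
    return { total: report.cases, definition: "unspecified" };
  }
  throw new Error("Report has no usable case count");
}
```

Keeping the `definition` label next to each total lets a reviewer see at a glance when two places are not counting the same thing.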

Within The Times, a newsroom-based group of software developers was quickly tasked with building tools to augment as much of the data acquisition work as possible.

On March 10, 2020, the day before the World Health Organization declared the virus a pandemic, newsroom developer Will Houp wrote the first lines of code of a NodeJS-based web application capable of scraping Covid-19 data from a growing number of sources. In just a few days, Will, with teammates Ben Smithgall and Andrew Chavez, added features that enabled our journalists to edit, approve and apply custom math to the collected data.

On March 16, the core application largely worked, but the scraper directory sat empty. To tackle this massive effort, we recruited developers from across the company, many with no newsroom experience, to pitch in temporarily to write scrapers.

By the end of April, we were programmatically collecting figures from all 50 states, nearly 200 counties and many ZIP codes, census tracts and cities. The pandemic, our code complexity and our database all seemed to be expanding exponentially, which raised system-architectural concerns on a number of fronts.

We learned early that individual scrapers needed to be strategically fragile: We wanted them to break when the source website changed or failed to respond, to alert us so that our team could figure out whether the issue was with our code or the source website’s changes. In the early days of the pandemic, a few notable sites changed several times in just a couple of weeks, which meant we had to repeatedly rewrite our code.
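
The idea of a strategically fragile scraper can be sketched as a parser that refuses to guess. This is an illustrative example only, not code from The Times’s application: the names `EXPECTED_HEADERS`, `parseCount` and `parseCountyTable` are hypothetical, and the sketch assumes the page’s table has already been fetched and split into rows of cell text.

```javascript
// Hypothetical example: fail loudly whenever the source no longer
// looks the way the scraper expects, instead of returning bad numbers.
const EXPECTED_HEADERS = ["County", "Cases", "Deaths"];

function parseCount(text, label) {
  const cleaned = text.trim();
  // Accept only plain digits, with optional thousands separators ("1,234").
  if (!/^(\d{1,3}(,\d{3})+|\d+)$/.test(cleaned)) {
    throw new Error(`Unexpected value for ${label}: "${text}"`);
  }
  return Number(cleaned.replace(/,/g, ""));
}

function parseCountyTable(rows) {
  if (rows.length < 2) {
    throw new Error("Source table is empty");
  }
  const [headers, ...body] = rows;
  // A renamed or reordered column is treated as a breaking change.
  if (headers.join("|") !== EXPECTED_HEADERS.join("|")) {
    throw new Error(`Unexpected headers: ${headers.join(", ")}`);
  }
  return body.map((row) => {
    if (row.length !== EXPECTED_HEADERS.length) {
      throw new Error(`Unexpected row length: ${row.length}`);
    }
    const [county, cases, deaths] = row;
    return {
      county,
      cases: parseCount(cases, `${county} cases`),
      deaths: parseCount(deaths, `${county} deaths`),
    };
  });
}
```

Because every mismatch throws, a renamed column or an "N/A" where a number used to be stops the run and prompts a person to look, rather than letting a wrong figure flow through quietly.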

Between scrapers breaking and a seemingly never-ending list of new target sites, it was time to staff up for real. We needed dedicated developers who could focus their full attention on the project, allowing them to not only maintain the scrapers, but also build up the core application and make the scrapers leaner, testable and faster to develop.

From mid-June 2020 on, we have staffed the scraping team with about six developers at any given time, most with prior journalism and data experience. The scraping team was a critical part of the newsroom’s virus tracking project, which would ultimately pull in more than 100 journalists and engineers from across The Times.

As many as 50 people beyond the scraping team have been actively involved in the day-to-day management and verification of the data we collect. Some data is still entered by hand, and all of the data is manually verified by reporters and researchers. This has been a seven-day-a-week operation since March 2020.


In addition to publishing data to The Times’s website, we made our data set publicly available on GitHub in late March 2020 for anyone’s use.

News Judgment, Data Quandaries and Dashboard Changes

The Covid-19 outbreak was an unprecedented challenge for local officials in collecting and presenting public health information quickly. As outsiders hoping to aggregate and automate this data, we watched in real time as health departments across the country scrambled to get websites up.

As our Covid-19 scraping effort grew, our newsroom developers gained just as much fluency in the subject matter as The Times’s reporters. We were trusted and expected to exercise news judgment on a daily basis, puzzling out tricky questions: What page on a state’s website had the most recent and most methodologically clear figures? Which figures meant what? Did we think officials were doing math we could reverse-engineer and confirm? Should we reach out to the state or county to confirm details?

Meanwhile, our targets kept changing.

In all but a handful of cases, the websites for the first few months were being continuously tweaked, requiring us to constantly edit and rewrite our scrapers. Broadly, we saw a pattern of sites transitioning from more rudimentary presentations — like PDFs, hand-edited tables or freeform text descriptions of case numbers — to feature-rich dashboards, usually built with business intelligence tools from major vendors like ArcGIS, Tableau and Microsoft’s Power BI products.

We witnessed hundreds of health departments upgrade their presentations of complex, geographic and public health data across weeks and months.

Public Data !== Open Data

For years, civic-minded organizations that identify with the “open data” movement have pushed governments to release, in usable form, publicly funded data for research and reuse. At all levels of government, this has started to happen. These data portals provide citizens, journalists and researchers access to vast, documented data sets, often with tools to query and visualize them.

The pandemic has made clear, though, that there’s a massive gulf between the ideal of open data and the reality of public data. At a moment when journalists and researchers needed timely access to public health data, the federal government had little to offer, and few local governments were prepared to publish open, accessible data on their own.

Even now, about 18 months in, a very small percentage of our programmatic requests are extracting data from well-maintained, documented, queryable APIs.

And as jurisdictions’ sites grow more sophisticated, the task of scraping data from them can become more complicated. Early Covid-19 sites in plain HTML or PDF format were brittle and ever-changing, but they were usually easy to parse. As sources switched to commercial dashboards, whose underlying architectures vary greatly, it was often more difficult to get the data.

In the best of cases, a dashboard has an easily findable data endpoint with vendor documentation for how to query and retrieve the underlying data as JSON. The problem is, these systems are documented for the providers of the data. The data itself isn’t documented by the municipalities, so even when we find it, there’s often a lot of work to figure out how it relates to data shown in the dashboard.
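
To illustrate the kind of work involved, here is a minimal sketch of mapping an undocumented JSON payload onto known fields and checking it against a figure shown in the dashboard. Everything in it is an assumption made for illustration: the field names `CNTY_NM`, `POS_CT` and `DTH_CT` are invented, and `mapRecord` and `reconcile` are hypothetical helpers rather than part of any vendor’s API or of The Times’s code.

```javascript
// Hypothetical mapping from our field names to an agency's
// undocumented column names, worked out by inspecting the payload.
const FIELD_MAP = { county: "CNTY_NM", cases: "POS_CT", deaths: "DTH_CT" };

function mapRecord(raw) {
  const record = {};
  for (const [ours, theirs] of Object.entries(FIELD_MAP)) {
    // A missing column usually means the source changed its schema.
    if (!(theirs in raw)) {
      throw new Error(`Missing field ${theirs}`);
    }
    record[ours] = raw[theirs];
  }
  return record;
}

// Compare the sum of per-county cases with the statewide total that
// the dashboard displays, to test our reading of the field names.
function reconcile(records, displayedTotalCases) {
  const sum = records.reduce((total, r) => total + r.cases, 0);
  return { sum, displayedTotalCases, matches: sum === displayedTotalCases };
}
```

A mismatch in `reconcile` does not say which side is wrong, only that the mapping or the methodology needs a closer look, possibly with a call to the agency.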

More typically, commercial dashboards aren’t meant to be queried independent of the pre-configured presentation. In the context of tools designed for business intelligence, this is perfectly reasonable: Most dashboards are private and their metrics are very specific to each company’s needs. But when employed in the service of public health data portals, their data formats are unfriendly to journalists, researchers or citizens who may need to download the data.

We suspect none of this friction is intentional. It is reasonable to assume that many health departments have no idea we are pulling their data several times a day, and understandable that their primary concern is their constituents, not the needs of a handful of data aggregators.

Winding Down

From elections to the Olympics, our newsroom developers help The New York Times report and present the news. In most cases, there’s a natural shift in tempo when we transition our effort from active development to maintenance or retirement. For Covid-19, there’s no obvious moment on the horizon, but things are changing.

As vaccinations curb the virus’s toll across the country, we now see a number of health departments and other sources dropping back their staffing and updating their data less often.

A drastic reduction in active cases is great news, and it has meant that some of our own custom data collection could be shut down. Since April 2021, our number of programmatic sources has dropped nearly 44 percent.

This was possible in part due to some long-awaited data from the federal government, updated more quickly than what we saw in 2020. In February 2021, the Centers for Disease Control and Prevention expanded their reporting to include comprehensive, county-level figures that had only been partly available in 2020. We are now able to use the new C.D.C. files for much of our county-by-county virus reporting.

Our goal is to get down to about 100 active source targets by late summer or early fall, mainly for tracking potential hot spots. The falling number of cases has helped us return to more normal staffing levels and to rebalance our work and lives, just as our readers are doing.

A version of this article was published on Times Insider and in print on June 24, 2021 in Section A, Page 2 of the New York edition of The New York Times.

Tiff Fehr (@tiffehr) is a staff engineer and project lead with the Interactive News team, a group of technologists embedded in the newsroom of The New York Times. She focuses on custom software development initiatives for the newsroom, including The Times’s Covid-19 real-time data-acquisition pipeline. She previously worked at msnbc.com, as well as with a few Seattle startups.

Josh Williams (@sjwilliams) is a multimedia editor in the Graphics Department of The New York Times. He works across desks on a variety of visual stories, primarily as a designer and programmer. Before joining The Times in 2011, Josh taught at U.C. Berkeley, led a team of newsroom designers and developers at the Las Vegas Sun, and made exhibits at the Smithsonian Institution.

Tiff and Josh were leaders on The Times’s Covid-19 data collection team, whose efforts were at the heart of the work that won the Pulitzer Prize for Public Service in 2021.


We’re New York Times employees writing about workplace culture, and how we design and build digital products for journalism.