When spring break fell through, I helped track the coronavirus for millions in the US

Peter Sun
Mar 12, 2020

Disclaimer: everything in this article is solely my personal opinion based on public sources. My writing does not represent the position of 1point3acres, the organisation I volunteer at. My work is 100% voluntary and is not paid by any party.

Exactly seven days ago, I pulled the trigger and cancelled my long-anticipated spring break trip to Peru, right before the country declared its first case of coronavirus. Safety aside, one big reason for making the call was an email that had come in hours earlier.

The email came from 1point3acres, one of the largest first-generation Chinese online communities in the United States. They invited me to join the data team of a real-time interactive map and tracker of the covid-19 outbreak in the US.

Since then, it has been an exciting (and sometimes stressful) journey. I found myself working nine hours a day during my senior spring break for one of the few volunteer-run trackers of this fast-developing public emergency in the US, which has seen 20,000,000+ visits and has been acknowledged on Twitter.

The tracker (link here) holds cumulative data on all the confirmed positive cases in the US. Using web scraping, bots and careful manual scrutiny, we update covid-19 patients in real time, case by case, down to the county level, and then present them on an interactive map.

What the tracker looked like on 3/7 — we have since added new features!

The database also aggregates headlines from all over the US, as well as helpful information such as university closures and company work-from-home policies.

Don’t forget to hit the button and buy us some boba!

As I reach the one-week mark, I realise that this project has given me a uniquely broad yet detailed perspective on how the covid-19 situation is unfolding in the US. Catching the few hours of the day before new cases pile up, I would like to share my first reflections as my role in this initiative continues.

The lack of standardised data practices caused confusion initially, but local authorities are gradually stepping up their game

When I started the job on March 6, I was overwhelmed by the disparity between states in the way they publicise the latest information. Since the CDC stopped providing national data and the bulk of testing was to be done locally instead, there was a period of confusion over who should do the math. At one point the county, the state and the news outlets in Washington State and California each reported different figures, making it extremely difficult for non-government observers like me to understand the actual situation. To make things more complicated, there was also little consensus on how cases should be reported. While Washington State announced cases that had been tested locally and came back positive as “presumptive positive” (which is the standard practice now, so good job!), South Carolina insisted on calling them “potential” cases on 3/6. This inconsistency in reporting could have confused the public. Given that it takes up to 48 hours for a sample to travel to the CDC for confirmation, downplaying the situation could have delayed the public response too. Releasing inconsistent, if not contradictory, figures is just as unhelpful for the general public.

A patient was transported from Life Care Center in Kirkland, Wash., on Wednesday. David Ryder/Reuters

Fortunately, I’ve been seeing signs that authorities are stepping up their game. For one thing, more than half the states have by now set up clear, daily updated data tables in similar formats. Presumptive positives are reported as such, with a detailed explanation of what the term means. Governments at different levels in some states have created data tables that synchronise with each other’s records, so I don’t have to dig around 10 .gov sites and scratch my head over 8 different numbers. The increasing testing capacity is also felt in the behaviour of the numbers. When I first started, updates on new cases tapered off after dinnertime; as of 3/10, they kept streaming in until midnight. Although we are starting to see an acceleration in confirmed cases, bear in mind that this is also a positive sign that authorities at all levels are gearing up.

The problem of not having a centralised database will get worse as the number of cases explodes

Since I joined, the number of new cases per day has doubled within four days, and I have witnessed first-hand how an influx of data can obstruct a precise understanding of the situation. The covid-19 outbreak is about the stories of each and every patient as much as the overall count. Simply flipping through the compiled counts of cases, deaths and recoveries is not enough for a comprehensive picture of the situation.

Unfortunately, traceability of individual cases in the public domain is the first thing to go when the numbers explode. Media and authorities can describe one patient in multiple ways, and sometimes important updates on existing patients (such as a death) are not tied to the same description. When cases were growing more slowly a few days back, this disparity already made it time-consuming to update existing records and caused double counting. Now that more than 40 cases can come in one report, it is almost impossible for me to trace who they are, even if I crunch the numbers like an investment banking monkey. I 100% agree with the authorities’ decision to withhold excessive patient information for privacy; but for the information deemed appropriate to release, can we make sure the same patient is described in the same way, so I know the “woman in her 70s” from yesterday is not the “resident who visited Italy”?

My most recent observation, on 3/11, is that the (hyper-)attention of the media is making this disparity worse. In addition to government updates, the media now report updates from hospitals, office towers, companies and almost any entity that has spotted a case. Yes, everyone should be aware that the next case could be around them, but reporters could at least check whether this is the same person reported previously. Instead, a patient who works at building A, lives in county B and then went to hospital C looks like three new cases, which unnecessarily triples the scare for readers who click on all three links. If only there were a centralised database everyone could check their calculations against, or every patient had a unique case number attached to all of his or her updates.
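To make the unique case number idea concrete, here is a minimal sketch in Python. It is not how our tracker works, and the `CaseRegistry` class, the `WA-0001` ID format and the three sample reports are all made up for illustration. It only shows that when every report cites a case number, three headlines about one patient stay one case.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    """One patient, identified by a single case ID (hypothetical scheme)."""
    case_id: str
    updates: list = field(default_factory=list)


class CaseRegistry:
    """Toy registry: every report must cite the case ID it refers to."""

    def __init__(self):
        self._cases = {}

    def add_report(self, case_id, source, description):
        # A report citing a known ID becomes an update, not a new case.
        if case_id not in self._cases:
            self._cases[case_id] = CaseRecord(case_id)
        record = self._cases[case_id]
        record.updates.append((source, description))
        return record

    def total_cases(self):
        # Count distinct patients, not distinct headlines.
        return len(self._cases)


registry = CaseRegistry()
# Three made-up reports that all describe the same (made-up) patient.
registry.add_report("WA-0001", "office tower notice", "employee at building A")
registry.add_report("WA-0001", "county health department", "resident of county B")
registry.add_report("WA-0001", "hospital statement", "patient admitted to hospital C")

print(registry.total_cases())
```

Without the shared ID, anyone compiling these three reports by description alone would have no reliable way to tell they are the same person.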

This work can be emotionally hard

As the initial rush of the work wore off and the learning curve flattened out, I found myself increasingly susceptible to the individual tragedies beneath each of the figures I punched in. One moment on the morning of 3/9 really struck me. I was recording a new death in Washington State when I realised that I had entered her record myself when she was diagnosed two days earlier. Since her demographic details were released upon her death, “case #299” suddenly became, to me, a grandma in her 90s who was supposed to be in a rocking chair. “A woman in her 90s”: five words were enough to make me take a break with tears in my eyes. She is not “Washington State + 1” in the record, but a lady just a few years older than my very own grandparents. Since then, in my “UPDATE” entries, I have stopped writing “death reported on [date]” and write “the patient passed away on [date]” instead.

The grandma used to live here. PHOTO: JASON REDMOND/REUTERS

So what’s next? I see myself deepening my commitment to this awesome team, and I am contemplating two initiatives: to bring our data to a wider audience by expanding our social media presence (shameless plug again: real-time covid-19 tracker), and to share more insights from working on the team here on Medium. Stay tuned!