The Race to Save Climate Change Data Runs Through NYC

Citizen archivists are using their tech skills to save data from the government.

Feb. 4, 2017

Jerome Whitington is standing in a dimly lit, warehouse-y room with roughly 100 other people. He is a slight, middle-aged, professorial man with a soft voice and a teal shirt that says “Data Rescue NYC.” Whitington is not a man comfortable speaking in front of crowds — his hands visibly shake when he addresses the buzzing, flannel-clad group — but he gets steadier as he begins talking about EPA nominee Scott Pruitt, a prominent climate change denier, and Secretary of State Rex Tillerson, the former CEO of Exxon.

“I hope you like earthquakes, because they think it’s normal.”

The earthquakes Whitington is referring to are fracking-induced ones, but he might as well be talking about the surge of interest in joining his particular cadre within the broader resistance to the Trump administration. It is a bitterly cold Saturday morning and the crowd at New York University’s Tisch School of the Arts is as big as expected. These people, many of them paid well enough to be spending this time in the neighborhood’s brunch restaurants, have come to make sure that the public will have continued access to government climate change data. It’s a difficult task the assembled crowd aims to accomplish quickly: The Trump administration began to remove climate change data from the public website within hours of the inauguration ceremony.

“You guys are so chipper for a Saturday morning!” a woman carrying a styrofoam cup of coffee squeals above the sound of her sneakers squeaking on the hardwood floor. The group she’s approaching chuckles politely, but few look up. Their faces glow blue. They are focused.

The bring-your-own-laptop gathering follows similar events at universities across North America, including the University of Toronto and the University of Pennsylvania, but the group assembled in New York is arguably the biggest, most technically proficient and most academically diverse yet. It has attracted people from outside typical hacking and archiving circles.

It wasn’t always this way, though. Organizers describe the weeks immediately after the election as haphazard and frantic — people started simply downloading web pages and forwarded them to universities and environmental activist organizations.

The Environmental Data Governance Initiative (EDGI) seems to have rapidly created order from chaos. The specificity of the goal has helped. Government environmental data informs testimony during hearings on regulation, and this information, specifically datasets from NASA, has been used since the mid-1980s to demonstrate the severity of the threat posed by climate change.

By Saturday’s event at Tisch, EDGI’s organizational prowess had become apparent. Designated rooms housed librarians skilled in data archiving, while software engineers methodically reviewed government domains (like the EPA’s). The Internet Archive normally acts as the protector of governmental data between administrations, archiving it before the next one comes in and potentially deletes major portions to support new legislation and platforms. It is a sort of repository of inconvenient truths. But the Internet Archive hasn’t been able to handle the overflow and sheer weight of work.

The event attracted New Yorkers with no ties to environmental organizations. They are indistinguishable from the hip crowd strolling just outside on Broadway. John Sockwell is part of this group. He is a product strategist at an ad agency. He’s wearing a green and blue plaid flannel shirt that complements his trimmed russet beard and tapping at his computer. He’s seated at the corner of a long table crammed with other archivists. It’s his first event. He’s here with colleagues.

“I grew up in the Pacific Northwest, just outside of Seattle,” he says before adding that he’s “just a human.” He shrugs. This is clearly a no-brainer for him.

Sockwell is working on the EPA’s National Service Center for Environmental Publications, helping to write a tool to download the data. EDGI has provided starter code for capturing the links. At the end of each page is a PDF; the tool will crawl to it, download it, upload it to a server, and then make its information available to anyone who wishes to access it.
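The link-capturing step can be pictured with a short sketch. This is not EDGI’s starter code or Sockwell’s tool; it is an illustration in Python’s standard library, and the names `PdfLinkParser` and `extract_pdf_links`, along with the `example.gov` address in the usage note, are hypothetical. It only finds the PDF links on a page that has already been fetched; downloading and uploading are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of links that point at PDF files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page's own address.
        absolute = urljoin(self.base_url, href)
        # Look only at the path, so query strings do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_links:
            self.pdf_links.append(absolute)


def extract_pdf_links(html, base_url):
    """Return the de-duplicated PDF links found in an HTML page, in order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links
```

Called as `extract_pdf_links(page_html, "https://example.gov/pubs/")`, it would hand back a list of addresses that a separate download step could then work through.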

Next to Sockwell is a lanky man in a baseball cap. He doesn’t want to be identified, but he’s stumbled upon a giant dataset called “Superfund,” a program administered by the EPA that “locates, investigate[s] and clean[s] up hazardous sites throughout the U.S.” He shows me a single file for Nebraska, which has nearly two thousand entries for locations of hazardous waste sites. The event’s organizers huddle around him, marveling at the sheer amount of data that he’s stumbled across. “Won’t that crash the site?” the man asks hesitantly. “That’s an entire dataset,” a guy in a teal shirt responds, advising him to focus on this particular dataset today. He won’t be able to singlehandedly go through all the data, he tells me, but it’s better than not having any of the data saved.

“The administration is chomping at the bit to take data off the internet,” says Whitington, a professor of anthropology at New York University and the organizer of the event. “It’s not going to happen in one day, it’s not going to happen in one week, but it’s going to happen.” Whitington describes EDGI’s events as “archivathons,” a race against time to save as much information as possible.

Matt Price, the technical lead of the project and a history of information studies professor at the University of Toronto, says people have been data archiving for the past few presidential terms, but the election of Trump created an almost panicked sense of urgency. “If the change weren’t from Obama to Trump, it would have been a simple task done by librarians,” Price said. “But this is a totally different situation.”

Tracking that information in itself is tedious. Margaret Janz is the data curation librarian at the University of Pennsylvania and has been involved with previous archiving events around the country. She says the best way to think about the events is as “feeding” and “sorting.” The feeding part, Janz says, is pretty straightforward, involving a web crawler, or a program someone develops to spot certain details and information. “There’s a web crawler, and basically what happens is that it crawls around the internet and captures screenshots of the page, and it also grabs the links on those pages so you can basically replay the whole site,” she explains. Those screengrabs and links are then put into the End of Term Presidential Harvest, the collection of sites saved from an administration’s four-year term (in this case, President Barack Obama’s second term).

But the web crawler only goes so deep — the first layer of links and information is saved, but subsequent layers might not be, and some sites simply have too much data or it’s just too hard to search all the embedded data within the site. “Think of the web as an ocean: The archivists are just crawling across the top, they’re just a sailboat,” Price says. “The top is not the problem, the deep ocean floor is, so to speak.”
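Price’s surface-versus-seafloor point can be made concrete with a toy model. The sketch below is an assumption-laden illustration, not the crawler the archivists use: the site is a plain Python dictionary mapping each page to the pages it links to, and the function names `crawl` and `missed_pages` are hypothetical. It walks the links breadth-first, stops at a chosen depth, and reports what a shallow pass leaves behind.

```python
from collections import deque


def crawl(start, links, max_depth=None):
    """Breadth-first walk over a link graph.

    `links` maps a page to the list of pages it links to.
    `max_depth` limits how many links away from `start` the walk goes;
    None means no limit.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # too deep: record the page but do not follow its links
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return seen


def missed_pages(start, links, max_depth):
    """Pages reachable from `start` that a depth-limited crawl never reaches."""
    return crawl(start, links) - crawl(start, links, max_depth)
```

In this model, the pages returned by `missed_pages` are the ones a volunteer would have to go after by hand or with a purpose-built script.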

That’s where the archivist events come in. Volunteers go through a list of URLs to find the ones that fall under the “too big” umbrella (like the hazardous waste site repository the anonymous tall volunteer had stumbled on), writing programs that will save the data.

“This is the ‘hacking’ part,” Janz said, air-quoting the verb. “They pick something off the list and they start retrieving from the site.” It’s not as simple and straightforward as it sounds and involves careful combing through pages and pages of numbers and lists, Price says: “The government is really huge, and even within the government, climate change is a really huge topic with tons of agencies and subgroupings.”

Once the data is retrieved, data librarians scrutinize it before dumping it onto Amazon servers, which are locked and protected: only people at the University of Pennsylvania can access them (the University of Michigan has a similar repository, but those are the only two within the United States). The public can search the data within the Data Refuge, but EDGI builds in a further layer of protection by shooting the data over to the University of Toronto’s servers.

That action isn’t a direct response to the Trump administration — “we’ve been meaning to do that for a long time,” Janz insists. But creating such a protected repository of data is meant to create a system that is both safe and, ironically, inaccessible to potential hackers who might want to steal the carefully salvaged data.

It might not make sense why archivists are meeting up now — about two weeks after Trump’s inauguration — to collect data. Whitington, however, said data deletion takes time, as previous administrations — most notably, that of George W. Bush and Dick Cheney — demonstrated. “They have different priorities than we have,” Whitington said. He clarified that data erasures wouldn’t necessarily take the form of a person systematically deleting pages; what the Trump administration is more likely to do is to strangle funding for maintenance of departments and positions that would normally maintain pages — “That way, it’s difficult to track moves,” Whitington explained.

To combat that, Whitington says EDGI and other watchdog groups are trying to track line item budgets and server maintenance logs. Data requires a person to maintain and update it, so EDGI keeps track of recalls: Has a person been removed from their duties? Has regular updating of a site suddenly stalled? These are indicators that data could disappear. It’s not easy, and it’s not a perfect method, Whitington acknowledges: “This stuff is really hard to track. When people get reassigned, you lose that expertise.”
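One of those warning signs, a site whose regular updates suddenly stop, lends itself to a simple check. The sketch below is not a description of how EDGI tracks anything; it is one plausible heuristic, and the function name `looks_stalled`, the `tolerance` factor of 3 and the three-date minimum are all assumptions chosen for illustration. It compares the current silence with the page’s usual gap between updates.

```python
from datetime import date
from statistics import median


def looks_stalled(update_dates, today, tolerance=3.0):
    """Guess whether a page's regular updates have stopped.

    `update_dates` is a list of `datetime.date` objects for past updates.
    The page is flagged when the time since the last update exceeds
    `tolerance` times the median gap between earlier updates.
    """
    dates = sorted(update_dates)
    if len(dates) < 3:
        return False  # too little history to know what "regular" means
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    typical_gap = max(median(gaps), 1)  # avoid a zero-day baseline
    silence = (today - dates[-1]).days
    return silence > tolerance * typical_gap
```

A flag from a check like this would be only a prompt for a person to look closer, which fits Whitington’s caution that the method is imperfect.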

Canada has been heralded as a safe bastion for the science community’s mutiny (though it’s technically against Canadian law to host information that “relates to acts of terror around the world”). The University of Toronto has the most prominent role in the digital archiving of climate change data. In fact, it is the epicenter of the environmental data movement: two technoscientists based there, Michelle Murphy and Patrick Keilty, were early promoters of guerrilla archiving, safekeeping the Canadian servers that “mirror” and store data collected in the U.S., with cryptographic caches and a peer-to-peer element between institutions that ensures safety.

It’s an ironic, though fitting, twist in the most recent history of scientific fact safeguarding, which can be traced to Prime Minister Stephen Harper’s administration’s silencing and erasure of climate change science. “They found it inconvenient for [Canada’s] oil and gas driven economic program,” Price said. “We saw data disappear then, the closure of libraries, data moving offline, government data collection being canceled.” To Price and other self-identified technoscientists, it’s a chance for Canada to take the lead in avoiding a potentially worse catastrophe than the one that afflicted their homeland just a few short years ago.

“Not everything will be saved,” Price acknowledges (EDGI has thus far saved about 24,000 web pages). “This is something we should have been doing even before we went into crisis mode. But the silver lining is that what we’re doing is creating democratic tools for a future where the people can ultimately manage the world’s knowledge.”

For now, the archivists are intent on salvaging as much information as possible. Volunteer programmers huddle over their computers, furrow their brows, and drink coffee as though each sip is a minor protest. The New York City tech community is probably best positioned to save climate change data, and while the mood is congenial, every movement takes on a sense of urgency. Desperation moves fingers across keyboards.

“Research on environmental data is the only thing saving us from an alternate reality,” says Whitington.
