By Mark Chandler
Ensuring data quality is a central pillar of engaging the public in doing science (e.g. community science, aka citizen science). The data collected by community scientists need to meet certain quality criteria to be useful as an accurate representation of the world the researchers are investigating. Lay participants also want to make sure their time and effort are put to good use, and that the data they collect are valid and useful.
How do we decide whether the data collected on our Urban Resiliency projects are “fit for purpose”?
Here’s one way: a team of well-trained people re-measure a sample of trees previously measured by public participants and see whether there is a difference. In December 2016, I joined a team of University of California researchers to do just that – re-measure 30 trees in several parks that had been measured previously by members of the public. Guess what? The data collected by the public were almost indistinguishable from our own measurements.
Almost all data (> 93 %) on all variables (GPS location, tree diameter at breast height (DBH), canopy breadth, percent permeable cover within 30 ft) collected by the public were essentially identical to those collected by the UCR research team. GPS locations matched those recorded by the UCR team; DBH varied by less than 0.2 inches between the two measures; canopy breadth was off by ~ 1 ft on each side; and differences in permeability between volunteer and UCR team data were less than 6 %. There were very few if any outliers (where there was a > 20 % difference between values). This is a huge tribute to the diligence of the public in collecting data, but also to the rest of the team (UC scientists Peter Ibsen, Darrel Jenerette, Julie Ripplinger and Earthwatch staff Ellie Perry) who designed the program and ensured the participants were trained and supported.
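To make the comparison concrete, here is a minimal sketch of the kind of agreement check described above: pair each volunteer measurement with the research team’s re-measurement of the same tree, and flag any pair whose relative difference exceeds the 20 % outlier threshold mentioned in the text. The function name and the example DBH values are illustrative assumptions, not the project’s actual data or code.

```python
def flag_outliers(volunteer, expert, threshold=0.20):
    """Return indices of paired measurements whose relative
    difference exceeds the threshold (default 20 %)."""
    outliers = []
    for i, (v, e) in enumerate(zip(volunteer, expert)):
        if e != 0 and abs(v - e) / abs(e) > threshold:
            outliers.append(i)
    return outliers

# Illustrative volunteer vs. expert DBH readings (inches) for five trees.
volunteer_dbh = [12.1, 8.4, 15.0, 6.2, 20.3]
expert_dbh = [12.0, 8.5, 15.1, 9.0, 20.2]

print(flag_outliers(volunteer_dbh, expert_dbh))  # [3] – only tree 3 differs by > 20 %
```

A check like this also makes it easy to report the headline figure: the share of measurements that agree is simply the number of unflagged pairs divided by the total.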
In full disclosure, we have only managed to achieve this level of data quality after learning from past missteps. This was not the first time we had done a data quality test. In March 2016, we had re-measured 20 trees for these same variables. Here we found differing levels of data quality depending on how we supported the public participants. Groups of participants who were trained and supported while collecting their data — by a field team leader such as Earthwatch’s Ellie Perry or a trained leader from one of our partners — collected data of very high quality. However, participants with less support – or who trained using only our website to learn the techniques – collected data of significantly lower quality (only 50 to 70 % of the data were high quality). Using these lessons learned, we started focusing more on training certified citizen scientists – and providing more events at which trained leaders were present.
Achieving high levels of data quality using community participants is not unusual – but it does require significant effort and thought from both the participants and the project creators (in this case UC Riverside scientists, Earthwatch and local partners). Re-measuring data collected by participants is but one step. In this case, we tackled the challenge of ensuring quality data in four phases:
- First, we make a data plan that describes how participants will be best prepared to collect quality data, and how the data will be shared and used after collection.
- Second, we prepare (train) and support participants during data collection events.
- Third, we use a series of tests to assess and ensure sufficient data quality, along with filters that flag potentially erroneous (out of range) data.
- Fourth, we learn from what has worked well (or not) – and amend our protocols and training accordingly to ensure we meet our desired standards.
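The filters in step three can be as simple as a lookup table of plausible ranges for each variable. The sketch below is a hypothetical illustration – the field names and the acceptable ranges are assumptions for the sake of example, not the project’s actual limits.

```python
# Illustrative valid ranges for each field a participant records.
VALID_RANGES = {
    "dbh_in": (1.0, 100.0),        # tree diameter at breast height, inches
    "canopy_ft": (1.0, 120.0),     # canopy breadth, feet
    "permeable_pct": (0.0, 100.0), # percent permeable cover within 30 ft
}

def flag_out_of_range(record):
    """Return the names of any fields that are missing or fall
    outside their plausible range, so they can be reviewed."""
    flagged = []
    for field, (lo, hi) in VALID_RANGES.items():
        value = record.get(field)
        if value is None or not (lo <= value <= hi):
            flagged.append(field)
    return flagged

record = {"dbh_in": 14.2, "canopy_ft": 250.0, "permeable_pct": 45.0}
print(flag_out_of_range(record))  # ['canopy_ft'] is implausibly large
```

Flagged records are not necessarily wrong – a filter like this simply queues them for a second look rather than discarding them automatically.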
These steps are actually best practice for doing science with or without the benefit of community participants. And just following the steps is no guarantee of success – as real differences exist across humans in their ability and desire to learn how to collect the needed data.
Clearly there will be data collection techniques or approaches that require highly trained observers with more sophisticated skills and capabilities – and not everyone will be able or want to achieve that level of ability. Species identification skills are one example. However, scientists obviously do not hold a monopoly on attaining these levels of skill; many community/citizen scientists have expertise in natural history observations that most scientists never attain!
While the steps involved in assuring data quality can be time consuming and repetitive – testing and ensuring high quality project data elevates the confidence of the project leads (e.g. partners, researchers, community leaders) and the participants. And it has been shown in various studies that increasing the confidence of the participants in executing their tasks also improves the data quality – forming a positive feedback loop in the project overall.
Luckily for the Urban Resiliency projects to date, we have found a community of researchers and public participants who bring the interest, passion and time to generate valuable and useful data. Thank you!
We are always looking for ways to improve and invite your comments and suggestions regarding data quality and community science projects. Just contact me directly at firstname.lastname@example.org.