Nai-Ching Wang, a Ph.D. student advised by Dr. Luther, successfully defended his dissertation today. His dissertation is titled, “Supporting Historical Research and Education with Crowdsourced Analysis of Primary Sources”, and his committee members were Dr. Luther (chair), Ed Fox, Gang Wang, and Paul Quigley, with Matt Lease (UT Austin School of Information) as the external member. Here is the abstract for his dissertation:
Historians, like many types of scholars, are often researchers and educators, and both roles involve significant interaction with primary sources. Primary sources are not only direct evidence for historical arguments but also important materials for teaching historical thinking skills to students in classrooms, and engaging the broader public. However, finding high quality primary sources that are relevant to a historian’s specialized topics of interest remains a significant challenge. Automated approaches to text analysis struggle to provide relevant results for these “long tail” searches with long semantic distances from the source material. Consequently, historians are often frustrated at spending so much time on manually assessing the relevance of the contents of these archives rather than on writing and analysis. To overcome these challenges, my dissertation explores the use of crowdsourcing to support historians in analysis of primary sources. In four studies, I first proposed a class-sourcing model where historians outsource historical analysis to students as a teaching method, and students learn historical thinking and gain authentic research experience while doing these analysis tasks. Incite, a realization of this model, was deployed in 15 classrooms with positive feedback. Second, I expanded the class-sourcing model to a broader audience, novice (paid) crowds, and developed the Read-agree-predict (RAP) technique to accurately evaluate relevance between primary sources and research topics. Third, I presented a set of design principles for crowdsourcing complex historical documents via the American Soldier project on Zooniverse. Finally, I developed CrowdSCIM to help crowds learn historical thinking and evaluated the tradeoffs between quality, learning and efficiency.
The outcomes of the studies provide systems, techniques and design guidelines to 1) support historians in their research and teaching practices, 2) help crowd workers learn historical thinking and 3) suggest implications for the design of future crowdsourcing systems.
Our research investigating the use of crowd workers to analyze satellite imagery of tree canopy coverage was accepted as a poster for the American Geophysical Union (AGU 2018) fall meeting in Washington, DC. The lead author is Forestry Ph.D. student Jill Derwin, with co-authors Valerie Thomas, Randolph Wynne, S. Seth Peery, John Coulston, Dr. Luther, Greg Liknes, and Stacie Bender. The abstract for the poster, titled “Validating the 2011 and 2016 NLCD Tree Canopy Cover Products using Crowdsourced Interpretations”, is as follows:
The 2011 and 2016 National Land Cover Database (NLCD) Tree Canopy Cover (TCC) products utilize training data collected by experienced photo interpreters. Observations of tree canopy cover were collected using 1-meter NAIP imagery overlaid on a dot grid. At each point in the dot grid, experts interpreted whether the point fell on canopy or not. The proportion of positive observations yields percent canopy cover. These data are used in conjunction with a set of 30-m resolution predictors (primarily Landsat imagery) to train a random forest model predicting TCC nationwide. We will test the use of crowdsourced observations of canopy cover to validate national products. Crowd-workers will apply the same training data photo interpretation methodology at plot locations across the United States subsampled from the public Forest Inventory and Analysis database. Each plot will have repeated samples, with multiple crowd observers interpreting each location. Using a multi-scale bootstrap-aggregation or ‘bagging’ approach at the plot- and dot-levels, we randomly select sets of interpretations from randomly chosen interpreters to train consecutive models. This bagging methodology is applied at both the plot level and the level of individual dot observations to test the within-plot crowd-sourced interpretation variance. We will compare the NLCD TCC models from 2011 and 2016 to multiple bagged samples and aggregated quality metrics such as the coefficient of determination and root mean square error to evaluate model quality. We will also compare these bagged samples to independent expert interpretations in order to gain insight into the quality of crowd interpretations themselves. This work provides insight into the utility of crowdsourced observations as validation of national tree canopy cover products.
In addition to comparing aggregated crowd interpretations to expert measurements, identifying conditions that result in disagreement in interpreters’ observations may help to inform the methodology and to improve interpreter-training for the crowdsourcing task.
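As a rough illustration of the dot-grid and bagging ideas in the abstract above, the following Python sketch computes percent canopy cover from dot interpretations, resamples interpreters and dots with replacement, and scores predictions with RMSE and the coefficient of determination. It is not the poster authors' code: all function names and the data layout are hypothetical, and averaging the resampled estimates stands in for the "consecutive models" the abstract mentions.

```python
import random
from statistics import mean


def percent_canopy(dots):
    """Percent canopy cover from one interpreter's dot calls (True = on canopy)."""
    return 100.0 * sum(dots) / len(dots)


def bagged_plot_estimate(interpretations, n_bags=200, seed=0):
    """Two-level bootstrap estimate of canopy cover for one plot.

    `interpretations` is a list with one entry per crowd interpreter,
    each entry being that interpreter's list of True/False dot calls.
    (Assumed layout, chosen for illustration only.)
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_bags):
        # Plot level: resample interpreters with replacement.
        drawn = rng.choices(interpretations, k=len(interpretations))
        # Dot level: resample each drawn interpreter's dots with replacement.
        resampled = [rng.choices(dots, k=len(dots)) for dots in drawn]
        estimates.append(mean(percent_canopy(dots) for dots in resampled))
    return mean(estimates)


def rmse(predicted, observed):
    """Root mean square error between predictions and reference values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return (sum(squared) / len(squared)) ** 0.5


def r_squared(predicted, observed):
    """Coefficient of determination of predictions against reference values."""
    obs_mean = mean(observed)
    ss_res = sum((o - p) ** 2 for p, o in zip(predicted, observed))
    ss_tot = sum((o - obs_mean) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

In this sketch, the bagged crowd estimate for each plot would play the role of the reference value (`observed`), and the corresponding NLCD TCC value would be the prediction; that pairing is an assumption made for the example.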
Investigators have enlisted the help of the public since the days of the first “wanted” posters, but in an era where extensive personal information, as well as powerful search tools, are widely available online, the public is increasingly taking matters into its own hands. Some of these crowdsourced investigations have solved crimes and located missing persons, while others have leveled false accusations or devolved into witch hunts. In this talk, Luther describes his lab’s recent efforts to develop software platforms that support effective, ethical crowdsourced investigations in domains such as history, journalism, and national security.
Crowdsourcing more complex and creative tasks is seen as a desirable goal for both employers and workers, but these tasks traditionally require domain expertise. Employers can recruit only expert workers, but this approach does not scale well. Alternatively, employers can decompose complex tasks into simpler micro-tasks, but some domains, such as historical analysis, cannot be easily modularized in this way. A third approach is to train workers to learn the domain expertise. This approach offers clear benefits to workers, but is perceived as costly or infeasible for employers. In this paper, we explore the trade-offs between learning and productivity in training crowd workers to analyze historical documents. We compare CrowdSCIM, a novel approach that teaches historical thinking skills to crowd workers, with two crowd learning techniques from prior work and a baseline. Our evaluation (n=360) shows that CrowdSCIM allows workers to learn domain expertise while producing work of equal or higher quality versus other conditions, but efficiency is slightly lower.
The increasing volume of text data is challenging the cognitive capabilities of expert analysts. Machine learning and crowdsourcing present new opportunities for large-scale sensemaking, but we must overcome the challenge of modeling the overall process so that many distributed agents can contribute to suitable components asynchronously and meaningfully. In this paper, we explore how to crowdsource the sensemaking process via a pipeline of modularized steps connected by clearly defined inputs and outputs. Our pipeline restructures and partitions information into “context slices” for individual workers. We implemented CrowdIA, a software platform to enable unsupervised crowd sensemaking using our pipeline. With CrowdIA, crowds successfully solved two mysteries, and were one step away from solving the third. The crowd’s intermediate results revealed their reasoning process and provided evidence that justifies their conclusions. We suggest broader possibilities to optimize each component, as well as to evaluate and refine previous intermediate analyses to improve the final result.
Dr. Luther and his frequent collaborator Ron Coddington, editor and publisher of Military Images magazine, gave an invited presentation on Civil War photo sleuthing at the 18th annual Image of War Seminar in Alexandria, VA, hosted by the Center for Civil War Photography. The presentation included a brief history of American Civil War photography and a live demonstration of the Civil War Photo Sleuth website.
Dr. Luther gave an invited keynote presentation at Vietnam War / American War Stories: A Symposium on Conflict and Civic Engagement, hosted by the Institute for Digital Arts & Humanities at Indiana University-Bloomington. Other keynote speakers included David Ferriero, the Archivist of the United States; and John Bodnar, Distinguished and Chancellor’s Professor of History at IU. Dr. Luther’s presentation was titled, “Rediscovering American War Experiences through Crowdsourcing and Computation,” and the abstract was as follows:
Stories of war are complex, varied, powerful, and fundamentally human. Thus, crowdsourcing can be a natural fit for deepening our understanding of war, both by scaling up research efforts and by providing compelling learning experiences. Yet, few crowdsourced history projects help the public to do more than read, collect, or transcribe primary sources. In this talk, I present three examples of augmenting crowdsourcing efforts with computational techniques to enable deeper public engagement and more advanced historical analysis around stories of war. In “Mapping the Fourth of July in the Civil War Era,” funded by the NHPRC, we explore how crowdsourcing and natural language processing (NLP) tools help participants learn historical thinking skills while connecting American Civil War-era documents to scholarly topics of interest. In “Civil War Photo Sleuth,” funded by the NSF, we combine crowdsourcing with face recognition technology to help participants rediscover the lost identities of photographs of American Civil War soldiers and sailors. And in “The American Soldier in World War II,” funded by the NEH, we bring together crowdsourcing, NLP, and visualization to help participants explore the attitudes of American GIs in their own words. Across all three projects, I discuss broader principles for designing tools, interfaces, and online communities to support more meaningful and valuable crowdsourced contributions to scholarship about war and conflict.
Congratulations to Crowd Lab undergraduate researcher Anne Hoang for winning 3rd place in the Faculty Choice category at the 2018 Virginia Tech Undergraduate Research in Computer Science (VTURCS) Symposium. There were 37 submissions including 22 capstone projects and 15 research projects.
Anne also won 1st place in the Capstone category and 1st place in the Marston Awards (industry pick) category. Amazing!
The Crowd Lab regularly participates in the VTURCS Symposium. Last year, our teams placed 1st and 3rd in the Faculty Choice category.
Our preliminary work on the Civil War Photo Sleuth project, which combines crowdsourcing and face recognition technology to identify unknown American Civil War soldier photos, was accepted to ACM Collective Intelligence 2018 in the most competitive oral presentation category (32% acceptance rate). We’ll be traveling to Zurich, Switzerland to present this work. The extended abstract is available online.
Congratulations to co-authors Crowd Lab Ph.D. student Vikram Mohanty and computer science undergraduate David Thames.