Session Block 2 – Wednesday, June 3, 13:15-14:45
B5: Building on Common Ground: Integrating Principles, Practices and Programs to Support Research Data Management
- Time: 13:15 - 14:45
- Location: Blegen Hall 120
- Chair: Amy West
- Track: Data Infrastructure and Applications
Transparency from Scratch: Encouraging Openness and Enhancing Publications in Qualitative Political Science
- Presenters: Colin Elman (Syracuse University) and Diana Kapiszewski (Georgetown University)
- Abstract: The American political science community is engaged in a rigorous and wide-ranging conversation about research transparency, involving communities from across the epistemic spectrum. Broad consensus exists on the need for openness and for the project's general principles to be instantiated in research tradition-specific practices that preserve methodological diversity. Nonetheless, transparency is a novel project for most qualitative political scientists, requiring the development of new practices and strategies. This essay highlights the epistemic, intellectual, and sociological challenges of augmenting transparency in qualitative research -- and some of the pragmatic and operational difficulties of doing so. A central challenge is representing digital documents in online publications. We highlight an innovative transparency technique, active citation (Moravcsik 2010), which involves hyperlinking citations to central or controversial text in a publication to an accompanying "transparency appendix" (TRAX). A TRAX comprises an overview of the trajectory of the research project underlying the publication, an excerpt from the cited source, an annotation identifying the micro-connection between the cited source and the textual claim, and ideally a link to/copy of the source itself. We conclude by discussing the implications for international scholars of more data being made available, and research being made more transparent, in this novel fashion.
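The components of a transparency appendix listed in the abstract (overview, excerpt, annotation, source link) can be pictured as a simple data structure. This is an illustrative sketch only; the class and field names are assumptions, not a published TRAX schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ActiveCitation:
    source_excerpt: str                 # excerpt from the cited source
    annotation: str                     # micro-connection between source and claim
    source_link: Optional[str] = None   # link to, or copy of, the source itself

@dataclass
class Trax:
    project_overview: str               # trajectory of the underlying research project
    citations: list[ActiveCitation] = field(default_factory=list)

# Hypothetical example of one active-citation entry
appendix = Trax(project_overview="Elite interviews conducted 2012-2013.")
appendix.citations.append(ActiveCitation(
    source_excerpt="The minister confirmed the reform had stalled.",
    annotation="Supports the textual claim that the policy was abandoned.",
    source_link="https://example.org/archive/interview-07",
))
print(len(appendix.citations))  # → 1
```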
Making data citation connections
- Presenter: Anne Etheridge, UK Data Archive
- Abstract: The UK Data Service is exploring ways of citing data from study level to subsets of data to paragraphs of text. We produce citations for each of our data catalogue records in Discover. Each citation includes a persistent identifier, created via DataCite, to give a unique access code for the data. We are working on downloading the citations in multiple formats and adapting the tools we have for our qualitative citations to make these citations easier to find and use. We have been working with the Research Data Alliance Data Citation Working Group to find the best ways to cite subsets of data and apply them to our Nesstar records and international macrodata. We have tools to dynamically create a citation from paragraphs in qualitative text. Users select a passage and we then mint a unique identifier on the fly that can be used to cite, precisely, that piece of text. Others reading subsequent research can then go straight to that particular paragraph to read the text in context. Our tools allow the citation to be simply copied and pasted into any reference list.
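The on-the-fly minting of a passage-level identifier described above can be sketched as follows. This is a hypothetical illustration, not the UK Data Service's actual implementation: the function names and the hash-based identifier scheme are assumptions (a production service would register identifiers with a resolver such as DataCite rather than derive them locally).

```python
import hashlib

def mint_fragment_identifier(study_pid: str, passage: str) -> str:
    """Derive a stable identifier for a user-selected passage of
    qualitative text by hashing the passage together with the study's
    persistent identifier. (Illustrative scheme only.)"""
    digest = hashlib.sha256(f"{study_pid}\n{passage}".encode("utf-8")).hexdigest()[:12]
    return f"{study_pid}/fragment/{digest}"

def format_citation(creator: str, year: int, title: str, identifier: str) -> str:
    """Render a citation string ready to copy and paste into a reference list."""
    return f"{creator} ({year}). {title}. UK Data Service. {identifier}"

# Hypothetical study PID and passage
pid = "doi:10.5255/UKDA-SN-12345-1"
frag = mint_fragment_identifier(pid, "An interview passage selected by the user.")
print(format_citation("Smith, J.", 2015, "Qualitative study", frag))
```

Because the identifier is derived deterministically from the study PID and the passage, selecting the same passage twice yields the same citation, so later readers resolve to exactly the same piece of text in context.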
Bridging Disciplines: Assessing the Interdisciplinary Impact of Open Data
- Presenters: Robert R. Downs and Robert S. Chen, Columbia University
- Abstract: Freely disseminating scientific data can contribute to multiple disciplines across the physical, social, health, and engineering sciences. If the impact of data centers is not measured, stakeholders will not know whether data centers, archives, and libraries, and the data that they disseminate, are having a positive impact on the conduct of science. Data citations provide evidence on the use of data in various stages of the research process, including problem definition, statistical analysis, modeling, and validation. Measuring the interdisciplinary citation of scientific data disseminated by a data center can reveal the degree to which the data center is supporting cross-disciplinary research. Analysis of a decade of data citations demonstrates the interdisciplinary use of scientific data and the impact that one data center has had across disciplinary boundaries.
Streaming access to oral history data
- Presenter: Marion Wittenberg, Data Archiving and Networked Services (DANS)
- Abstract: DANS, the research data archive in the Netherlands, has a growing collection of audiovisual data. This includes the witnesses' stories of the Second World War Heritage Program, the Oral History Project Indonesia, and interviews with Dutch veterans. The collection, with almost 2000 interviews, is accessed by various users. For privacy reasons, not all datasets are open access. In my presentation I will introduce the ways in which we treat the audio and video data: the difference between high-resolution archival storage and streaming access, restricted access control for privacy-sensitive data, and future plans for subtitle search.
Freedom on the Move: Discovering the Plight of Runaway Slaves in the United States
- Presenters: Ed Baptist, Jeremy Williams, and Bill Block, Cornell University
- Abstract: Slavery is one of the most traumatic and defining aspects of United States history. Despite this fact, there is a paucity of machine-actionable data about the individuals who were bought and sold as slaves in the United States. Substantial information does exist, however, in the form of advertisements, placed by enslavers, in antebellum newspapers. These advertisements included any detail that might help readers identify the fugitive: the name, height, build, appearance, clothing, literacy level, language, accent, and so on of the runaway, but the advertisements are not in formats amenable to analysis. Led by Cornell University, Freedom on the Move (FOTM) is a comprehensive and highly collaborative effort to transcribe and parse an estimated 100,000 advertisements using OCR and crowd-sourcing to create a new academic data resource. The data is stored in a relational database which is described and published using DDI-DISCO and W3C-PROV ontologies. This paper will introduce the project, provide an overview of the system architecture, and describe how FOTM hopes to use semantic metadata to facilitate discovery by researchers, data citation, and interoperability with other datasets.
Web Archiving for collection development: Capturing event data on the Umbrella Movement
- Presenter: Daniel Tsang, University of California, Irvine
- Abstract: Bibliographers have been slow to recognize web archiving as a function of collection development, beyond personnel in Special Collections (archiving the university domain) or government documents (archiving government sites). Yet as more and more data is generated online, including in the social sciences, it is timely to look at how web archiving can fit into a collection development policy and be part of a selector's duties. I assess existing collection development policies on web archiving in selected academic libraries and national institutions. This presentation focuses on archiving web content relating to Hong Kong's Umbrella Movement and assesses the complications of such an endeavor in collection development and what can actually be captured in a web crawl. It evaluates the research value of such a collection while highlighting some key criteria for selecting sites to crawl. It discusses the issue of international crawling of sites in another country or region and the potential benefits and risks of such a project. Finally it offers a case study of how social media can be captured and made accessible to researchers in years to come.
- Time: 13:15 - 14:45
- Location: Blegen Hall 130
- Chair: Steven McEachern
- Track: Data Infrastructure and Applications
Aristotle Metadata Registry - A new contender for government metadata management
- Presenter: Samuel Spencer, National Health Performance Authority
- Abstract: The ISO/IEC 11179 specification remains the gold standard in the definition of metadata registries. However, to date there have been relatively few open and conformant implementations. The AIHW METEoR metadata registry has a strong reputation as a leading, standards-conformant, and public-facing registry for government metadata; however, its growth has pushed it further than its original scope and technological base can support. Based on the system architecture of METEoR, the Aristotle Metadata Registry is a rebuilt implementation that provides a free, open-source, easy-to-install, and scalable metadata registry. With an enterprise-level search engine to improve discoverability, a thoroughly tested permissions suite that ensures security around the publication of information, and a rich authoring environment, Aristotle-MDR aspires to lead a new phase of metadata registries. The object-oriented principles of the Python-based Django web framework complement the principle of extensible metadata described by the ISO/IEC 11179 standard. This design allows Aristotle-MDR to support the inclusion of third-party modules that provide additional metadata objects, including health indicators, datasets, and questionnaires, as well as a wide range of export formats such as Adobe PDF and multiple versions of the Data Documentation Initiative XML format.
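The extension mechanism described above — third-party modules contributing new metadata object types on a common base — can be sketched in plain Python. This is not Aristotle-MDR's actual code (which builds on Django models); the class names and self-registration scheme below are illustrative assumptions about the 11179-style pattern.

```python
from dataclasses import dataclass

# Registry of known metadata item types; third-party modules populate it
# simply by subclassing the common base class.
REGISTRY: dict[str, type] = {}

@dataclass
class ManagedItem:
    """Base administered item: every registered metadata object shares
    at least a name and a definition."""
    name: str
    definition: str

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        REGISTRY[cls.__name__] = cls  # subclasses self-register on definition

@dataclass
class DataElement(ManagedItem):
    value_domain: str = "unspecified"

@dataclass
class Indicator(ManagedItem):  # e.g. contributed by a health-indicator module
    numerator: str = ""
    denominator: str = ""

print(sorted(REGISTRY))  # → ['DataElement', 'Indicator']
```

The registry then gives the core system a uniform way to enumerate, search, and export every contributed type without knowing about it in advance.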
RAIRD: Implementing GSIM for Norwegian Administrative Registers
- Presenter: Arofan Gregory [paper co-written with Ørnulf Risnes, NSD], Metadata Technology North America
- Abstract: The Generic Statistical Information Model (GSIM) is a conceptual model for describing statistical data and metadata, created by the UNECE's High Level Group. It is having a profound effect on standards such as DDI, and is being widely implemented by statistical agencies. One such implementation is the RAIRD project: a joint effort between the Norwegian Data Archive and the Norwegian Statistical Agency to provide online analysis tools for a huge set of Norwegian administrative data. Like many registers, much of the Norwegian data describes events ("event history data"), which is not well described using traditional approaches such as those found in GSIM. The RAIRD project used GSIM as the basis of its implementation, which involved extending and refining the model. This presentation shows how traditional data and metadata models can be extended to better describe administrative registers and the metadata needed by systems supporting their online analysis.
Metadata in Action: Driving TREC Survey Data Production and Dissemination
- Presenter: Shane McChesney, Metadata Technology North America
- Abstract: In the context of the Translating Research in Elder Care (TREC) survey, we have been collaborating with the Knowledge Utilization Studies Program (KUSP), part of the Faculty of Nursing at the University of Alberta, and NOORO Online Research, towards the establishment of a metadata-driven platform for facilitating the production, dissemination, and analysis of the TREC2 survey data. The first wave of data collection is currently in progress. This presentation will demonstrate: (1) how metadata is leveraged to facilitate loading data into a MySQL-based data warehouse, enabling high-performance access to all the survey program data; (2) tools for exporting microdata subsets to statistical packages, in particular R, SAS, and SPSS, for computing/aggregating complex indicators or analysis by researchers; and (3) bridging the platform with R-Shiny and R-Markdown, two open-source products leveraging the R statistical platform, for the publication of data into dynamic web dashboards and the production of reports. This project is supported by the Canada Foundation for Innovation (CFI).
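Step (1) — metadata-driven loading into a data warehouse — can be sketched minimally: the table schema is generated from survey metadata rather than hard-coded. The variable names, table name, and sample values below are hypothetical, and `sqlite3` stands in for the project's MySQL warehouse.

```python
import sqlite3

# Survey metadata: (variable name, storage type). In the real platform this
# would come from the survey's documentation, not a hand-written list.
variables = [
    ("resp_id", "INTEGER"),   # respondent identifier
    ("facility", "TEXT"),     # care facility code
    ("score", "REAL"),        # computed scale score
]

conn = sqlite3.connect(":memory:")

# Generate the table definition from the metadata
cols = ", ".join(f"{name} {sqltype}" for name, sqltype in variables)
conn.execute(f"CREATE TABLE trec2 ({cols})")

# Load collected rows; the placeholder list is likewise metadata-driven
rows = [(1, "F01", 3.5), (2, "F02", 4.1)]
placeholders = ", ".join("?" for _ in variables)
conn.executemany(f"INSERT INTO trec2 VALUES ({placeholders})", rows)

# Export a microdata subset, as a statistical package (R/SAS/SPSS) might consume it
subset = conn.execute("SELECT resp_id, score FROM trec2 WHERE score > 4").fetchall()
print(subset)  # → [(2, 4.1)]
```

Driving both the schema and the load from one metadata source keeps the warehouse, the export tools, and the downstream dashboards in sync when the survey instrument changes.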
- Panel: Elizabeth Quigley (Institute for Quantitative Social Science) and Dan Valen (Figshare)
- Abstract: We can improve scientific communication to increase efficiency in the accumulation of knowledge. This requires at least two changes to the present culture. One change is conceptual - embracing that progress is made more rapidly via identifying errors in current beliefs than by finding support for current beliefs. Such a shift could reduce confirmation bias, unproductive theory testing, and the blinding desire to be right. The other change is practical - science will benefit from improving technologies to document and connect the entire lifecycle of research projects. This presentation will focus on the practical aspects, illustrated through the efficiencies gained via the Open Science Framework and its add-on connections to Dataverse and Figshare. The presentation will specifically talk about how research support teams (i.e., data librarians, repository managers, and others) can utilize these tools to help their users improve daily workflows.
- Panel: Lizzy Rolando (Georgia Tech Library) and Kelly Chatain (Institute for Social Research, University of Michigan)
- Abstract: Research data management continues to emerge as a distinct information discipline with unique needs, policies and practices, but there are many ways in which it overlaps with the existing disciplines of records management and archives. Examining areas where policies, practices, and resources can be shared between them is increasingly valuable as the digital information universe becomes more complex. This session will examine those shared areas, highlighting efforts to engage with different information communities and programs. Kelly Chatain, Associate Archivist, University of Michigan, will present her work as an ‘embedded’ archivist within the Survey Research Center, focusing on records management tools and archiving principles used to facilitate a practical and cultural shift in the creation of data. Bethany Anderson, Visiting Archival Operations and Reference Specialist, University of Illinois at Urbana-Champaign, will discuss ways of integrating the work of academic archives and research data services to appraise, manage, and steward data. Research Data Librarian Lizzy Rolando will discuss Georgia Tech’s efforts to identify areas of convergence between the functional and policy requirements of a research data repository ecosystem and the requirements of a born-digital archives repository ecosystem.