Session Block 4 – Thursday, June 4, 10:15-12:15
- Time: 10:15 - 12:15
- Location: Blegen Hall 120
- Chair: Karen Hogenboom
- Track: Data Services Professional Development
Data Professionals' training challenges in dynamic work environments
- Presenter: Adetoun Oyelude, Kenneth Dike Library, University of Ibadan
- Abstract: The use of Information and Communications Technology (ICT) by various data professionals like data scientists, data curators, data librarians, data archivists and others has been the focus of researchers worldwide in the past few decades. Workspaces, workplaces and workflows are evolving daily and oftentimes struggling to cope with the emerging technologies. In their various functions in workplaces where career advancement is a sign of progress, it is required that data professionals be further trained, and with enhanced skill, move up the career ladder. Training of data professionals to meet the expectations of the work environment is thus a challenge. The challenges faced by data professionals in the course of training themselves are the focus of this paper. Using extensive literature review, and survey methods of gathering data, ten data professionals working in different types of work environments were interviewed about the challenges they faced in training. The challenges they identified and described as well as solutions to the challenges proposed are discussed. Recommendations are made on ways in which future challenges can be surmounted especially in the face of the dynamic nature of the technology driven work environment.
The Data Librarian Life Cycle
- Presenter: A. Michelle Edwards, CISER - Cornell Institute for Social and Economic Research
- Abstract: The journey of a data librarian or data specialist is certainly not a straight one, but one that can be very winding and extremely exciting and challenging all at once. If we follow the experiences of many data librarians, we can see a trend that closely mimics that of the data lifecycle. Whether you are the "accidental" data librarian, the individual for whom data is merely one of many hats, or the experienced data specialist, we see many common threads. We embrace the new challenge that data present (concept), we learn everything we can about that challenge (collection), we develop new skills (processing), often very unique skills, we develop dynamic services to conquer the challenge (distribution, discovery), we evaluate the service (analysis), and then we look forward to the next challenge (concept) that is already knocking down our doors. But, what happens when you change departments within your institution or you change institutions? Can we repurpose what we have learned and created? The goal of this paper is to present the approaches taken when a data librarian engages in the "re-purposing" stage of the Data Librarian Life Cycle.
Comparing policies for open data from publicly accessible international sources
- Presenter: Line Pouchard, Purdue University Libraries
- Abstract: The Continuous Analysis of Many Cameras (CAM2) project is a research project at Purdue University for Big Data and visual analytics. CAM2 collects over 60,000 publicly accessible video feeds from many regions around the world. These data come from 10 national and international sources, including New York City, the city of Hong Kong, Colorado, New South Wales, Ontario, and the National Park Service. These video feeds were originally collected for improving the scalability of image processing algorithms and are now becoming of interest to ecologists, city planners, and environmentalists. With CAM2's ability to acquire millions of images or many hours of videos per day, collecting such large quantities of data raises questions about data management. The data sources all have heterogeneous policies for data use. Separate agreements had to be negotiated between each source and the data collector. In this paper, we propose to compare data use policies that are attached to the video streams and study their implications for open access. One restriction is that some sources limit the longevity of the data. As the value of this data becomes realized over the long term, issues of storage capacity and cost of stewardship arise.
- Gary Berg-Cross (Data Foundation and Terminology (DFT) Working Group)
- Reagan Moore (Practical Policy Working Group)
- Allison Powell (Data Type Registries Working Group)
- Jane Greenberg (Metadata Standards Directory Working Group)
- Abstract: An international group of collaborating data professionals launched the Research Data Alliance (RDA) in March 2013 with the vision of sharing data openly across technologies, disciplines, and countries to address the grand challenges of society. RDA is supported by the European Commission, the U.S. National Science Foundation, and the Australian government, and it meets in plenary twice a year. Members of the RDA voluntarily work together in Working Groups with concrete deliverables or in exploratory Interest Groups. Some of the foundational RDA Working Groups have completed the first phase of their projects and have produced results. This session is intended to highlight their activities and accomplishments.
- Gary Berg-Cross, Peter Wittenburg, and Raphael Ritz: Data Foundation and Terminology (DFT) Working Group
- In an era of Big Data we still lack widely used best practices and a common model in key areas of data organization. Without common terminology used to communicate between data communities, some aspects of data management and sharing are inefficient and costly, especially when integrating cross-disciplinary data. To advance the data community discussion towards a consensus core model with basic, harmonizing principles, we developed basic terminology based on more than 20 data models presented by experts coming from different disciplines and about 120 interviews and interactions with different scientists and scientific departments. From this we crafted a number of simple definitions around digital repository data based on an agreed conceptualization of such terms as digital object, persistent ID, state information, and digital repository.
- Reagan Moore and Rainer Stotzka: Practical Policy Working Group
- Computer-actionable policies are used to enforce management, automate administrative tasks, validate assessment criteria, and automate scientific analyses. The benefits of using policies include minimization of the amount of labor needed to manage a collection, the ability to publish to the users the rules that are being used, and the ability to automate process management. Currently scientific communities use their own sets of policies, if any. A generic set of policies that can be revised and adapted by user communities and site managers who need to build up their own data collection in a trusted environment does not exist. Thus, the goals of the working group are to bring together practitioners in policymaking and policy implementation; to identify typical application scenarios for policies such as replication, preservation, etc.; to collect and to register practical policies; and to enable sharing, revising, adapting, and reuse of computer-actionable policies. This presentation will provide an overview of the working group and its activities, including a recent survey to elicit the types of policies currently being enforced as well as policy areas considered to be the most important.
- Larry Lannom, Daan Broeder, Giridhar Manepalli, and Allison Powell: Data Type Registries Working Group
- A Data Type Registry provides a way to easily register detailed and structured descriptions of data that can range from simple single value elements up to complex multi-dimensional scientific datasets. The benefits of registration include enabling those who did not create a given instance of typed data to understand and potentially reuse it, to encourage others to use established data types in their own data collections and analysis efforts, and to build services and applications that could be applied to standardized data types. This presentation will focus on the need for and advantages of Data Type Registries and will provide a demonstration of the latest version of a registry prototype.
- Rebecca Koskela, Jane Greenberg, Alex Ball, and Keith Jeffery: Metadata Standards Directory Working Group
- The RDA Metadata Standards Directory WG (MSDWG) is comprised of individuals and organizations involved in the development, implementation, and use of metadata for scientific data. The continued proliferation and abundance of content-driven metadata standards for scientific data present significant challenges for individuals seeking guidance in the selection of appropriate metadata standards and automatic processing capabilities for manipulating digital data. A collaborative, open directory of metadata standards applicable to scientific data can help address these challenges. A directory listing metadata standards applicable to research data will be of tremendous benefit to the global community facing data management challenges. Previous efforts provide evidence of this need, although these undertakings were not intended to be collective, sustainable directories. Discipline-specific metadata efforts have led to duplicative work because of the lack of communication across communities. The RDA Metadata Directory can begin to address these limitations. The RDA's global platform and cross-disciplinary reach, combined with the capacity to leverage social technology, can support the development of a community-driven and sustainable RDA Metadata Directory. This presentation will demonstrate the latest version of the MSDWG prototype directory and discuss the use cases collected for the various ways this directory can be used.
- Time: 10:15 - 12:15
- Location: Blegen Hall 130
- Chair: Sam Spencer
- Track: Data Infrastructure and Applications
New ICPSR Tools for Data Discovery and Classification
- Presenter: Sanda Ionescu, ICPSR
- Abstract: Capitalizing on rich, standardized DDI-XML metadata, ICPSR continues to develop its suite of tools for data discovery and analysis with new features and applications. Recently, ICPSR launched an innovative tool that enables linking individual variables with concepts to help increase granularity in the comparison of variables and/or questions across studies and series of studies. The tool allows users to create personalized concept lists and tag variables from multiple studies with these concepts; interactive crosswalks display the variable-concept associations to further assist in data analysis, comparison, and harmonization projects. In addition to personal concept lists, it is possible to create public lists so that an organization can apply its own authoritative tagging and make this resource publicly available. The concept tagging tool is integrated with ICPSR's variable search and comparison functions that have also been upgraded with a novel feature allowing retrieval of separate lists of variables measuring different concepts within the same study. We will present and discuss the tagging tool and the enhanced search features using live examples, and will also introduce the public ICPSR classification of the American National Election Studies and General Social Survey collections and the resulting crosswalk displaying the ANES Time Series and the GSS iterations by individual years.
Public APIs: Extending access to the UK Data Service
- Presenter: John Shepherdson, UK Data Archive, University of Essex
- Abstract: The UK Data Service is providing access to its data and metadata holdings by making public some of its web service APIs. These REST APIs facilitate a self-service approach for our data producers and researchers alike, whilst also enabling 3rd party developers to write applications that consume our APIs and present novel and exciting ways of accessing and viewing some of the data collections that we hold. We have put new infrastructure in place to enable the provision of these APIs and have already run an App Challenge (for external developers to build mobile applications against our APIs) and added a data collection usage "leader board" as initial tests of the functionality, capacity, account management, developer documentation and performance aspects of our public APIs. The main infrastructure elements are an API management service, HTTP caching and routing and various API endpoints. The other major consideration was a set of design principles for the APIs so that developers have a consistent and predictable experience. This presentation will elaborate on the key components of the infrastructure and the API design guidelines.
Building a Public Opinion Knowledge Base at the Roper Center
- Presenters: Elise Dunham and Marmar Moussa, Roper Center for Public Opinion Research
- Abstract: A central and ongoing priority of the Roper Center for Public Opinion Research is the development and enhancement of state-of-the-art online retrieval systems that promote the discovery and reuse of public opinion data. It has become clear that foundational changes to the way the Center produces and manages its descriptive metadata throughout the data lifecycle would provide new and more efficient avenues for web application and tool development. In a collaborative effort to solidify the connection between cataloging and retrieval system development goals, the Center is developing a knowledge base system for managing and facilitating access to our vast collection of public opinion datasets. This presentation will provide an overview of the networked system of thesauri and controlled vocabularies that the Center is implementing to create the knowledge base as well as describe the automated classification process the team has developed using machine-learning techniques to repurpose existing metadata and enhance process integration throughout the metadata production workflow.
Colectica Portal vNext: Addressing New Data Discovery Challenges
- Presenter: Dan Smith, Colectica
- Abstract: A data portal designed to present managed research data has many tasks including making the data discoverable, documenting the research data management process, data access policies, standardized metadata, data linkages, longitudinal data support, programmatic access to the data and metadata, and integrating the data with existing systems. Colectica Portal has always solely focused on providing standardized metadata and metadata discovery, while many other tasks were left to other systems. This sole focus on metadata created challenges integrating rich DDI-Lifecycle information stored in the Portal with other applications that do not support the standard. This presentation will describe how the Colectica vNext project addresses these challenges in two distinct ways. One aspect of the vNext project is to present an integrated view of metadata and data. While the Colectica Portal historically presented DDI metadata in a metadata centric fashion, the vNext project creates focus areas centered around surveys, research datasets, and study documentation. This allows users a familiar and user friendly view laid on top of the more advanced metadata descriptions. The second aspect is a focus on data discovery. The Portal vNext project supports a new programmatic API for both metadata and data search, allowing easier integration with existing systems.
- Panel: Lara Cleveland (IPUMS-International), Katie Genadek (IPUMS), Tracy Kugler (TerraPop), Jonathan Schroeder (NHGIS), and Dave Van Riper (TerraPop)
- Abstract: The Minnesota Population Center (MPC) is home to several large social science data infrastructure projects disseminating census and survey data from the U.S. and around the world (www.ipums.org). Data integration is a feature of all of our data infrastructure projects as we strive to make data from different sources, time periods, geographies and formats interoperable. We have tackled variable integration across time and place, geographic harmonization, and data integration across data types such as satellite imagery and census data. In 1995, the IPUMS project capitalized on new web technology to develop one of the earliest integrated systems for electronic dissemination of data and documentation. Since then, we have continued to develop our data access tools and to enhance our metadata. This session will provide an overview of all MPC data infrastructure projects with a closer look at IPUMS, NHGIS and TerraPop. For these projects, we will review integration challenges, discuss our integration philosophy, and provide demonstrations of our data access systems.
New Curation Software: Step-by-Step Preparation of Social Science Data and Code for Publication and Preservation
- Presenters: Limor Peer (Yale University) and Stephanie Wykstra (Innovations for Poverty Action)
- Abstract: As data-sharing becomes more prevalent through the natural and social sciences, the research community is working to meet the demands of managing and publishing data in ways that facilitate sharing. Despite the availability of repositories and research data management plans, fundamental concerns remain about how to best manage and curate data for long-term usability. The value of shared data is very much linked to its usability, and a big question remains: What tools support the preparation and review of research materials for replication, reproducibility, repurposing, and reuse? This paper describes new data curation software designed specifically for reviewing and enhancing research data. It is being developed by two research groups, the Institution for Social and Policy Studies at Yale University and Innovations for Poverty Action, in collaboration with Colectica. The software includes curation steps designed to improve the research materials and thus to enable users to derive greater value from the data: Checking variable-level and study-level metadata, replicating code to reproduce published results, and ensuring that PII is removed. The tool is based upon the best practices of data archives and fits into repository and research workflows. It is open-source, extensible, and will help ensure shared data can be used.
Using CED²AR to Improve Data Documentation and Discoverability within the United States Federal Statistical System Research Data Center (FSS-RDC)
- Presenters: William Block (Cornell University) and Todd K. Gardner (U.S. Census Bureau)
- Abstract: The secure environment within the Federal Statistical System Research Data Center (FSS-RDC) supports qualified researchers in the United States while protecting respondent confidentiality with state-of-the-art tools and processes. While the FSS-RDC contains data from an increasing variety of sources, few standards exist for the format and detail of metadata that RDC researchers have at their disposal. Data producers do not, as a rule, consider future research use of their data; rather, the metadata they produce is oriented toward the immediate objective at hand. Still, the RDCs need to have thorough documentation in order for researchers to carry out their projects. This presentation provides an update on the Comprehensive Extensible Data Documentation and Access Repository (CED²AR), a lightweight DDI-driven web application designed to improve the documentation and discoverability of both public and restricted data from the federal statistical system. CED²AR is part of Cornell's node of the NSF-Census Research Network (NCRN) and is now available within the FSS-RDC environment. CED²AR is being used by researchers not familiar with XML or DDI to document their data, supports variable level searching and browsing across codebooks, passively versions metadata, offers an open API for developers, and is simple to get up and running.
DDI as RDM: Documenting a multi-disciplinary longitudinal study
- Presenter: Barry Radler, University of Wisconsin-Madison
- Abstract: Adhering to research data management principles greatly clarifies the processes used to capture and produce datasets, and the resultant rich metadata provides users of those datasets the information needed to analyze, interpret, and preserve them. These principles are even more important with longitudinal studies that contain thousands of variables and many different data types. MIDUS (Midlife in the United States) is a national longitudinal study of approximately 12,000 Americans that studies aging as an integrated bio-psychosocial process. MIDUS has a broad and unique blend of social, health, and biomarker data collected over 20 years through a variety of modes. For nearly 10 years, MIDUS has relied on DDI to help manage and document these complex research data. In late 2013, the National Institute on Aging funded MIDUS to improve its DDI infrastructure by creating a DDI-based, harmonized data extraction system. Such a system allows researchers to easily create documented and citable data extracts that are directly related to their research questions and allows more time to be spent analyzing data instead of managing it. This presentation will explain the rationale, methods, and results of the project.