Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool

Data curation is the process of managing data to make it available for reuse and preservation and to allow FAIR (findable, accessible, interoperable, reusable) uses. It is an important part of the research lifecycle, as researchers are often either required by funders or generally encouraged to preserve their datasets and make them discoverable and reusable. This has become especially important as Open Access (OA) policies are implemented at many institutions across the nation. An efficient data repository and its data curation practices play key roles in facilitating research data discovery and reuse. In this article, we briefly discuss the local institutional repository at Penn State University and the general data curation practices we adopt for deposited files and datasets. We then focus on a data analytics tool that has recently been applied to extract tabular data from PDF files. This enhances the existing data curation practices by adding tabular data to deposits with PDF files, where tables are often embedded and not easily reused.

This material is brought to you by eScholarship@UMassChan. It has been accepted for inclusion in Journal of eScience Librarianship by an authorized administrator of eScholarship@UMassChan. For more information, please contact Lisa.Palmer@umassmed.edu.

Metadata

Work Title Data Curation in Practice: Extract Tabular Data from PDF Files Using a Data Analytics Tool
Subtitle Extract Tabular Data from PDF Files Using a Data Analytics Tool
Access
Open Access
Creators
  1. Xuying Xin
  2. Allis Choi
Keyword
  1. Data Curation, Institutional Repository, Open Science, Curation Tools, Power BI, Tableau, FAIR reuse, Data Share, Open Access
License CC BY 4.0 (Attribution)
Work Type Article
Acknowledgments
  1. Keith Cheng, Ally Laird, Paulina Krys, Tara Anthon, Hannah Hadley, and Cynthia-Hudson Vitale
Publisher
  1. JeSLIB
Publication Date December 9, 2021
Language
  1. English
Publisher Identifier (DOI)
  1. 10.7191/jeslib.2021.1209
Geographic Area
  1. PA
Related URLs
Deposited April 27, 2022

Collections

This resource is currently not in any collection.

Work History

Version 1
published

  • Created
  • Updated
  • Updated
  • Updated Acknowledgments
    Acknowledgments
    • Keith Cheng, Ally Laird, Paulina Krys, and Tara Anthon
  • Added Creator Xuying Xin
  • Added Creator Allis Choi
  • Added Data Curation in Practice.pdf
  • Updated Acknowledgments, License
    Acknowledgments
    • Keith Cheng, Ally Laird, Paulina Krys, and Tara Anthon
    • Keith Cheng, Ally Laird, Paulina Krys, Tara Anthon, Hannah Hadley, and Cynthia-Hudson Vitale
    License
    • https://creativecommons.org/licenses/by/4.0/
  • Published
  • Updated