Topic modeling data for research data management and data curation abstracts
Research articles on research data management and data curation were collected on December 19, 2018, through the Library Literature and Information Science Full Text database. Searching for the phrase “research data management” (in quotation marks) retrieved 106 scholarly articles in which the phrase appeared somewhere in the article’s metadata (e.g., in the title, abstract, or keywords); searching for “data curation” (in quotation marks) retrieved 111 scholarly articles in which that phrase likewise appeared somewhere in the metadata. A publicly accessible version of this search is available through our Zotero group. Fifteen articles appeared in both the “research data management” and the “data curation” sets; these were retained for analysis in both sets. Results were limited to articles in scholarly (peer-reviewed) journals, excluding professional journals, book reviews, and conference papers. The two searches returned 217 articles in total (202 distinct articles when the 15 shared ones are counted once), all of which included an abstract.
The topics were isolated for analysis using the MAchine Learning for LanguagE Toolkit (MALLET) implementation of latent Dirichlet allocation (LDA). MALLET is a software package that provides statistical natural language processing, document classification, clustering, topic modeling, and information extraction tools for text analysis. For this analysis, the following MALLET settings were used:
- No. of topics: 35
- Stopwords: removed
- No. of iterations: 200
- No. of topic words printed: 20
- Topic proportion threshold: 0.05
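
For readers who want to try a comparable run, the settings above can be expressed as MALLET command-line options. The following is a minimal sketch, not a record of how the original analysis was executed: the source does not state whether MALLET was run from the command line or through a graphical front end, so the mapping of each setting to a command-line flag is an assumption. The directory name, file names, output prefix, and the `bin/mallet` launcher path are all hypothetical, and the sketch assumes one abstract per plain-text file. The code only builds and prints the commands; it does not run MALLET.

```python
import shlex
from typing import List

# Settings taken from the list above.
NUM_TOPICS = 35
NUM_ITERATIONS = 200
NUM_TOP_WORDS = 20
DOC_TOPICS_THRESHOLD = 0.05


def build_import_command(input_dir: str, corpus_file: str,
                         mallet_bin: str = "bin/mallet") -> List[str]:
    """Build the command that converts a directory of text files into
    a MALLET corpus file, removing stopwords during import."""
    return [
        mallet_bin, "import-dir",
        "--input", input_dir,        # hypothetical: one abstract per file
        "--output", corpus_file,
        "--keep-sequence",           # topic modeling needs token sequences
        "--remove-stopwords",        # "Stopwords: removed"
    ]


def build_train_command(corpus_file: str, out_prefix: str,
                        mallet_bin: str = "bin/mallet") -> List[str]:
    """Build the LDA training command with the listed settings.
    The flag-to-setting mapping is an assumption (see lead-in)."""
    return [
        mallet_bin, "train-topics",
        "--input", corpus_file,
        "--num-topics", str(NUM_TOPICS),
        "--num-iterations", str(NUM_ITERATIONS),
        "--num-top-words", str(NUM_TOP_WORDS),
        "--doc-topics-threshold", str(DOC_TOPICS_THRESHOLD),
        "--output-topic-keys", f"{out_prefix}_keys.txt",
        "--output-doc-topics", f"{out_prefix}_composition.txt",
    ]


if __name__ == "__main__":
    # Hypothetical names; print the commands for inspection only.
    print(shlex.join(build_import_command("abstracts", "abstracts.mallet")))
    print(shlex.join(build_train_command("abstracts.mallet", "rdm")))
```

To actually run the commands, the printed lines could be pasted into a shell from the MALLET installation directory, or each list could be passed to `subprocess.run(..., check=True)`. Because LDA training is randomized, a rerun would not be expected to reproduce the original topics exactly.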