Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning

Machine learning can infer how protein sequence maps to function without requiring a detailed understanding of the underlying physical or biological mechanisms. However, it is challenging to apply existing supervised learning frameworks to large-scale experimental data generated by deep mutational scanning (DMS) and related methods. DMS data often contain high-dimensional and correlated sequence variables, experimental sampling error and bias, and missing data. Notably, most DMS data do not contain examples of negative sequences, making it challenging to directly estimate how sequence affects function. Here, we develop a positive-unlabeled (PU) learning framework to infer sequence-function relationships from large-scale DMS data. Our PU learning method displays excellent predictive performance across ten large-scale sequence-function datasets, representing proteins of different folds, functions, and library types. The estimated parameters pinpoint key residues that dictate protein structure and function. Finally, we apply our statistical sequence-function model to design highly stabilized enzymes.

The quantity of protein sequence-function data is growing rapidly with advances in high-throughput experimentation. Song et al. present a machine learning approach to infer sequence-function relationships from large-scale data generated by deep mutational scanning. The learned models capture important aspects of protein structure and function and can be applied to design new and enhanced proteins.

Metadata

Work Title Inferring Protein Sequence-Function Relationships with Large-Scale Positive-Unlabeled Learning
Access
Open Access
Creators
  1. Hyebin Song
  2. Bennett J. Bremer
  3. Emily C. Hinds
  4. Garvesh Raskutti
  5. Philip A. Romero
License CC BY-NC-ND 4.0 (Attribution-NonCommercial-NoDerivatives)
Work Type Article
Publisher
  1. Elsevier BV
Publication Date January 2021
Publisher Identifier (DOI)
  1. 10.1016/j.cels.2020.10.007
Source
  1. Cell Systems
Deposited September 09, 2021

Work History

Version 1
published

  • Created
  • Added 2020.08.19.257642v1.full-1.pdf
  • Added Creator Hyebin Song
  • Added Creator Bennett J. Bremer
  • Added Creator Emily C. Hinds
  • Added Creator Garvesh Raskutti
  • Added Creator Philip A. Romero
  • Published
  • Updated