PS4: a next-generation dataset for protein single-sequence secondary structure prediction
2023
Omar Peracha
Protein secondary structure prediction is a subproblem of protein folding. A lightweight algorithm capable of accurately predicting secondary structure from only the protein residue sequence could provide useful input for tertiary structure prediction, alleviating the reliance on multiple sequence alignments typically seen in today's best-performing models. Unfortunately, existing datasets for secondary structure prediction are small, creating a bottleneck. We present PS4, a dataset of 18,731 nonredundant protein chains and their respective secondary structure labels. Each chain is identified, and the dataset is nonredundant against other secondary structure datasets commonly seen in the literature. We perform ablation studies by training secondary structure prediction algorithms on the PS4 training set and obtain state-of-the-art accuracy on the CB513 test set in a zero-shot setting.
Bibliographic information
This bibliographic record was provided by the Directory of Open Access Journals