Function prediction system based on oligopeptide frequency distance


We have developed a method that assigns proteins of unknown function to COG categories based solely on oligopeptide frequency distance (OPD).
The OPD method is suitable for predicting the functions of proteins with low sequence homology.

The procedure of function prediction is as follows.

  1. Protein fragmentation
    Each protein is split into fragments so that the analysis is independent of sequence length.
  2. Di-, tri-, and tetrapeptide frequency calculation
    Dipeptide frequencies were calculated with the 20 amino acids.
    (The 400 (=20^2) dimensional vectorial data were abbreviated as Di20.)
    Tripeptide frequencies were calculated with the degenerate 11 groups of residues, in which amino acids having similar physico-chemical properties were grouped as the same residue: {V, L, I}, {T, S}, {N, Q}, {E, D}, {K, R, H}, {Y, F, W}, {M}, {P}, {C}, {A} and {G}.
    (The 1331 (=11^3) dimensional vectorial data were abbreviated as Tri11.)
    Tetrapeptide frequencies were calculated with the degenerate 6 groups of residues, in which amino acids having similar physico-chemical properties were grouped as the same residue: {V, L, I, M}, {T, S, P, G, A}, {E, D, N, Q}, {K, R, H}, {Y, F, W} and {C}.
    (The 1296 (=6^4) dimensional vectorial data were abbreviated as Tetra6.)
  3. Direct comparison of the di-, tri-, and tetrapeptide frequencies of each fragment with the oligopeptide frequency data of all proteins in the database
    The Euclidean distance between each fragment and each database protein is calculated, and the protein with the smallest distance is taken as the prediction candidate for that fragment.
  4. Selection of final prediction function candidates
    Among the functions selected for the individual fragments, a function assigned to more than 60% of the fragments becomes a final candidate.
  5. Final prediction
    The protein function is predicted from the final candidates according to the final prediction conditions.
Please download the program for this protein function prediction system.

[Reference]
Takashi Abe, Ryo Ikarashi, Masaya Mizoguchi, Masashi Otake, Toshimichi Ikemura. A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance. Genes & Genetic Systems, 95, 11-19, 2020