Profile.scan() — Compare a target profile against a database of profiles

scan(profile_list_file, matrix_offset=0.0, profile_format='TEXT', rr_file='$(LIB)/as1.sim.mat', gap_penalties_1d=(-900.0, -50.0), matrix_scaling_factor=0.0069, max_aln_evalue=0.1, aln_base_filename='alignment', score_statistics=True, output_alignments=True, output_score_file=None, pssm_weights_type='HH1', summary_file='ppscan.sum', ccmatrix_offset=-200, score_type='CCMAT', psm=None)
This command scans the given target profile against a database of template profiles and reports significant alignments; the target profile should have been read previously with the Profile.read() command.

All the profiles listed in profile_list_file should be in a format that is understood by Profile.read().

The profile_list_file should contain absolute or relative paths to the individual template profiles, one per line.
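
A list file of this form can be generated with a few lines of Python. The following is only a sketch: the helper name write_profile_list and the '*.prf' file pattern are assumptions made for illustration and are not part of MODELLER.

```python
from pathlib import Path


def write_profile_list(profile_dir, list_file, pattern="*.prf"):
    """Write one profile path per line to list_file and return the paths.

    Hypothetical helper, not part of MODELLER. The '*.prf' pattern is an
    assumption about how the template profiles are named.
    """
    # Sort so that the order of profiles in the list is reproducible.
    paths = sorted(str(p) for p in Path(profile_dir).glob(pattern))
    Path(list_file).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting file name can then be passed as profile_list_file.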

The template profiles can also be assembled into a PSSM database, which can then be read in for scanning. The PSSM database can be created with the Environ.make_pssmdb() command.

For efficiency, it is recommended to read in the template profiles as a database (see the example). An entire PSSM database, even one consisting of tens of thousands of PSSMs, can be read into the memory of an average PC.

See documentation under Profile.read() for help on profile_format.

rr_file is the residue-residue substitution matrix to use when calculating the position-specific scoring matrix (PSSM). The current implementation is optimized only for the BLOSUM62 matrix.

gap_penalties_1d are the gap penalties to use for the dynamic programming. matrix_offset is the value used to offset the substitution matrix (used in the PSSM calculation). ccmatrix_offset is used to offset the scoring matrix during dynamic programming. The optimal values for these parameters are:

  matrix_offset = -450 (for the BLOSUM62 matrix)
  ccmatrix_offset = -100
  gap_penalties_1d = (-700, -70)

max_aln_evalue sets the threshold for the E-values. Alignments with E-values better (that is, lower) than the threshold will be written out.

aln_base_filename sets the base filename for the alignments. The output alignment filenames will be of the form ALN_BASE_FILENAME_XXXX.ali, where XXXX is a 4-digit integer (padded with leading zeros) that is incremented for each alignment, for example alignment_0001.ali.
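
The naming scheme can be reproduced in Python, for instance to locate the alignment files after a scan. This is a minimal sketch: the function name alignment_filename is hypothetical, and the assumption that numbering starts at 1 is inferred only from the alignment_0001.ali example.

```python
def alignment_filename(aln_base_filename, index):
    """Return the expected file name for the alignment numbered `index`.

    Hypothetical helper, not part of MODELLER. Numbering is assumed to
    start at 1, based on the 'alignment_0001.ali' example.
    """
    # Zero-pad the counter to four digits: 1 -> '0001'.
    return f"{aln_base_filename}_{index:04d}.ali"
```

For example, alignment_filename('T3lzt-ppscan', 12) gives 'T3lzt-ppscan_0012.ali'.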

score_statistics is a flag that triggers the calculation of E-values. If set to False, the significance estimates for the alignments will not be calculated. The calculation of alignment significance is similar to that used for Profile.build(). Setting the flag to False can be useful when profile_list_file contains only a very small number of template profiles, too few to calculate reliable statistics. Also see Profile.build().

output_score_file is the name of a file into which to write the raw alignment scores, z-scores and E-values for all the comparisons. (If it is set to None, no such output is written.) The columns in the output file are the following:

  1. Index of the database profile
  2. File name of the database profile
  3. Length of the database profile
  4. Logarithm of the length of the database profile
  5. Alignment score
  6. Length normalized z-score of the alignment
  7. E-Value of the alignment
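
The column layout above maps naturally onto a small record type. The sketch below assumes that the columns are separated by whitespace and that file names contain no spaces; neither is stated above, and the names ScoreRecord and parse_score_line are hypothetical.

```python
from typing import NamedTuple


class ScoreRecord(NamedTuple):
    """One line of the score file, in the column order listed above."""
    index: int
    filename: str
    length: int
    log_length: float
    score: float
    zscore: float
    evalue: float


def parse_score_line(line):
    """Parse one score-file line (assumed to be whitespace-separated)."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    # Columns 1 and 3 are integers, column 2 is a name, columns 4-7 are reals.
    return ScoreRecord(int(fields[0]), fields[1], int(fields[2]),
                       *(float(x) for x in fields[3:]))
```

With records in this form, hits can be sorted or filtered by E-value, for example sorted(records, key=lambda r: r.evalue).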

summary_file is the name of a file into which to output a summary of all the significant alignments. (If it is set to None, no such output is written.) The format of the summary file is the following:

  1. File name of target profile (empty if the profile was created with Alignment.to_profile())
  2. Length of target profile
  3. Number of the first aligned residue of the target profile
  4. Number of the last aligned residue of the target profile

  5. File name of the database profile
  6. Length of the database profile
  7. Number of the first aligned residue of the database profile
  8. Number of the last aligned residue of the database profile

  9. Number of equivalent positions in the alignment
  10. Alignment score
  11. Sequence identity of the alignment
  12. Length normalized z-score of the alignment
  13. E-Value of the alignment

  14. Alignment file name
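
A summary line can be split into named fields along the same lines. This sketch assumes whitespace-separated columns and treats a 13-field line as one whose target file name is empty; it does not handle other variations (for instance, a missing alignment file name). The names SUMMARY_COLUMNS and parse_summary_line are hypothetical, and values are kept as strings to avoid guessing the numeric formats.

```python
# Field names are illustrative labels for the 14 columns listed above.
SUMMARY_COLUMNS = (
    "target_file", "target_length", "target_start", "target_end",
    "db_file", "db_length", "db_start", "db_end",
    "n_equivalent", "score", "seq_identity", "zscore", "evalue",
    "alignment_file",
)


def parse_summary_line(line):
    """Return a dict mapping column names to the (string) field values."""
    fields = line.split()
    if len(fields) == len(SUMMARY_COLUMNS) - 1:
        # Assumption: an empty target file name leaves no token behind.
        fields.insert(0, "")
    if len(fields) != len(SUMMARY_COLUMNS):
        raise ValueError(f"unexpected number of columns: {len(fields)}")
    return dict(zip(SUMMARY_COLUMNS, fields))
```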

If output_alignments is set to False, alignments will not be written out.

In addition, the following details about every significant alignment are also written to the log file (look for lines beginning with '>'):

  1. File name of target profile (empty if the profile was created with Alignment.to_profile())
  2. File name of the database profile
  3. Length of the database profile
  4. Alignment score
  5. Sequence identity of the alignment
  6. Length normalized z-score of the alignment
  7. E-Value of the alignment
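
These lines can be pulled out of a saved log with a simple filter. The sketch below assumes the log has been written to a file and that the '>' marker is the first character of the line; the function name significant_hit_lines is hypothetical.

```python
def significant_hit_lines(log_path):
    """Return the log lines that begin with '>', without the marker.

    Hypothetical helper, not part of MODELLER. Assumes '>' is the very
    first character of each line of interest.
    """
    with open(log_path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]
```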

Example: examples/commands/ppscan.py

# Example for: Profile.scan()

from modeller import *

env = Environ()

# First create a database of PSSMs
env.make_pssmdb(profile_list_file = 'profiles.list',
                matrix_offset     = -450,
                rr_file           = '${LIB}/blosum62.sim.mat',
                pssmdb_name       = 'profiles.pssm',
                profile_format    = 'TEXT',
                pssm_weights_type = 'HH1')

# Read in the target profile
prf = Profile(env, file='T3lzt-uniprot90.prf', profile_format='TEXT')

# Read the PSSM database
psm = PSSMDB(env, pssmdb_name = 'profiles.pssm', pssmdb_format = 'text')

# Scan against all profiles in the 'profiles.list' file
# The score_statistics flag is set to False since there are not
# enough database profiles to calculate statistics.
prf.scan(profile_list_file = 'profiles.list',
         psm               = psm,
         matrix_offset     = -450,
         ccmatrix_offset   = -100,
         rr_file           = '${LIB}/blosum62.sim.mat',
         gap_penalties_1d  = (-700, -70),
         score_statistics  = False,
         output_alignments = True,
         output_score_file = None,
         profile_format    = 'TEXT',
         max_aln_evalue    = 1,
         aln_base_filename = 'T3lzt-ppscan',
         pssm_weights_type = 'HH1',
         summary_file      = 'T3lzt-ppscan.sum')