1. User Guide
1.1. Overview
You can run Rapide Peptides Generator using the standalone version called:
rpg
You can obtain help by using:
rpg --help
1.2. Installation
1.2.1. From pip
The suggested way of installing the latest RPG version is through pip:
pip3 install rpg
Then you can use:
rpg --help
1.2.2. From source code
RPG is coded in Python. To manually install it from source, get the source and install RPG using:
git clone https://gitlab.pasteur.fr/nmaillet/rpg/
cd rpg
python setup.py install
1.2.3. Using without installation
You can download the source code from Pasteur’s Gitlab: https://gitlab.pasteur.fr/nmaillet/rpg/.
In order to directly run RPG from source, you need to copy file tests/context.py
into rpg
folder.
Then, uncomment line 42 of rpg/RapidPeptidesGenerator.py
. Modify:
#from context import rpg
To:
from context import rpg
Then, from the main RPG directory, use:
python3 rpg/RapidPeptidesGenerator.py -\\-help
Warning
Using without installation is not recommended, as you need all requirements of requirements.txt installed locally and you may encounter issues with Sphinx autodoc or other unwanted behaviors.
1.3. Classical use
Here are some typical examples of RPG usage.
1.3.1. Getting help
To access build-in help, use:
rpg -\\-help
1.3.2. Listing enzymes
To list all available enzymes, use:
rpg -l
1.3.3. Performing digestion
There are two digestion modes in RPG. In sequential mode, each protein will be digested by each enzyme, one by one. In concurrent mode, all enzymes are present at the same time during digestion. See Digestion modes for more information.
1.3.3.1. Sequential digestion of one sequence
To perform sequential digestion of the sequence “QWSDORESDF” with enzymes 2 and 5 and store results in output_file.fasta, use:
rpg -s QWSDORESDF -o output_file.fasta -e 2 5
1.3.3.2. Sequential digestion of a (multi)fasta file
To perform sequential digestion of input_file.fasta with enzymes 2 and 5 and store results in output_file.fasta, use:
rpg -i input_file.fasta -o output_file.fasta -e 2 5
1.3.3.3. Concurrent digestion of a (multi)fasta file
To perform concurrent digestion of input_file.fasta with enzymes 2 and 5 and store results in output_file.fasta, use:
rpg -i input_file.fasta -o output_file.fasta -e 2 5 -d c
1.3.4. Adding a new enzyme
To extend the list of the available enzymes and add a new one, use:
rpg -a
See Creating a new enzyme for more information.
1.4. Options
Here are all available options in RPG:
-h, -\-help: Show this help message and exit.
-a, -\-addenzyme: Add a new user-defined enzyme. See Creating a new enzyme for more information.
-b, -\-delenzyme: Delete a user-define enzyme. See Deleting user-defined enzymes for more information.
-d, -\-digest: Digestion mode. Either ‘s’, ‘sequential’, ‘c’ or ‘concurrent’ (default: s). See Digestion modes for more information.
-e, -\-enzymes: Enzyme(s) id number or name to use (e.g. -e 1 Asp-N 10 to use enzymes 1, Asp-N and 10). Use -l first to get enzyme ids. See Enzyme definitions for more information.
-f, -\-fmt: Output file format. Either ‘fasta’, ‘csv’, or ‘tsv’ (default: fasta). See Output for more information.
-i, -\-inputdata: Input file, in (multi)fasta / fastq format (gzipped or not). See Sequential digestion of a (multi)fasta file for example.
-s, -\-sequence: Input a single protein sequence without commentary. See Sequential digestion of one sequence for example.
-l, -\-list: Display the list of available enzymes.
-m, -\-miscleavage: Percentage of miscleavage, between 0 and 100, by enzyme(s). It should be in the same order as the -\-enzymes options (e.g. -m 15 5.2 10). It works only for sequential digestion (default: 0). See Miscleavage for more information.
-t, -\-theoretical: Maximum number of miscleavages, by enzyme(s). RPG will produce all theoretical peptides allowing at most -t miscleavages. It should be in the same order than -\-enzymes options (e.g. -t 1 3 2). Only for sequential digestion (default: 0). See All theoretical miscleavage for more information.
-n, -\-noninteractive: Non-interactive mode. No standard output, only error(s) (-\-quiet enable, overwrite -v). If output filename already exists, output file will be overwritten. See Non-interactive mode for more information.
-x, -\-cleave: Cleavage site(s) when adding enzyme in non-interactive mode (e.g. -x “(D or E,)”, similar to -x “(D,)” “(E,)”. See Creating a new enzyme for more information.
-z, -\-exception: Exception(s) when adding enzyme in non-interactive mode (e.g. -z “(D,)(W)” “(D,)(A)”. See Creating a new enzyme for more information.
-y, -\-enzname: Enzyme’s name when adding enzyme in non-interactive mode. See Creating a new enzyme for more information.
-o, -\-outputfile: Result file to output resulting peptides (default ‘./peptides.xxx’ depending of -\-fmt).
-p, -\-pka: Define pKa values. Either ‘ipc2’, ‘stryer’ or ‘ipc’ (default: ipc2). IPC2 values come from IPC_peptide in Supplementary Table S1, Stryer values from Biochemistry Stryer, 7th edition and IPC values from IPC_peptide.
-w, -\-mass: Define atomic mass values. Either ‘avg’ for average mass or ‘mono’ for monoisotopic mass.
-r, -\-randomname: Random (not used) output name file. See Random names for more information.
-c , -\-processes: Number of parallel processes to use (default: 1)
-q, -\-quiet: No standard output, only error(s).
-v, -\-verbose: Increase output verbosity. -vv will increased more than -v. See Verbosity for more information.
-\-version: Show program’s version number and exit.
1.5. Digestion modes
There are two digestion modes in RPG. In ‘sequential’ mode, each protein will be digested by each enzyme, one by one. Launching 3 times RPG on the same protein with 3 different enzymes or launching one time RPG on the protein with the 3 enzymes in ‘sequential’ mode leads to exactly the same result.
In concurrent mode, all enzymes are present at the same time during digestion and exposure time is supposed to be infinite, i.e. all possible cleavages will occur (there is no miscleavage). In this mode, the cleavage of a first enzyme can make available the cleavage site of another enzyme.
Let’s define two enzymes. The first is called ‘afterP’ (id 28) and cleaves after P. The second is called ‘afterK’ (id 29) and cleaves after K if there is no P just before. Digesting ‘PKPKPKPK’ using those two enzymes in sequential mode gives the following result (see Output for more information):
$ rpg -s PKPKPKPK -e 28 29
>Input_0_afterP_1_1_115.13198_5.54
P
>Input_1_afterP_3_2_243.30608_9.4
KP
>Input_2_afterP_5_2_243.30608_9.4
KP
>Input_3_afterP_7_2_243.30608_9.4
KP
>Input_4_afterP_8_1_146.18938_9.4
K
>Input_0_afterK_0_8_919.17848_11.27
PKPKPKPK
‘afterP’ cleaves as expected and ‘afterK’ is not able to cleave anything.
Digesting ‘PKPKPKPK’ using those two enzymes in concurrent mode gives the following result:
$ rpg -s PKPKPKPK -e 28 29 -d c
>Input_0_afterP-afterK_1_1_115.13198_5.54
P
>Input_1_afterP-afterK_2_1_146.18938_9.4
K
>Input_2_afterP-afterK_3_1_115.13198_5.54
P
>Input_3_afterP-afterK_4_1_146.18938_9.4
K
>Input_4_afterP-afterK_5_1_115.13198_5.54
P
>Input_5_afterP-afterK_6_1_146.18938_9.4
K
>Input_6_afterP-afterK_7_1_115.13198_5.54
P
>Input_7_afterP-afterK_8_1_146.18938_9.4
K
Here, we have to understand that ‘afterP’ cleaves at the same positions as in sequential mode and the products (mostly ‘KP’) are then cleaved by ‘afterK’. Indeed, there is no more P before K, making ‘afterK’ able to cleave.
Default mode is ‘sequential’. Reminder: you can input miscleavage values only for this mode.
1.6. Miscleavage
Sometimes an enzyme does not cleave at a given position even if requirements are fulfilled. This event is called miscleavage and can have biological, chemical or physical origins. To take into account this behavior in RPG, one can assign a miscleavage value to an enzyme, expressed as a percentage.
For example, using:
rpg -s QWSDORESDF -e 1 2 3 -m 1.4 2.6
will assign a miscleavage probability of 1.4% to enzyme 1, a miscleavage probability of 2.6% to enzyme 2 and a miscleavage probability of 0% to enzyme 3 (default behavior). For enzyme 1, each cleavage will then have a probability of 0.014 to not occur.
1.7. All theoretical miscleavage
You can compute all theoretical peptides with at most n miscleavages. RPG will then produce all peptides without any miscleavage, with 1 miscleavage, with 2 miscleavage, etc, up to n.
For example:
$ rpg -s AAWBBBWCCCWD -e 3 -t 1
>Input_0_BNPS-Skatole_3_3_346.38608_5.46
AAW
>Input_1_BNPS-Skatole_7_4_204.22848_5.46
BBBW
>Input_2_BNPS-Skatole_11_4_513.64488_5.45
CCCW
>Input_3_BNPS-Skatole_12_1_133.10388_3.48
D
>Input_4_BNPS-Skatole_7_7_532.59928_5.46
AAWBBBW
>Input_5_BNPS-Skatole_11_8_699.85808_5.45
BBBWCCCW
>Input_6_BNPS-Skatole_12_5_628.73348_3.48
CCCWD
>Input_7_BNPS-Skatole_11_11_1028.22888_5.45
AAWBBBWCCCW
>Input_8_BNPS-Skatole_12_9_814.94668_3.48
BBBWCCCWD
1.8. Non-interactive mode
Option -n, -\-noninteractive force RPG to not print any standard output, only error(s) are displayed in the shell. It enable ‘-\-quiet’ option and overwrites -\-verbose option. If output filename already exists, the output file will be systematically overwritten. This option is mostly used in cluster or pipeline when user does not want RPG to wait for input or display anything but errors.
1.9. Output
Output of RPG contains the following information in one line for each generated peptide, in this order:
Header of original sequence or ‘Input’ if the sequence is directly inputed in RPG, i.e., -s
Sequential numbering (starting from 0) of out-coming peptides for each of original sequence
Enzyme name used to obtain this peptide
Cleavage position on the original sequence (0 if no cleavage occurs)
Peptide size
Peptide molecular weight estimation
Peptide isoelectric point estimation (pI)
Then, on the next line:
Peptide sequence
Peptide molecular weight approximation is computed as the addition of average isotopic masses of each amino acid present in the peptide. Then the average isotopic mass of one water molecule is added to it. Molecular weight values are given in Dalton (Da). It does not take into consideration any digestion-induced modifications. Masses are either 4 digits precision (average masses) or 9 digits precision (monoisotopic masses).
The isoelectric point is computed by solving Henderson–Hasselbalch equation using binary search. It is based on Lukasz P. Kozlowski work (http://isoelectric.org/index.html).
The default output is in multi-fasta format. The header then summarizes all this information. For example, on the following multi-fasta result:
>Input_0_Asp-N_3_3_419.4373_5.54
QWS
>Input_1_Asp-N_8_5_742.7868_4.16
...
we can see that a sequence was directly inputed in RPG (Input), the first peptide (0) was obtained with Asp-N and this enzyme cleaved after the 3rd amino acid in the original sequence. The peptide has a size of 3 amino acids, a molecular weight estimated at 419.43738 Da and a theoretical isoelectric point of 5.54. The full sequence is then written (QWS). The output of the remaining peptides follows in the same format.
More information can be outputted using Verbosity option.
1.10. Random names
Option -r, -\-randomname force RPG to use a random name for output file. When using it, RPG will not ask user output file name nor location. The output file will be created in the working directory. This option is generally used for testing or automatic tasks.
1.11. Verbosity
Verbosity can be increased or decreased. The output file is not affected by -v or -q options.
With default verbosity level (no -v nor -q option), the output is:
$ rpg -s QWSDORESDF -e 1
>Input_0_Asp-N_3_3_419.43738_5.54
QWS
>Input_1_Asp-N_8_5_742.78688_4.16
DORES
>Input_2_Asp-N_10_2_280.28048_3.6
DF
Increasing verbosity by one, i.e. using -v, adds information about used options. For example:
$ rpg -s QWSDORESDF -e 1 -v
Warning: File 'peptides.fasta' already exit!
Overwrite it? (y/n)
y
Input: QWSDORESDF
Enzyme(s) used: ['Asp-N']
Mode: sequential
miscleavage ratio: [0]
Output file: /Users/nmaillet/Prog/RPG/peptides.fasta
>Input_0_Asp-N_3_3_419.43738_5.54
QWS
>Input_1_Asp-N_8_5_742.78688_4.16
DORES
>Input_2_Asp-N_10_2_280.28048_3.6
DF
Increasing verbosity by two, i.e. using -vv, also adds statistics about each of the digested proteins. For example:
$ rpg -s QWSDORESDF -e 1 -vv
Warning: File 'peptides.fasta' already exit!
Overwrite it? (y/n)
y
Input: QWSDORESDF
Enzyme(s) used: ['Asp-N']
Mode: sequential
miscleavage ratio: [0]
Output file: /Users/nmaillet/Prog/RPG/peptides.fasta
Number of cleavage: 2
Cleavage position: 3, 8
Number of miscleavage: 0
miscleavage position:
miscleavage ratio: 0.00%
Smallest peptide size: 2
N terminal peptide: QWS
C terminal peptide: DF
>Input_0_Asp-N_3_3_419.43738_5.54
QWS
>Input_1_Asp-N_8_5_742.78688_4.16
DORES
>Input_2_Asp-N_10_2_280.28048_3.6
DF
Decreasing verbosity, i.e. using -q option, removes all information but errors. For example:
$ rpg -s QWSDORESDF -e 1 -q
Warning: File 'peptides.fasta' already exit!
Overwrite it? (y/n)
y
1.12. Creating a new enzyme
Option -a, -\-addenzyme allows the user to define new enzymes. An enzyme contains one or several rules and exceptions.
In the following, nomenclature of Schechter and Berger is used. Amino acids before the cleavage site are designated as P1, P2, P3, etc in the N-terminal direction, and as P1’, P2’, P3’, etc in the C-terminal direction. For example, with cleavage site represented as ‘|’:
...P3-P2-P1-|-P1'-P2'-P3'...
In RPG, this nomenclature is represented as:
...(P3)(P2)(P1)(,)(P1')(P2')(P3')...
1.12.1. Definition of rules
A rule specifies which amino acid is targeted by the enzyme, the cleavage position (i.e. before or after the targeted amino acid) and optionally the surrounding context. Each amino acid must be included in parentheses, i.e. ‘(’ and ‘)’ and the cleavage position is represented by a comma, i.e. ‘,’. The comma must always be directly before or after an closing or opening parenthesis, respectively.
For example, to define a cleavage occurring before A, one must input:
(,A)
To define a cleavage occurring after B, one must input:
(B,)
The surrounding context is specified by adding other amino acids, before or after the targeted one. For example, to define a cleavage occurring before A, position P1’, preceded by B in position P1, C in position P3 and followed by D in position P2’, one must input:
(C)()(B)(,A)(D)
Note that this enzyme will only cleave if it finds the motif C*BAD, where * could be any amino acid. It will not cleave BAD, nor C*BA, BA, etc. For example, creating and using enzyme rpg_example_userguide (enzyme id 46):
$ rpg -a
Name of the new enzyme?
rpg_example_userguide
Create a cleaving rule (c) or an exception (e)? (q) to quit:
c
Write your cleaving rule, (q) to quit:
(C)()(B)(,A)(D)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
q
Add another enzyme? (y/n)
n
$ rpg -s CWBADE -e 46
>Input_0_rpg_example_userguide_3_3_307.36728_5.46
CWB
>Input_1_rpg_example_userguide_6_3_333.29818_3.4
ADE
$ rpg -s FAD -e 46
>Input_0_rpg_example_userguide_0_3_351.35928_3.6
FAD
In order for this enzyme to also cleave before AD (before A in P1’ followed by D in P2’ ), on top of the previous rule, one has to define one more rule in RPG:
(,A)(D)
(C)()(B)(,A)(D)
It is important to note that for each enzyme, it is enough that one of the rule is broken for the cleavage to not occur. In this example, the defined enzyme will not cleave BAD, as it is specified that it will cleave before A preceded by B in P1 if there is C in `P3`. Identically, it will not cleave C*BA*, as D is required in P2’ for both rules.
$rpg -a
Name of the new enzyme?
rpg_example_userguide
Create a cleaving rule (c) or an exception (e)? (q) to quit:
c
Write your cleaving rule, (q) to quit:
(,A)(D)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
c
Write your cleaving rule, (q) to quit:
(C)()(B)(,A)(D)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
q
Add another enzyme? (y/n)
n
$ rpg -s CWBADE -e 46
>Input_0_rpg_example_userguide_3_3_307.36728_5.46
CWB
>Input_1_rpg_example_userguide_6_3_333.29818_3.4
ADE
$ rpg -s FAD -e 46
>Input_0_rpg_example_userguide_1_1_165.19188_5.54
F
>Input_1_rpg_example_userguide_3_2_204.18268_3.6
AD
$ rpg -s BAD -e 46
>Input_0_rpg_example_userguide_0_3_204.18268_3.6
BAD
The order of inputted rules is not relevant. In other words, this enzyme:
(,A)(D)
(C)()(B)(,A)(D)
and this second one:
(C)()(B)(,A)(D)
(,A)(D)
are identical.
It is possible to define none-related cleavage rules for the same enzyme, for example:
(G,)(G)
(P)(W,)(E)(T)
This enzyme will cleave after G (position P1) followed by G in P1’ and also after W (P1) preceded by P in P2 and followed by E in P1’ and T in P2’.
Note that each rule must concern only one cleavage site. It is not possible to input rule like:
(A,)(B,)
This would define an enzyme cleaving after A in P1 followed by B in P1’ but also cleaving after B in P1 preceded by A in P2. The proper way to input this is by using two separate rules:
(A,)(B)
(A)(B,)
However, it is possible to write rules in a more efficient way as explained in Easily writing complex enzymes.
1.12.2. Definition of exceptions
An exception specifies when a cleavage should not occur. Exceptions must always be linked to a rule.
For example, to define a cleavage occurring before A (P1’ ), one must input:
(,A)
Exceptions can then be inputted. For example, to define “a cleavage occurs before A, except when P is in P2’ “, the following exception needs to be added:
(,A)(P)
This enzyme will always cleave before A when not followed by P:
rpg -a
Name of the new enzyme?
rpg_example_userguide
Create a cleaving rule (c) or an exception (e)? (q) to quit:
c
Write your cleaving rule, (q) to quit:
(,A)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
e
Write your exception rule, (q) to quit:
(,A)(P)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
q
Add another enzyme? (y/n)
n
rpg -s CWBADE -e 46
>Input_0_rpg_example_userguide_3_3_307.36728_5.46
CWB
>Input_1_rpg_example_userguide_6_3_333.29818_3.4
ADE
rpg -s CWBAPE -e 46
>Input_0_rpg_example_userguide_0_6_604.67828_3.6
CWBAPE
It is possible to input complex exceptions. For the previous enzyme, we can add the following exception:
(G)(T)()(,A)()(F)
This enzyme will always cleave before A (P1’ ) when not followed by P (P2’ ) or preceded by G in P3, T in P2, by any amino acid in P1 and P2’, and F in P3’ at the same time:
rpg -a
Name of the new enzyme?
rpg_example_userguide
Create a cleaving rule (c) or an exception (e)? (q) to quit:
c
Write your cleaving rule, (q) to quit:
(,A)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
e
Write your exception rule, (q) to quit:
(,A)(P)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
e
Write your exception rule, (q) to quit:
(G)(T)()(,A)()(F)
Create a cleaving rule (c) or an exception (e)? (q) to quit:
q
Add another enzyme? (y/n)
n
rpg -s CWBADE -e 46
>Input_0_rpg_example_userguide_3_3_307.36728_5.46
CWB
>Input_1_rpg_example_userguide_6_3_333.29818_3.4
ADE
rpg -s CWBAPE -e 46
>Input_0_rpg_example_userguide_0_6_604.67828_3.6
CWBAPE
rpg -s GTBAMF -e 46
>Input_0_rpg_example_userguide_0_6_525.62028_5.54
GTBAMF
rpg -s GTBAPE -e 46
>Input_0_rpg_example_userguide_0_6_473.48328_3.6
GTBAPE
rpg -s GTBAME -e 46
>Input_0_rpg_example_userguide_3_3_176.17228_5.54
GTB
>Input_1_rpg_example_userguide_6_3_349.40218_3.6
AME
It is important to understand that an exception should always be linked to a rule. If one inputs this rule:
(A,)
followed by this exception:
(B,)(C)
the exception will not be taken into account. This enzyme will just always cleave after A.
1.12.3. Easily writing complex enzymes
To make enzyme creation easier to use, two tricks are available.
The first one simplifies the definition of enzymes cleaving before and after a given amino acid. Defining an enzyme cleaving, for example, before and after A, can be done with two rules:
(,A)
(A,)
or simply using:
(,A,)
The second trick is the use of the keyword or. This allows multiple possibilities for on position. For example:
(,A or B)
is equivalent to:
(,A)
(,B)
Warning
do not input (,A or ,B)
, as a comma must always directly preceding or following a parenthesis.
Those two tricks help on complex enzymes. For example, Pepsin pH 1.3 preferentially cleaves around F or L, sometimes before, sometimes after, depending on the context. More specifically, it will not cleave before F or L in P1’ followed by P in P2’. It will not cleave before F or L in P1’ preceded by R in P1 or P in P2 or H/K/R in P3. It will not cleave after F or L in P1 followed by P in P2’. And it will not cleave after F or L in P1 preceded by P in P2 or H/K/R in P3.
It can be defined either by:
cleaving rules:
(F,)
(L,)
(,F)
(,L)
exception rules:
(,F)(P)
(,L)(P)
(R)(,F)
(R)(,L)
(P)()(,F)
(P)()(,L)
(H)()()(,F)
(K)()()(,F)
(R)()()(,F)
(H)()()(,L)
(K)()()(,L)
(R)()()(,L)
(F,)()(P)
(L,)()(P)
(P)(F,)
(P)(L,)
(H)()(F,)
(K)()(F,)
(R)()(F,)
(H)()(L,)
(K)()(L,)
(R)()(L,)
or, in a condensed way:
cleaving rule:
(,F or L,)
exception rules:
(,F or L)(P)
(R)(,F or L)
(P)()(,F or L)
(H or K or R)()()(,F or L)
(F or L,)()(P)
(P)(F or L,)
(H or K or R)()(F or L,)
Those two definitions are completely equivalent for RPG.
1.12.4. Example of enzymes
All available enzymes are in Enzyme definitions, including their RPG’s definition.
1.12.5. Non-interactive mode
Using -n, -\-noninteractive and -a, -\-addenzyme options together is possible, but then requires to also use -x, -\-cleave option to define cleaving site(s), -y, –enzname option to define the name of the enzyme and optionally -z, –exception option to define exceptions.
1.13. Deleting user-defined enzymes
All user-defined enzymes are stored in ~/rpg_user.py
. This file is automatically generated by RPG and written in Python.
Each enzyme definition starts with:
# User-defined enzyme <name of the enzyme>
and finishes with:
CPT_ENZ += 1
followed by 3 blank line.
To remove an enzyme, use the option -b, -\-delenzyme with the id or name of the user-defined enzyme to remove. It is not possible to remove predefined enzymes of RPG.
To remove all user-defined enzymes, just delete ~/rpg_user.py
file. It will be created again (empty) at the next launch of RPG.
Obviously, all deleted enzymes can not be recovered. If one wants to use them again they will need to be redefined in RPG, using -a, -\-addenzyme option.