Renumbering Antibody Sequences
One of the most complex parts of working with antibody sequences is that several numbering conventions exist. Depending on the scheme, the framework and CDR regions are defined at different positions. SADIE provides a simple interface for renumbering antibody sequences to a common numbering scheme. We borrow heavily from the Antigen receptor Numbering And Receptor ClassificatIon (ANARCI) tool.
Single Sequence Annotation
# Use the Renumbering module
from sadie.renumbering import Renumbering
from sadie.renumbering.result import NumberingResults

# define a single sequence
vrc26_seq = "QKQLVESGGGVVQPGRSLTLSCAASQFPFSHYGMHWVRQAPGKGLEWVASITNDGTKKYHGESVWDRFRISRDNSKNTLFLQMNSLRAEDTALYFCVRDQREDECEEWWSDYYDFGKELPCRKFRGLGLAGIFDIWGHGTMVIVS"


# wrap in a function so we can use multiprocessing
def run() -> NumberingResults:
    # set up the API object
    renumbering_api = Renumbering(scheme="chothia", region_assign="imgt", run_multiproc=True)
    # run the sequence, given a sequence ID and a sequence, and get the numbering table
    numbering_table = renumbering_api.run_single("VRC26.27", vrc26_seq)
    # print the resulting table
    print(numbering_table)
    # return the table so that the function matches its annotation
    return numbering_table


if __name__ == "__main__":
    run()
Id | sequence | domain_no | hmm_species | chain_type | e-value | score | seqstart_index | seqend_index | identity_species | v_gene | v_identity | j_gene | j_identity | Chain | Numbering | Insertion | Numbered_Sequence | scheme | region_definition | allowed_species | allowed_chains | fwr1_aa_gaps | fwr1_aa_no_gaps | cdr1_aa_gaps | cdr1_aa_no_gaps | fwr2_aa_gaps | fwr2_aa_no_gaps | cdr2_aa_gaps | cdr2_aa_no_gaps | fwr3_aa_gaps | fwr3_aa_no_gaps | cdr3_aa_gaps | cdr3_aa_no_gaps | fwr4_aa_gaps | fwr4_aa_no_gaps | leader | follow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VRC26.27 | QKQLVESGGGVVQPGRSLTLSCAASQFPFSHYGMHWVRQAPGKGLEWVASITNDGTKKYHGESVWDRFRISRDNSKNTLFLQMNSLRAEDTALYFCVRDQREDECEEWWSDYYDFGKELPCRKFRGLGLAGIFDIWGHGTMVIVS | 0 | human | H | 1.65353e-43 | 134.25 | 0 | 144 | human | IGHV3-30*03 | 0.8 | IGHJ3*02 | 0.64 | H | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 82, 82, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112] | [..., 'a', 'b', 'c', '', '', '', '', '', '', '', '', '', '', '', ''] | ['Q', 'K', 'Q', 'L', 'V', 'E', 'S', 'G', 'G', 'G', 'V', 'V', 'Q', 'P', 'G', 'R', 'S', 'L', 'T', 'L', 'S', 'C', 'A', 'A', 'S', 'Q', 'F', 'P', 'F', 'S', 'H', 'Y', 'G', 'M', 'H', 'W', 'V', 'R', 'Q', 'A', 'P', 'G', 'K', 'G', 'L', 'E', 'W', 'V', 'A', 'S', 'I', 'T', 'N', 'D', 'G', 'T', 'K', 'K', 'Y', 'H', 'G', 'E', 'S', 'V', 'W', 'D', 'R', 'F', 'R', 'I', 'S', 'R', 'D', 'N', 'S', 'K', 'N', 'T', 'L', 'F', 'L', 'Q', 'M', 'N', 'S', 'L', 'R', 'A', 'E', 'D', 'T', 'A', 'L', 'Y', 'F', 'C', 'V', 'R', 'D', 'Q', 'R', 'E', 'D', 'E', 'C', 'E', 'E', 'W', 'W', 'S', 'D', 'Y', 'Y', 'D', 'F', 'G', 'K', 'E', 'L', 'P', 'C', 'R', 'K', 'F', 'R', 'G', 'L', 'G', 'L', 'A', 'G', 'I', 'F', 'D', 'I', 'W', 'G', 'H', 'G', 'T', 'M', 'V', 'I', 'V', 'S'] | chothia | imgt | human | H,K,L | QKQLVESGGGVVQPGRSLTLSCAAS | QKQLVESGGGVVQPGRSLTLSCAAS | QFPFSHYG | QFPFSHYG | MHWVRQAPGKGLEWVAS | MHWVRQAPGKGLEWVAS | ITNDGTKK | ITNDGTKK | YHGESVWDRFRISRDNSKNTLFLQMNSLRAEDTALYFC | YHGESVWDRFRISRDNSKNTLFLQMNSLRAEDTALYFC | VRDQREDECEEWWSDYYDFGKELPCRKFRGLGLAGIFDI | VRDQREDECEEWWSDYYDFGKELPCRKFRGLGLAGIFDI | WGHGTMVIVS | WGHGTMVIVS |
The output is a sadie.renumbering.result.NumberingResults object. This object contains the following fields:
Field | Description |
---|---|
Id | The sequence ID |
sequence | The input amino acid sequence |
domain_no | not used |
hmm_species | The top species found by the HMM |
chain_type | The chain type, e.g. 'H' or 'L' |
e-value | The e-value of the alignment |
score | The score of the alignment |
seqstart_index | The index in the sequence where the alignment starts |
seqend_index | The index in the sequence where the alignment ends |
identity_species | The species the sequence aligns to best |
v_gene | The top V gene |
v_identity | V gene identity |
j_gene | The top J gene in alignment |
j_identity | J gene identity |
Chain | not used |
Numbering | The numbering of the sequence stored as an array |
Insertion | The insertions if any stored as an array |
Numbered_Sequence | The matched sequence stored as an array |
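scheme | The numbering scheme, e.g. "kabat" |
region_definition | The CDR/framework region definition, e.g. "imgt" |
allowed_species | The species allowed for the alignment |
allowed_chains | The chain types allowed for the alignment |
fwr1_aa_gaps | Framework 1 amino acid sequence, with gaps |
fwr1_aa_no_gaps | Framework 1 amino acid sequence, without gaps |
cdr1_aa_gaps | CDR1 amino acid sequence, with gaps |
cdr1_aa_no_gaps | CDR1 amino acid sequence, without gaps |
fwr2_aa_gaps | Framework 2 amino acid sequence, with gaps |
fwr2_aa_no_gaps | Framework 2 amino acid sequence, without gaps |
cdr2_aa_gaps | CDR2 amino acid sequence, with gaps |
cdr2_aa_no_gaps | CDR2 amino acid sequence, without gaps |
fwr3_aa_gaps | Framework 3 amino acid sequence, with gaps |
fwr3_aa_no_gaps | Framework 3 amino acid sequence, without gaps |
cdr3_aa_gaps | CDR3 amino acid sequence, with gaps |
cdr3_aa_no_gaps | CDR3 amino acid sequence, without gaps |
fwr4_aa_gaps | Framework 4 amino acid sequence, with gaps |
fwr4_aa_no_gaps | Framework 4 amino acid sequence, without gaps |
leader | The sequence that comes before the alignment |
follow | The sequence that comes after the alignment |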
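The Numbering, Insertion, and Numbered_Sequence fields are parallel arrays, so they can be combined into position labels such as 52A. The following is a minimal sketch in plain Python, not part of the SADIE API. The helper name position_labels is hypothetical, and we assume that Insertion holds an empty string where there is no insertion and the insertion letter otherwise.

```python
from typing import List


def position_labels(numbering: List[int], insertion: List[str]) -> List[str]:
    """Join each position number with its insertion code, e.g. 52 + 'A' -> '52A'.

    Hypothetical helper; assumes '' marks positions without an insertion.
    """
    if len(numbering) != len(insertion):
        raise ValueError("Numbering and Insertion must have the same length")
    return [f"{number}{code}" for number, code in zip(numbering, insertion)]


# Excerpt around Chothia position 52 of VRC26.27 (residues I, T, N, D)
numbering = [51, 52, 52, 53]
insertion = ["", "", "A", ""]
residues = ["I", "T", "N", "D"]

labels = position_labels(numbering, insertion)
for label, residue in zip(labels, residues):
    print(label, residue)
```

The labels produced this way correspond to the column names used in the alignment table.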
The NumberingResults object is a pandas DataFrame and can be used like one. It also provides an alignment table, which can be retrieved as follows.
# Use the Renumbering module
import pandas as pd
from sadie.renumbering import Renumbering

# define a single sequence
vrc26_seq = "QKQLVESGGGVVQPGRSLTLSCAASQFPFSHYGMHWVRQAPGKGLEWVASITNDGTKKYHGESVWDRFRISRDNSKNTLFLQMNSLRAEDTALYFCVRDQREDECEEWWSDYYDFGKELPCRKFRGLGLAGIFDIWGHGTMVIVS"


# wrap in a function so we can use multiprocessing
def run() -> pd.DataFrame:
    # set up the API object
    renumbering_api = Renumbering(scheme="chothia", region_assign="imgt", run_multiproc=True)
    # run the sequence, given a sequence ID and a sequence, and get the numbering table
    numbering_table = renumbering_api.run_single("VRC26.27", vrc26_seq)
    # get the alignment table
    return numbering_table.get_alignment_table()


if __name__ == "__main__":
    print(run())
Warning
If you set run_multiproc=True, the call must currently be wrapped in a function, as in the examples on this page. It will also work inside a Jupyter notebook cell.
The get_alignment_table() method returns an alignment table with one column per numbered position.
Id | chain_type | scheme | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 52A | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 82A | 82B | 82C | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 100A | 100B | 100C | 100D | 100E | 100F | 100G | 100H | 100I | 100J | 100K | 100L | 100M | 100N | 100O | 100P | 100Q | 100R | 100S | 100T | 100U | 100V | 100W | 100X | 100Y | 100Z | 100a | 100b | 100c | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | VRC26.27 | H | chothia | Q | K | Q | L | V | E | S | G | G | G | V | V | Q | P | G | R | S | L | T | L | S | C | A | A | S | Q | F | P | F | S | H | Y | G | M | H | W | V | R | Q | A | P | G | K | G | L | E | W | V | A | S | I | T | N | D | G | T | K | K | Y | H | G | E | S | V | W | D | R | F | R | I | S | R | D | N | S | K | N | T | L | F | L | Q | M | N | S | L | R | A | E | D | T | A | L | Y | F | C | V | R | D | Q | R | E | D | E | C | E | E | W | W | S | D | Y | Y | D | F | G | K | E | L | P | C | R | K | F | R | G | L | G | L | A | G | I | F | D | I | W | G | H | G | T | M | V | I | V | S |
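The column labels of the alignment table follow the order of the scheme: a position comes before its insertions (52, then 52A), and uppercase insertion codes come before lowercase ones (100Z, then 100a). A minimal sketch of a sort key that reproduces this order, assuming labels consist of a number followed by an optional letter code; position_key is a hypothetical helper and not part of SADIE:

```python
import re
from typing import Tuple


def position_key(label: str) -> Tuple[int, str]:
    """Split a label such as '100a' into (100, 'a') so that it sorts in scheme order."""
    match = re.fullmatch(r"(\d+)([A-Za-z]*)", label)
    if match is None:
        raise ValueError(f"not a position label: {label!r}")
    number, code = match.groups()
    # '' sorts before 'A', and 'Z' sorts before 'a', matching the table header
    return int(number), code


shuffled = ["100a", "53", "100Z", "52A", "100", "52"]
ordered = sorted(shuffled, key=position_key)
print(ordered)
```

A plain string sort would place 100 before 52, which is why the numeric part is compared as an integer.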
Multiple Sequence Numbering
You can also renumber a FASTA file from the command line.
$ sadie renumbering -q catnap_aa_heavy_sample.fasta
The output will be catnap_aa_heavy_sample_numbering_segment.csv, which is the table from the NumberingResults, and catnap_aa_heavy_sample_numbering_alignment.csv, which is the alignment table.
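The two output names above appear to be built from the input file name without its extension. A small sketch of that naming pattern, useful for locating the results in a script. The helper expected_outputs is hypothetical, and we assume the files are written next to the input file, which the command line tool may not do.

```python
from pathlib import Path
from typing import Tuple


def expected_outputs(fasta_path: str) -> Tuple[Path, Path]:
    """Return the (segment, alignment) CSV paths inferred from the input name.

    Hypothetical helper; assumes the outputs sit in the same directory as the input.
    """
    path = Path(fasta_path)
    segment = path.with_name(f"{path.stem}_numbering_segment.csv")
    alignment = path.with_name(f"{path.stem}_numbering_alignment.csv")
    return segment, alignment


segment_csv, alignment_csv = expected_outputs("catnap_aa_heavy_sample.fasta")
print(segment_csv)
print(alignment_csv)
```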
# Use the Renumbering module
import pandas as pd
from sadie.renumbering import Renumbering

# define a FASTA file
catnap_fasta = "catnap_aa_heavy_sample.fasta"


# wrap in a function so we can use multiprocessing
def run() -> pd.DataFrame:
    # set up the API object
    renumbering_api = Renumbering(scheme="chothia", region_assign="imgt", run_multiproc=True)
    # run the renumbering on a file
    numbering_table = renumbering_api.run_file(catnap_fasta)
    return numbering_table


if __name__ == "__main__":
    print(run())
Schemes
These are the current numbering schemes we have implemented.
Scheme | Description |
---|---|
chothia | Chothia numbering scheme |
kabat | Kabat numbering scheme |
imgt | IMGT numbering scheme |
Region definitions
Given a numbering scheme, we can define CDRs and frameworks with the following definitions.
['imgt', 'kabat', 'chothia', 'abm', 'contact', 'scdr']
Region Definition | Description |
---|---|
imgt | IMGT |
kabat | Kabat |
chothia | Chothia |
abm | ABM |
contact | Contact |
scdr | SCDR |
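Because the scheme and the region definition are chosen independently, it can help to check both names before starting a long run. A minimal sketch using only the names listed in the tables above. The helper check_options is hypothetical, and we do not assume that SADIE accepts every combination of scheme and region definition.

```python
from typing import Tuple

# Names copied from the Schemes and Region definitions tables
SCHEMES = ("chothia", "kabat", "imgt")
REGION_DEFINITIONS = ("imgt", "kabat", "chothia", "abm", "contact", "scdr")


def check_options(scheme: str, region_assign: str) -> Tuple[str, str]:
    """Normalize case and reject names that are not listed in the tables."""
    scheme = scheme.lower()
    region_assign = region_assign.lower()
    if scheme not in SCHEMES:
        raise ValueError(f"unknown scheme {scheme!r}; expected one of {SCHEMES}")
    if region_assign not in REGION_DEFINITIONS:
        raise ValueError(
            f"unknown region definition {region_assign!r}; expected one of {REGION_DEFINITIONS}"
        )
    return scheme, region_assign


print(check_options("Chothia", "IMGT"))
```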
The following descriptions of each definition are taken from the Martin group:
- The Kabat definition is based on sequence variability and is the most commonly used.
- The Chothia definition is based on the location of the structural loop regions.
- The AbM definition is a compromise between the two, used by Oxford Molecular's AbM antibody modelling software.
- The contact definition was introduced by the Martin group and is based on an analysis of the available complex crystal structures. This definition is likely the most useful for people wishing to perform mutagenesis to modify the affinity of an antibody, since these residues take part in interactions with antigens. The Martin group also provides lists of the CDR residues making contact in each antibody, with summary data for each CDR.
- SCDR is the longest CDR definition for each region. It is used in industry.
For an excellent review of the numbering schemes and region definitions, see this paper
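To make the difference between region definitions concrete, the sketch below extracts heavy-chain CDR1 from the start of the VRC26.27 sequence under four definitions. The boundaries are the commonly cited CDR-H1 ranges in Chothia numbering and are our assumption here; they are not read from SADIE. The helper extract_region is hypothetical.

```python
from typing import Dict, List, Tuple

# Commonly cited CDR-H1 boundaries in Chothia numbering (inclusive); an assumption
CDR_H1_BOUNDS: Dict[str, Tuple[int, int]] = {
    "kabat": (31, 35),
    "chothia": (26, 32),
    "abm": (26, 35),
    "contact": (30, 35),
}


def extract_region(numbering: List[int], residues: str, start: int, end: int) -> str:
    """Collect the residues whose position number lies within [start, end]."""
    return "".join(r for n, r in zip(numbering, residues) if start <= n <= end)


# First 39 residues of VRC26.27; positions 1 to 39 carry no insertions
residues = "QKQLVESGGGVVQPGRSLTLSCAASQFPFSHYGMHWVRQ"
numbering = list(range(1, len(residues) + 1))

cdr_h1 = {
    name: extract_region(numbering, residues, start, end)
    for name, (start, end) in CDR_H1_BOUNDS.items()
}
for name, region in cdr_h1.items():
    print(f"{name:8s} {region}")
```

For comparison, the IMGT definition used in the examples on this page gives QFPFSHYG for the same region, as shown in the cdr1_aa_no_gaps column.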