Reference Module
The SADIE reference module abstracts the underlying reference data used by the AIRR and Numbering modules. Both of these modules use external database files. Their organization (particularly by AIRR, which ports IGBlast) can be extremely complicated. Making a new reference database is a tedious and time-consuming task. This module provides a simple interface for making your own reference databases.
Built-in reference
SADIE ships with a reference database that contains the most common species along with functional genes. Because this database is comprehensive, the average user will not need this module. You can inspect each entry directly under src/sadie/airr/data/ for the AIRR module and src/sadie/anarci/data for the renumbering module. A more convenient way to browse the reference database is to view reference.yml, whose structure is described in the reference YAML section of this page.
Bundled Germline Gene Data
New germline gene segments are being discovered at a rapid pace. To meet the needs of this changing landscape, SADIE ships bundled germline gene records from IMGT and custom genes annotated by programs such as IGDiscover. Reference generation reads these package resources offline.
Example: read bundled IMGT genes offline
The example below reads src/sadie/reference/data/imgt-g3.json.gz through importlib.resources and writes the first 5 human V-gene segments in IMGT notation.
import gzip
import json
from importlib.resources import files

# The live G3 API is retired; gene records ship inside the package and are read offline.
collection = files("sadie.reference") / "data" / "imgt-g3.json.gz"

# Open the packaged resource as bytes, then let gzip decode it as text.
with collection.open("rb") as raw, gzip.open(raw, "rt") as handle:
    genes = json.load(handle)

# Keep the first 5 human V-gene records.
human_v = [g for g in genes if g["common"] == "human" and g["gene_segment"] == "V"][:5]
print(json.dumps(human_v, indent=4))

# Write the selection to disk, closing the file when done.
with open("human_v.json", "w") as out:
    json.dump(human_v, out, indent=4)
The output will be a JSON file containing the V-gene segment records and all relevant information needed by SADIE to write out databases used by the AIRR and Numbering modules.
human_v.json
[
{
"_id": "608b90908e6710a05b587046",
"source": "imgt",
"common": "human",
"gene": "IGHV1-18*01",
"label": "V-REGION",
"gene_segment": "V",
"receptor": "IG",
"sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"latin": "Homo_sapiens",
"gene_curation_source": null,
"chimera": null,
"imgt": {
"sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR",
"cdr3": "GCGAGAGA",
"cdr3_aa": "AR",
"fwr4": null,
"fwr4_aa": null,
"cdr3_start": 288,
"cdr3_end": 295,
"fwr4_start": null,
"fwr4_end": null,
"reading_frame": null,
"ignored": null,
"not_implemented": null,
"expression": null,
"expression_match": null,
"remainder": null,
"imgt_numbering": null,
"sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
"fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
"fwr1_start": 0,
"fwr1_end": 74,
"cdr1": "GGTTACACCTTTACCAGCTATGGT",
"cdr1_aa": "GYTFTSYG",
"cdr1_start": 75,
"cdr1_end": 98,
"fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
"fwr2_aa": "ISWVRQAPGQGLEWMGW",
"fwr2_start": 99,
"fwr2_end": 149,
"cdr2": "ATCAGCGCTTACAATGGTAACACA",
"cdr2_aa": "ISAYNGNT",
"cdr2_start": 150,
"cdr2_end": 173,
"fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGT",
"fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYC",
"fwr3_start": 174,
"fwr3_end": 287,
"imgt_functional": "F",
"contrived_functional": "F"
}
},
{
"_id": "608b90908e6710a05b58704b",
"source": "imgt",
"common": "human",
"gene": "IGHV1-18*04",
"label": "V-REGION",
"gene_segment": "V",
"receptor": "IG",
"sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"latin": "Homo_sapiens",
"gene_curation_source": null,
"chimera": null,
"imgt": {
"sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR",
"cdr3": "GCGAGAGA",
"cdr3_aa": "AR",
"fwr4": null,
"fwr4_aa": null,
"cdr3_start": 288,
"cdr3_end": 295,
"fwr4_start": null,
"fwr4_end": null,
"reading_frame": null,
"ignored": null,
"not_implemented": null,
"expression": null,
"expression_match": null,
"remainder": null,
"imgt_numbering": null,
"sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
"fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
"fwr1_start": 0,
"fwr1_end": 74,
"cdr1": "GGTTACACCTTTACCAGCTACGGT",
"cdr1_aa": "GYTFTSYG",
"cdr1_start": 75,
"cdr1_end": 98,
"fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
"fwr2_aa": "ISWVRQAPGQGLEWMGW",
"fwr2_start": 99,
"fwr2_end": 149,
"cdr2": "ATCAGCGCTTACAATGGTAACACA",
"cdr2_aa": "ISAYNGNT",
"cdr2_start": 150,
"cdr2_end": 173,
"fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGT",
"fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYC",
"fwr3_start": 174,
"fwr3_end": 287,
"imgt_functional": "F",
"contrived_functional": "F"
}
},
{
"_id": "608b90908e6710a05b587055",
"source": "imgt",
"common": "human",
"gene": "IGHV1-2*02",
"label": "V-REGION",
"gene_segment": "V",
"receptor": "IG",
"sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"latin": "Homo_sapiens",
"gene_curation_source": null,
"chimera": null,
"imgt": {
"sequence_gapped": "CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC......AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TGYYMHWVRQAPGQGLEWMGWINPN..SGGTNYAQKFQ.GRVTMTRDTSISTAYMELSRLRSDDTAVYYCAR",
"cdr3": "GCGAGAGA",
"cdr3_aa": "AR",
"fwr4": null,
"fwr4_aa": null,
"cdr3_start": 288,
"cdr3_end": 295,
"fwr4_start": null,
"fwr4_end": null,
"reading_frame": null,
"ignored": null,
"not_implemented": null,
"expression": null,
"expression_match": null,
"remainder": null,
"imgt_numbering": null,
"sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"fwr1": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
"fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
"fwr1_start": 0,
"fwr1_end": 74,
"cdr1": "GGATACACCTTCACCGGCTACTAT",
"cdr1_aa": "GYTFTGYY",
"cdr1_start": 75,
"cdr1_end": 98,
"fwr2": "ATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
"fwr2_aa": "MHWVRQAPGQGLEWMGW",
"fwr2_start": 99,
"fwr2_end": 149,
"cdr2": "ATCAACCCTAACAGTGGTGGCACA",
"cdr2_aa": "INPNSGGT",
"cdr2_start": 150,
"cdr2_end": 173,
"fwr3": "AACTATGCACAGAAGTTTCAGGGCAGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGT",
"fwr3_aa": "NYAQKFQGRVTMTRDTSISTAYMELSRLRSDDTAVYYC",
"fwr3_start": 174,
"fwr3_end": 287,
"imgt_functional": "F",
"contrived_functional": "F"
}
},
{
"_id": "608b90908e6710a05b587056",
"source": "imgt",
"common": "human",
"gene": "IGHV1-2*04",
"label": "V-REGION",
"gene_segment": "V",
"receptor": "IG",
"sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"latin": "Homo_sapiens",
"gene_curation_source": null,
"chimera": null,
"imgt": {
"sequence_gapped": "CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAAC......AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TGYYMHWVRQAPGQGLEWMGWINPN..SGGTNYAQKFQ.GWVTMTRDTSISTAYMELSRLRSDDTAVYYCAR",
"cdr3": "GCGAGAGA",
"cdr3_aa": "AR",
"fwr4": null,
"fwr4_aa": null,
"cdr3_start": 288,
"cdr3_end": 295,
"fwr4_start": null,
"fwr4_end": null,
"reading_frame": null,
"ignored": null,
"not_implemented": null,
"expression": null,
"expression_match": null,
"remainder": null,
"imgt_numbering": null,
"sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
"fwr1": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
"fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
"fwr1_start": 0,
"fwr1_end": 74,
"cdr1": "GGATACACCTTCACCGGCTACTAT",
"cdr1_aa": "GYTFTGYY",
"cdr1_start": 75,
"cdr1_end": 98,
"fwr2": "ATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
"fwr2_aa": "MHWVRQAPGQGLEWMGW",
"fwr2_start": 99,
"fwr2_end": 149,
"cdr2": "ATCAACCCTAACAGTGGTGGCACA",
"cdr2_aa": "INPNSGGT",
"cdr2_start": 150,
"cdr2_end": 173,
"fwr3": "AACTATGCACAGAAGTTTCAGGGCTGGGTCACCATGACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGCCGTGTATTACTGT",
"fwr3_aa": "NYAQKFQGWVTMTRDTSISTAYMELSRLRSDDTAVYYC",
"fwr3_start": 174,
"fwr3_end": 287,
"imgt_functional": "F",
"contrived_functional": "F"
}
},
{
"_id": "608b90908e6710a05b587063",
"source": "imgt",
"common": "human",
"gene": "IGHV1-24*01",
"label": "V-REGION",
"gene_segment": "V",
"receptor": "IG",
"sequence": "CAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCCGGATACACCCTCACTGAATTATCCATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATCCTGAAGATGGTGAAACAATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA",
"latin": "Homo_sapiens",
"gene_curation_source": null,
"chimera": null,
"imgt": {
"sequence_gapped": "CAGGTCCAGCTGGTACAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCCGGATACACCCTC............ACTGAATTATCCATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATCCTGAA......GATGGTGAAACAATCTACGCACAGAAGTTCCAG...GGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA",
"sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKVSGYTL....TELSMHWVRQAPGKGLEWMGGFDPE..DGETIYAQKFQ.GRVTMTEDTSTDTAYMELSSLRSEDTAVYYCAT",
"cdr3": "GCAACAGA",
"cdr3_aa": "AT",
"fwr4": null,
"fwr4_aa": null,
"cdr3_start": 288,
"cdr3_end": 295,
"fwr4_start": null,
"fwr4_end": null,
"reading_frame": null,
"ignored": null,
"not_implemented": null,
"expression": null,
"expression_match": null,
"remainder": null,
"imgt_numbering": null,
"sequence": "CAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCCGGATACACCCTCACTGAATTATCCATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGTTTTGATCCTGAAGATGGTGAAACAATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGTGCAACAGA",
"fwr1": "CAGGTCCAGCTGGTACAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGTTTCC",
"fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKVS",
"fwr1_start": 0,
"fwr1_end": 74,
"cdr1": "GGATACACCCTCACTGAATTATCC",
"cdr1_aa": "GYTLTELS",
"cdr1_start": 75,
"cdr1_end": 98,
"fwr2": "ATGCACTGGGTGCGACAGGCTCCTGGAAAAGGGCTTGAGTGGATGGGAGGT",
"fwr2_aa": "MHWVRQAPGKGLEWMGG",
"fwr2_start": 99,
"fwr2_end": 149,
"cdr2": "TTTGATCCTGAAGATGGTGAAACA",
"cdr2_aa": "FDPEDGET",
"cdr2_start": 150,
"cdr2_end": 173,
"fwr3": "ATCTACGCACAGAAGTTCCAGGGCAGAGTCACCATGACCGAGGACACATCTACAGACACAGCCTACATGGAGCTGAGCAGCCTGAGATCTGAGGACACGGCCGTGTATTACTGT",
"fwr3_aa": "IYAQKFQGRVTMTEDTSTDTAYMELSSLRSEDTAVYYC",
"fwr3_start": 174,
"fwr3_end": 287,
"imgt_functional": "F",
"contrived_functional": "F"
}
}
]
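The region coordinates in each record appear to be 0-based with an inclusive end: `fwr1_start` is 0 and `fwr1_end` is 74 for a 75-nucleotide FWR1. Under that assumption, a region can be sliced out of the ungapped `sequence` as sketched below. The `region` helper is hypothetical and not part of SADIE, and the record is abbreviated to the first 99 nucleotides (FWR1 plus CDR1) of the IGHV1-18*01 entry shown above.

```python
# Abbreviated copy of the IGHV1-18*01 record shown above: only the first
# 99 nucleotides (FWR1 + CDR1) and their coordinates are kept.
record = {
    "gene": "IGHV1-18*01",
    "imgt": {
        "sequence": (
            "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT"
            "GGTTACACCTTTACCAGCTATGGT"
        ),
        "fwr1_start": 0,
        "fwr1_end": 74,
        "cdr1_start": 75,
        "cdr1_end": 98,
    },
}


def region(record: dict, name: str) -> str:
    """Slice a region out of the ungapped sequence (hypothetical helper).

    Assumes 0-based coordinates with an inclusive end, hence the `+ 1`.
    """
    imgt = record["imgt"]
    return imgt["sequence"][imgt[f"{name}_start"] : imgt[f"{name}_end"] + 1]


fwr1 = region(record, "fwr1")
cdr1 = region(record, "cdr1")
print(len(fwr1), cdr1)
```

If the coordinates were end-exclusive instead, the slices would come out one nucleotide short of the `fwr1` and `cdr1` strings stored in the record, which makes this a quick consistency check.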
Offline reference data
No network call is needed. SADIE resolves reference genes from bundled package data; this direct gzip read is only for inspecting the records used by the docs.
Generating AIRR Reference Database
$ sadie reference make -o my_output_database_path -d reference.yml
from sadie.reference import References
reference_path = "reference.yml"
references_object = References.from_yaml(reference_path)
outpath = "my_output_database_path"
germline_path = references_object.make_airr_database(outpath)
The reference YAML
The reference YAML file nests four levels: a reference name, a database, a species, and a list of genes.
name:
  database:
    species:
      - gene1
      - gene2
    species2:
      - gene3
      - gene4
| Field | Description | Example |
|---|---|---|
| name | The name that this reference will be called in SADIE | human, mouse, clk |
| database | The database that the gene comes from | IMGT or custom |
| species | The name of the species that will be used in the annotation table | human, mouse |
| gene | The full gene name | IGHV3-23*01 |
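As a rough illustration of the nesting described in the table, the sketch below checks a reference that has already been parsed into Python dictionaries. One plausible failure it catches is a list item written without a space after the dash (`-gene1`), which YAML reads as plain text rather than a list entry. The `validate_reference` helper and both sample inputs are assumptions for illustration, not SADIE API.

```python
from typing import Any, List


def validate_reference(reference: Any) -> List[str]:
    """Return a list of problems found in a name -> database -> species -> genes mapping."""
    if not isinstance(reference, dict):
        return ["top level: expected a mapping of reference names"]

    problems: List[str] = []
    for name, databases in reference.items():
        if not isinstance(databases, dict):
            problems.append(f"{name}: expected a mapping of databases")
            continue
        for database, species_map in databases.items():
            if not isinstance(species_map, dict):
                problems.append(f"{name}/{database}: expected a mapping of species")
                continue
            for species, genes in species_map.items():
                is_gene_list = isinstance(genes, list) and all(
                    isinstance(gene, str) for gene in genes
                )
                if not is_gene_list:
                    problems.append(
                        f"{name}/{database}/{species}: expected a list of gene names"
                    )
    return problems


# Hypothetical inputs: one well-formed, one where the gene list collapsed to a string.
good = {"human": {"imgt": {"human": ["IGHV3-23*01", "IGHD3-3*01", "IGHJ6*01"]}}}
bad = {"human": {"imgt": {"human": "-IGHV3-23*01 -IGHD3-3*01 -IGHJ6*01"}}}

print(validate_reference(good))
print(validate_reference(bad))
```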
Why do we allow multiple species?
Most of the time, the name and the species will be the same, for example:
human:
  imgt:
    human:
      - IGHV3-23*01
      - IGHD3-3*01
      - IGHJ6*01
However, you may sometimes work with chimeric models, where a transgene is knocked into a model species. Consider the HuGL mouse models from Deli et al. (2020):
hugl18:
  imgt:
    human:
      - IGHV4-59*01
      - IGHD3-3*01
      - IGHJ3*02
    mouse:
      - IGHV1-11*01
      - IGHV1-12*01
      - IGHV1-13*01
      - IGHV1-14*01
      ...
The HuGL18 model has the full mouse background plus three gene segments knocked in from human.
Again, a full list of built-in databases, species and genes can be found in the bundled reference.yml and the package data under src/sadie/reference/data/.
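To make the chimeric case concrete, the sketch below flattens one named entry into per-gene dictionaries with `species`, `gene`, and `source` keys, the shape passed to `add_gene` in the examples on this page, and counts the genes contributed by each species. It uses only the HuGL18 genes listed explicitly above (the elided mouse genes are not filled in). `to_gene_entries` is a hypothetical helper, not SADIE API.

```python
from collections import Counter
from typing import Dict, List

# Only the genes explicitly listed above; the mouse list is truncated there.
hugl18 = {
    "imgt": {
        "human": ["IGHV4-59*01", "IGHD3-3*01", "IGHJ3*02"],
        "mouse": ["IGHV1-11*01", "IGHV1-12*01", "IGHV1-13*01", "IGHV1-14*01"],
    }
}


def to_gene_entries(databases: Dict[str, Dict[str, List[str]]]) -> List[dict]:
    """Flatten database -> species -> genes into one dictionary per gene."""
    return [
        {"species": species, "gene": gene, "source": database}
        for database, species_map in databases.items()
        for species, genes in species_map.items()
        for gene in genes
    ]


entries = to_gene_entries(hugl18)
per_species = Counter(entry["species"] for entry in entries)
print(per_species)
```

Keeping the species on every entry is what lets a single named reference mix human and mouse genes.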
Generating AIRR database with Reference Class
Rather than generate a pre-configured database, SADIE can also generate a reference file on the fly. This is useful for procedural analysis, where you generate custom genes for multiple species.
import tempfile

from sadie.reference import Reference, References

# Create an empty reference object.
ref_class = Reference()

with tempfile.TemporaryDirectory() as tmpdirectory:
    # Add genes one at a time.
    ref_class.add_gene({"species": "human", "gene": "IGHV1-69*01", "source": "imgt"})
    ref_class.add_gene({"species": "human", "gene": "IGHD3-3*01", "source": "imgt"})
    ref_class.add_gene({"species": "human", "gene": "IGHJ6*01", "source": "imgt"})

    # Register the reference under a name, then write the AIRR database to the path.
    references = References()
    references.add_reference("human", ref_class)
    references.make_airr_database(tmpdirectory)
Alternatively, the YAML file can serve as a template for adding more genes:
from sadie.reference import Reference, References
from sadie.reference.yaml import YamlRef

# Pass no file to use the bundled reference.yml.
yml_ref = YamlRef()

# Create an empty reference object and the container that will hold it.
ref_class = Reference()
references = References()

# Iterate through YamlRef: name -> database -> species -> list of genes.
for name in yml_ref:
    for database in yml_ref[name]:
        for single_species in yml_ref[name][database]:
            # genes is a list of gene names
            genes = yml_ref[name][database][single_species]
            # only want the custom macaque genes
            if database == "custom" and single_species == "macaque":
                # keep only V genes: the fourth character (index 3) is the segment letter
                only_vs = [gene for gene in genes if gene[3] == "V"]
                for gene in only_vs[:5]:  # only the first 5
                    ref_class.add_gene({"gene": gene, "species": single_species, "source": database})

references.add_reference("small_macaque", ref_class)
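The V-gene filter above relies on the fourth character of an IMGT-style gene name (index 3) being the segment letter, as in `IGHV3-23*01`. The sketch below isolates that convention so it can be tried without SADIE installed. `segment_of` and `first_v_genes` are hypothetical helpers, and the input list reuses gene names that appear elsewhere on this page.

```python
from typing import List


def segment_of(gene: str) -> str:
    """Return the segment letter (V, D or J) of an IMGT-style gene name.

    Assumes names such as IGHV3-23*01, where index 3 holds the segment type.
    """
    return gene[3]


def first_v_genes(genes: List[str], limit: int = 5) -> List[str]:
    """Keep only V genes, preserving order, and return at most `limit` of them."""
    return [gene for gene in genes if segment_of(gene) == "V"][:limit]


genes = ["IGHV3-23*01", "IGHD3-3*01", "IGHJ6*01", "IGHV1-69*01"]
print(first_v_genes(genes))
```

Names that do not follow this layout would need a different rule, so the positional lookup is best treated as a convenience rather than a general parser.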
- Copyright © Jordan R. Willis, Troy Sincomb, and Caleb K. Kibet