
Reference Module

The SADIE reference module abstracts the underlying reference data used by the AIRR and Numbering modules. Both modules depend on external database files whose organization (particularly for AIRR, which ports IGBlast) can be extremely complicated, so building a new reference database by hand is tedious and time-consuming. This module provides a simple interface for making your own reference databases.

Builtin reference

SADIE ships with a reference database that contains the most common species along with functional genes. The average user will not need to use this module as the database is comprehensive. You can inspect each entry directly in src/sadie/airr/data/ (for AIRR) and src/sadie/anarci/data (for the renumbering module). A more convenient way to browse the reference database is to view reference.yml; its structure is described in the section The reference YAML.

Germline Gene Gateway

New germline gene segments are being discovered at a rapid pace. To meet the needs of this changing landscape, SADIE gets all of the germline gene info from a programmatic API called the Germline Gene Gateway. This API is hosted as a free service. It consists of germline genes from IMGT as well as custom genes that have been annotated and cataloged by programs such as IGDiscover. To explore the API, visit the Germline Gene Gateway. This RESTful API conforms to the OpenAPI 3.0 specification.

Examples of how to use the G3 API

The following examples show how to pull genes programmatically using the command-line utilities curl and wget, and the requests library in Python. Each example fetches the first 5 human V-gene segments sourced from IMGT.

$ curl -X 'GET' 'https://g3.jordanrwillis.com/api/v1/genes?source=imgt&segment=V&common=human&limit=5' -H 'accept: application/json' -o 'human_v.json'

$ wget 'https://g3.jordanrwillis.com/api/v1/genes?source=imgt&segment=V&common=human&limit=5' -O human_v.json

import json

import requests

from sadie.reference import G3Error

url = "https://g3.jordanrwillis.com/api/v1/genes?source=imgt&segment=V&common=human&limit=5"
response = requests.get(url)
# check the status before parsing, since an error response may not be valid JSON
if response.status_code != 200:
    raise G3Error("Error: " + str(response.status_code))
response_json = response.json()
print(json.dumps(response_json, indent=4))
# use a context manager so the file is flushed and closed
with open("human_v.json", "w") as f:
    json.dump(response_json, f, indent=4)

The output will be a JSON file containing the V-gene segments and all the information SADIE needs to write the databases used by the AIRR and Numbering modules.

human_v.json

    [
        {
            "_id": "608b90908e6710a05b587046",
            "source": "imgt",
            "common": "human",
            "gene": "IGHV1-18*01",
            "label": "V-REGION",
            "gene_segment": "V",
            "receptor": "IG",
            "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
            "latin": "Homo_sapiens",
            "imgt": {
                "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR",
                "fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
                "fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
                "fwr1_start": 0,
                "fwr1_end": 74,
                "cdr1": "GGTTACACCTTTACCAGCTATGGT",
                "cdr1_aa": "GYTFTSYG",
                "cdr1_start": 75,
                "cdr1_end": 98,
                "fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
                "fwr2_aa": "ISWVRQAPGQGLEWMGW",
                "fwr2_start": 99,
                "fwr2_end": 149,
                "cdr2": "ATCAGCGCTTACAATGGTAACACA",
                "cdr2_aa": "ISAYNGNT",
                "cdr2_start": 150,
                "cdr2_end": 173,
                "fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGT",
                "fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYC",
                "fwr3_start": 174,
                "fwr3_end": 287,
                "cdr3": "GCGAGAGA",
                "cdr3_aa": "AR",
                "cdr3_start": 288,
                "cdr3_end": 295,
                "imgt_functional": "F",
                "contrived_functional": "F"
            }
        },
        {
            "_id": "608b90908e6710a05b587048",
            "source": "imgt",
            "common": "human",
            "gene": "IGHV1-18*02",
            "label": "V-REGION",
            "gene_segment": "V",
            "receptor": "IG",
            "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTAAGATCTGACGACACGGCC",
            "latin": "Homo_sapiens",
            "imgt": {
                "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTAAGATCTGACGACACGGCC",
                "sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTAAGATCTGACGACACGGCC",
                "sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDTA",
                "fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
                "fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
                "fwr1_start": 0,
                "fwr1_end": 74,
                "cdr1": "GGTTACACCTTTACCAGCTATGGT",
                "cdr1_aa": "GYTFTSYG",
                "cdr1_start": 75,
                "cdr1_end": 98,
                "fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
                "fwr2_aa": "ISWVRQAPGQGLEWMGW",
                "fwr2_start": 99,
                "fwr2_end": 149,
                "cdr2": "ATCAGCGCTTACAATGGTAACACA",
                "cdr2_aa": "ISAYNGNT",
                "cdr2_start": 150,
                "cdr2_end": 173,
                "fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTAAGATCTGACGACACGGCC",
                "fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTA",
                "fwr3_start": 174,
                "fwr3_end": 275,
                "cdr3": "",
                "cdr3_aa": "",
                "cdr3_start": null,
                "cdr3_end": null,
                "imgt_functional": "F",
                "contrived_functional": "F"
            }
        },
        {
            "_id": "608b90908e6710a05b587049",
            "source": "imgt",
            "common": "human",
            "gene": "IGHV1-18*03",
            "label": "V-REGION",
            "gene_segment": "V",
            "receptor": "IG",
            "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACATGGCCGTGTATTACTGTGCGAGAGA",
            "latin": "Homo_sapiens",
            "imgt": {
                "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACATGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTATGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACATGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDMAVYYCAR",
                "fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
                "fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
                "fwr1_start": 0,
                "fwr1_end": 74,
                "cdr1": "GGTTACACCTTTACCAGCTATGGT",
                "cdr1_aa": "GYTFTSYG",
                "cdr1_start": 75,
                "cdr1_end": 98,
                "fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
                "fwr2_aa": "ISWVRQAPGQGLEWMGW",
                "fwr2_start": 99,
                "fwr2_end": 149,
                "cdr2": "ATCAGCGCTTACAATGGTAACACA",
                "cdr2_aa": "ISAYNGNT",
                "cdr2_start": 150,
                "cdr2_end": 173,
                "fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACATGGCCGTGTATTACTGT",
                "fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDMAVYYC",
                "fwr3_start": 174,
                "fwr3_end": 287,
                "cdr3": "GCGAGAGA",
                "cdr3_aa": "AR",
                "cdr3_start": 288,
                "cdr3_end": 295,
                "imgt_functional": "F",
                "contrived_functional": "F"
            }
        },
        {
            "_id": "608b90908e6710a05b58704b",
            "source": "imgt",
            "common": "human",
            "gene": "IGHV1-18*04",
            "label": "V-REGION",
            "gene_segment": "V",
            "receptor": "IG",
            "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
            "latin": "Homo_sapiens",
            "imgt": {
                "sequence": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTTACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTACAATGGTAACACAAACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped": "CAGGTTCAGCTGGTGCAGTCTGGAGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGTTACACCTTT............ACCAGCTACGGTATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGGATCAGCGCTTAC......AATGGTAACACAAACTATGCACAGAAGCTCCAG...GGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TSYGISWVRQAPGQGLEWMGWISAY..NGNTNYAQKLQ.GRVTMTTDTSTSTAYMELRSLRSDDTAVYYCAR",
                "fwr1": "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
                "fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
                "fwr1_start": 0,
                "fwr1_end": 74,
                "cdr1": "GGTTACACCTTTACCAGCTACGGT",
                "cdr1_aa": "GYTFTSYG",
                "cdr1_start": 75,
                "cdr1_end": 98,
                "fwr2": "ATCAGCTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGATGG",
                "fwr2_aa": "ISWVRQAPGQGLEWMGW",
                "fwr2_start": 99,
                "fwr2_end": 149,
                "cdr2": "ATCAGCGCTTACAATGGTAACACA",
                "cdr2_aa": "ISAYNGNT",
                "cdr2_start": 150,
                "cdr2_end": 173,
                "fwr3": "AACTATGCACAGAAGCTCCAGGGCAGAGTCACCATGACCACAGACACATCCACGAGCACAGCCTACATGGAGCTGAGGAGCCTGAGATCTGACGACACGGCCGTGTATTACTGT",
                "fwr3_aa": "NYAQKLQGRVTMTTDTSTSTAYMELRSLRSDDTAVYYC",
                "fwr3_start": 174,
                "fwr3_end": 287,
                "cdr3": "GCGAGAGA",
                "cdr3_aa": "AR",
                "cdr3_start": 288,
                "cdr3_end": 295,
                "imgt_functional": "F",
                "contrived_functional": "F"
            }
        },
        {
            "_id": "608b90908e6710a05b587053",
            "source": "imgt",
            "common": "human",
            "gene": "IGHV1-2*01",
            "label": "V-REGION",
            "gene_segment": "V",
            "receptor": "IG",
            "sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGACGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCAGTACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGTCGTGTATTACTGTGCGAGAGA",
            "latin": "Homo_sapiens",
            "imgt": {
                "sequence": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTCACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGACGGATCAACCCTAACAGTGGTGGCACAAACTATGCACAGAAGTTTCAGGGCAGGGTCACCAGTACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGTCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped": "CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCTGGATACACCTTC............ACCGGCTACTATATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGACGGATCAACCCTAAC......AGTGGTGGCACAAACTATGCACAGAAGTTTCAG...GGCAGGGTCACCAGTACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGTCGTGTATTACTGTGCGAGAGA",
                "sequence_gapped_aa": "QVQLVQSGA.EVKKPGASVKVSCKASGYTF....TGYYMHWVRQAPGQGLEWMGRINPN..SGGTNYAQKFQ.GRVTSTRDTSISTAYMELSRLRSDDTVVYYCAR",
                "fwr1": "CAGGTGCAGCTGGTGCAGTCTGGGGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT",
                "fwr1_aa": "QVQLVQSGAEVKKPGASVKVSCKAS",
                "fwr1_start": 0,
                "fwr1_end": 74,
                "cdr1": "GGATACACCTTCACCGGCTACTAT",
                "cdr1_aa": "GYTFTGYY",
                "cdr1_start": 75,
                "cdr1_end": 98,
                "fwr2": "ATGCACTGGGTGCGACAGGCCCCTGGACAAGGGCTTGAGTGGATGGGACGG",
                "fwr2_aa": "MHWVRQAPGQGLEWMGR",
                "fwr2_start": 99,
                "fwr2_end": 149,
                "cdr2": "ATCAACCCTAACAGTGGTGGCACA",
                "cdr2_aa": "INPNSGGT",
                "cdr2_start": 150,
                "cdr2_end": 173,
                "fwr3": "AACTATGCACAGAAGTTTCAGGGCAGGGTCACCAGTACCAGGGACACGTCCATCAGCACAGCCTACATGGAGCTGAGCAGGCTGAGATCTGACGACACGGTCGTGTATTACTGT",
                "fwr3_aa": "NYAQKFQGRVTSTRDTSISTAYMELSRLRSDDTVVYYC",
                "fwr3_start": 174,
                "fwr3_end": 287,
                "cdr3": "GCGAGAGA",
                "cdr3_aa": "AR",
                "cdr3_start": 288,
                "cdr3_end": 295,
                "imgt_functional": "F",
                "contrived_functional": "F"
            }
        }
    ]
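The region coordinates in the sample above appear to be 0-based with an inclusive end: fwr1 runs from 0 to 74 and is 75 nucleotides long. The sketch below slices a region out of a record under that assumption. `extract_region` is a hypothetical helper and not part of SADIE, and the record is a truncated copy of IGHV1-18*01 holding only FWR1 and CDR1.

```python
from typing import Any, Dict


def extract_region(record: Dict[str, Any], region: str) -> str:
    """Slice a region (e.g. "cdr1") out of the ungapped IMGT sequence.

    Assumes <region>_start / <region>_end are 0-based and end-inclusive,
    which is what the sample output suggests.
    """
    imgt = record["imgt"]
    start, end = imgt[f"{region}_start"], imgt[f"{region}_end"]
    if start is None or end is None:
        # Some alleles (e.g. IGHV1-18*02 above) have no annotated CDR3.
        return ""
    return imgt["sequence"][start : end + 1]


# Truncated copy of the first record above: FWR1 followed by CDR1 only.
record = {
    "gene": "IGHV1-18*01",
    "imgt": {
        "sequence": (
            "CAGGTTCAGCTGGTGCAGTCTGGAGCTGAGGTGAAGAAGCCTGGGGCCTCAGTGAAGGTCTCCTGCAAGGCTTCT"
            "GGTTACACCTTTACCAGCTATGGT"
        ),
        "fwr1_start": 0,
        "fwr1_end": 74,
        "cdr1": "GGTTACACCTTTACCAGCTATGGT",
        "cdr1_start": 75,
        "cdr1_end": 98,
        "cdr3_start": None,
        "cdr3_end": None,
    },
}

cdr1 = extract_region(record, "cdr1")
print(record["gene"], cdr1, cdr1 == record["imgt"]["cdr1"])
```

If the sliced string ever disagrees with the stored region field, the coordinate assumption does not hold for that record and should be checked against the G3 documentation.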

G3 API

The G3 API can be explored live through the G3 API Documentation. It is a clean, non-redundant dataset that can be used programmatically in any project. To learn more, explore the source code. SADIE abstracts most connections with G3, so you should not have to interact with the API directly.
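When a direct query is needed, the query string can be assembled from its parameters rather than written by hand. The sketch below uses only the endpoint and the four query parameters that appear in the curl and wget examples on this page. `build_genes_url` is a hypothetical helper, not a SADIE function, and other parameters the API may accept are not covered.

```python
from urllib.parse import urlencode

# Base endpoint taken from the curl/wget examples on this page.
G3_GENES_ENDPOINT = "https://g3.jordanrwillis.com/api/v1/genes"


def build_genes_url(source: str, segment: str, common: str, limit: int) -> str:
    """Assemble a genes query URL from its four query parameters."""
    query = urlencode(
        {"source": source, "segment": segment, "common": common, "limit": limit}
    )
    return f"{G3_GENES_ENDPOINT}?{query}"


url = build_genes_url(source="imgt", segment="V", common="human", limit=5)
print(url)
```

The resulting string can be passed to `requests.get` or to curl unchanged.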

Generating AIRR Reference Database

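A database can be generated from a reference YAML file on the command line:

$ sadie reference make -o my_output_database_path -d reference.yml

The same can be done from Python:

from sadie.reference import References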

reference_path = "reference.yml"
references_object = References.from_yaml(reference_path)

outpath = "my_output_database_path"
germline_path = references_object.make_airr_database(outpath)

The reference YAML

The reference YAML file has the following structure.

name:
  database:
    species:
    - gene1
    - gene2
    species2:
    - gene3
    - gene4

| Field | Description | Example |
|---|---|---|
| name | The name that this reference will be called in SADIE | human, mouse, clk |
| database | The database that the gene comes from | IMGT or custom |
| species | The name of the species that will be used in the annotation table | human, mouse |
| gene | The full gene name | IGHV3-23*01 |

Why do we allow multiple species?

Most of the time, the name and the species will be the same, for example:

human:
    imgt:
        human:
            - IGHV3-23*01
            - IGHD3-3*01
            - IGHJ6*01

However, you may sometimes work with chimeric models where a transgene is knocked into a model species. Consider the HuGL mouse models from Deli et al. (2020):

hugl18:
    imgt:
        human:
        - IGHV4-59*01
        - IGHD3-3*01
        - IGHJ3*02
        mouse:
        - IGHV1-11*01
        - IGHV1-12*01
        - IGHV1-13*01
        - IGHV1-14*01
    ...

The HuGL18 model has the full mouse background plus three human gene segments knocked in.

Again, a full list of databases, species and genes can be found by exploring the G3 API. Click the Try it out button to run a query.
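To make the nesting concrete, the sketch below walks a name, database, species, gene layout and yields one flat dictionary per gene. The layout is written as a plain Python dict rather than parsed from YAML so the example has no third-party dependencies, and it uses a shortened copy of the hugl18 entry. `iter_gene_entries` is a hypothetical helper for illustration and is not how SADIE itself reads the file.

```python
from typing import Dict, Iterator, List

# name -> database -> species -> list of genes, mirroring the YAML layout.
ReferenceLayout = Dict[str, Dict[str, Dict[str, List[str]]]]

layout: ReferenceLayout = {
    "hugl18": {
        "imgt": {
            "human": ["IGHV4-59*01", "IGHD3-3*01", "IGHJ3*02"],
            "mouse": ["IGHV1-11*01", "IGHV1-12*01"],
        }
    }
}


def iter_gene_entries(layout: ReferenceLayout, name: str) -> Iterator[Dict[str, str]]:
    """Yield one {"species", "gene", "source"} dict per gene under a name."""
    for database, species_map in layout[name].items():
        for species, genes in species_map.items():
            for gene in genes:
                yield {"species": species, "gene": gene, "source": database}


entries = list(iter_gene_entries(layout, "hugl18"))
for entry in entries:
    print(entry)
```

Each entry keeps its own species, which is what lets a single name such as hugl18 mix human and mouse genes.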

Generating AIRR database with Reference Class

Rather than generate a pre-configured database, SADIE can also generate a reference file on the fly. This is useful for procedural analysis, where you generate custom genes for multiple species.

import tempfile

from sadie.reference import Reference, References

# create empty reference object
ref_class = Reference()
with tempfile.TemporaryDirectory() as tmpdirectory:
    # Add genes one at a time
    ref_class.add_gene({"species": "human", "gene": "IGHV1-69*01", "source": "imgt"})
    ref_class.add_gene({"species": "human", "gene": "IGHD3-3*01", "source": "imgt"})
    ref_class.add_gene({"species": "human", "gene": "IGHJ6*01", "source": "imgt"})

    # call make_airr database on a path
    references = References()
    references.add_reference("human", ref_class)
    references.make_airr_database(tmpdirectory)

Alternatively, the YAML file can be used as a template to add more genes:

import tempfile

from sadie.reference import Reference, References
from sadie.reference.yaml import YamlRef

# enter no file to use reference.yml
yml_ref = YamlRef()

# create empty reference object
ref_class = Reference()

# references class
references = References()

# Iterate through YamlRef
for name in yml_ref:
    # each name maps to a dictionary of databases
    for database in yml_ref[name]:
        species = yml_ref[name][database]
        for single_species in species:
            # genes is a list of gene names
            genes = yml_ref[name][database][single_species]
            # keep only macaque genes from the custom database
            if database == "custom" and single_species == "macaque":
                # keep only V genes; the segment letter is the fourth character (index 3)
                only_vs = [gene for gene in genes if gene[3] == "V"]
                for gene in only_vs[:5]:  # only the first 5
                    ref_class.add_gene({"gene": gene, "species": single_species, "source": database})

references.add_reference("small_macaque", ref_class)
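The V-gene filter in the loop above relies on the segment letter being the fourth character of the gene name. A minimal standalone sketch of that check follows, assuming names follow the IMGT pattern of a three-letter locus prefix (such as IGH) followed by the segment letter. `genes_by_segment` is a hypothetical helper, not part of SADIE.

```python
from typing import Iterable, List


def genes_by_segment(genes: Iterable[str], segment: str) -> List[str]:
    """Return gene names whose fourth character matches the segment letter."""
    # The length guard avoids an IndexError on malformed or empty names.
    return [gene for gene in genes if len(gene) > 3 and gene[3] == segment]


genes = ["IGHV3-23*01", "IGHD3-3*01", "IGHJ6*01", "IGHV4-59*01"]
print(genes_by_segment(genes, "V"))
```

Names that do not follow this pattern would need a more careful parser than a single index lookup.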
Copyright © Jordan R. Willis, Troy Sincomb, and Caleb K. Kibet