AIRR Annotation

Annotation is the bedrock of all immunoinformatics workflows. It is the process of identifying CDRs and framework regions, levels of somatic mutation, locus use, productive rearrangements, and other features that describe a B cell receptor or T cell receptor (BCR/TCR). But how can we compare the output of one annotation pipeline with that of another? What if two pipelines describe a repertoire, or even a single BCR/TCR, with different fields and datatypes? Fear not! The AIRR community to the rescue!


"AIRR Data Representations are versioned specifications that consist of a file format and a well-defined schema[...] The schema defines the data model, field names, data types, and encodings for AIRR standard objects. Strict typing enables interoperability and data sharing between different AIRR-seq analysis tools and repositories[...]"


SADIE leverages the AIRR standard to provide a standardized data representation for BCRs. All of the fields and values are described in the AIRR Rearrangement schema documentation.

Single Sequence Annotation

# import pandas for dataframe handling
import pandas as pd

# import the SADIE Airr module
from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# output object types
print(type(airr_table))
print(isinstance(airr_table, pd.DataFrame))

The first line of output is <class 'sadie.airr.airrtable.airrtable.AirrTable'>, which shows that the result is an instance of the AirrTable class. The second line is True, because AirrTable is a subclass of pandas.DataFrame.

Info

Running an AIRR method generates an AirrTable object. AirrTable is a subclass of the pandas DataFrame and can therefore be used with any pandas method. Pandas is the workhorse of the SADIE library, so we highly encourage some rudimentary knowledge of pandas to get the most out of SADIE.

Writing Files

AIRR Rearrangement File

To output an AIRR file, we can use the AirrTable.to_airr() method.

# import the SADIE Airr module
from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# write airr table to a tsv file
airr_table.to_airr("PG9 AIRR.tsv")

# or write a gzip- or bzip2-compressed tsv file
airr_table.to_airr("PG9 AIRR.tsv.gz")
airr_table.to_airr("PG9 AIRR.tsv.bz2")

The tsv file PG9 AIRR.tsv generated will be a tabular datafile that will resemble the following:

sequence_id sequence species locus stop_codon vj_in_frame v_frameshift productive rev_comp complete_vdj v_call_top v_call d_call_top d_call j_call_top j_call sequence_alignment germline_alignment sequence_alignment_aa germline_alignment_aa v_alignment_start v_alignment_end d_alignment_start d_alignment_end j_alignment_start j_alignment_end v_sequence_alignment v_sequence_alignment_aa v_germline_alignment v_germline_alignment_aa d_sequence_alignment d_sequence_alignment_aa d_germline_alignment d_germline_alignment_aa j_sequence_alignment j_sequence_alignment_aa j_germline_alignment j_germline_alignment_aa fwr1 fwr1_aa cdr1 cdr1_aa fwr2 fwr2_aa cdr2 cdr2_aa fwr3 fwr3_aa fwr4 fwr4_aa cdr3 cdr3_aa junction junction_length junction_aa junction_aa_length v_score d_score j_score v_cigar d_cigar j_cigar v_support d_support j_support v_identity d_identity j_identity v_sequence_start v_sequence_end v_germline_start v_germline_end d_sequence_start d_sequence_end d_germline_start d_germline_end j_sequence_start j_sequence_end j_germline_start j_germline_end fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end cdr2_start cdr2_end fwr3_start fwr3_end fwr4_start fwr4_end cdr3_start cdr3_end np1 np1_length np2 np2_length liable vdj_nt vdj_aa v_mutation v_mutation_aa d_mutation d_mutation_aa j_mutation j_mutation_aa v_penalty d_penalty j_penalty
PG9 CAGCGATTAGTGGAG... human IGH F T F T F F IGHV3-33*05 IGHV3-33*05 IGHD3-3*01 IGHD3-3*01 IGHJ6*03 IGHJ6*03 CAGCGATTAGTGGAG... GTGCAGCTGGTGGAG... QRLVESGGGVVQPGS... VQLVESGGGVVQPGR... 1 293 328 355 356 408 CAGCGATTAGTGGAG... QRLVESGGGVVQPGS... GTGCAGCTGGTGGAG... VQLVESGGGVVQPGR... TATTACGATTTCTAT... YYDFYDGYY TATTACGATTTTTGG... YYDFWSGYY ACTACCACTATATGG... YHYMDVWGKGTTVTV... ACTACTACTACATGG... YYYMDVWGKGTTVTV... CAGCGATTAGTGGAG... QRLVESGGGVVQPGS... GGATTCGACTTCAGT... GFDFSRQG ATGCACTGGGTCCGC... MHWVRQAPGQGLEWV... ATTAAATATGATGGA... IKYDGSEK TATCATGCTGACTCC... YHADSVWGRLSISRD... TGGGGCAAAGGGACC... WGKGTTVTVSS GTGAGAGAGGCTGGT... VREAGGPDYRNGYNY... TGTGTGAGAGAGGCT... 96 CVREAGGPDYRNGYN... 32 335.2 30.11 83.4 3N293M115S 327S1N28M53S2N 355S9N53M 4.83e-94 9.579e-05 3.428e-20 86 82.1 88.7 1 293 4 296 328 355 2 29 356 408 10 62 1 72 73 96 97 147 148 171 172 285 376 408 286 375 GGCTGGTGGGCCCGA... 34 nan 0 False CAGCGATTAGTGGAG... QRLVESGGGVVQPGS... 14 19.5876 17.875 22.2222 11.3125 5.88235 -1 -1 -2

This .tsv file is an AIRR table that complies with the Rearrangement schema. Such files must meet certain specifications, including a .tsv file suffix. Because they are AIRR compliant, they can be used by other AIRR-compliant software. For instance, we could use the output .tsv with any of the tools in the Immcantation portal.
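
Because the file is plain tab-separated text, software outside SADIE can parse it without any special library. The following is a minimal sketch using only Python's standard library. The in-memory excerpt (a handful of columns with values copied from the PG9 row above) and the helper names `decode_bool` and `read_rearrangements` are stand-ins written for this illustration; they are not part of SADIE or of any AIRR library. It assumes booleans are written as T/F, as in the table above.

```python
import csv
import io

# A small excerpt of the PG9 row shown above (hypothetical in-memory file).
airr_tsv = (
    "sequence_id\tlocus\tproductive\tv_call\tj_call\tjunction_aa_length\n"
    "PG9\tIGH\tT\tIGHV3-33*05\tIGHJ6*03\t32\n"
)


def decode_bool(value):
    """Map the T/F strings used in the table to booleans (anything else -> None)."""
    return {"T": True, "F": False}.get(value)


def read_rearrangements(handle):
    """Parse tab-separated rows into dictionaries keyed by column name."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["productive"] = decode_bool(row["productive"])
        row["junction_aa_length"] = int(row["junction_aa_length"])
        rows.append(row)
    return rows


records = read_rearrangements(io.StringIO(airr_tsv))
print(records[0]["sequence_id"], records[0]["v_call"], records[0]["productive"])
```

Every column arrives as a string, so a consumer has to convert types itself; only two columns are converted here to keep the sketch short.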

Other Output Formats

While the .tsv AIRR table is the recognized standard for AIRR, you can also write to any other format that pandas supports.

from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# write airr table to a csv
airr_table.to_csv("PG9 AIRR.csv")

# write to a json file
airr_table.to_json("PG9 AIRR.json", orient="records")

# write to a browser friendly html file
airr_table.to_html("PG9 AIRR.html")

# write to an excel file
airr_table.to_excel("PG9 AIRR.xlsx")

# write to a parquet file that can be read by Spark
airr_table.to_parquet("PG9 AIRR.parquet")

# write to a feather file that has rapid IO
airr_table.to_feather("PG9 AIRR.feather")

Attention

Because AirrTable is a subclass of pandas.DataFrame, you can use any pandas IO method to write to a file format of your choosing. Note, however, that these files are not official Rearrangement schema compliant AIRR tables. They can only be read by software that supports those file types, or read back in by SADIE, and will probably not work in other software that supports the AIRR standard. That said, these file formats are extremely useful for much larger files.
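
To make the difference concrete, here is a rough sketch of what separates a records-oriented JSON export from a tab-separated table, again using only the standard library. The JSON excerpt is a hypothetical stand-in for a file written with orient="records" (one object per row), and `to_tsv_cell` and `records_to_tsv` are illustrative helpers, not SADIE functions. Encoding booleans as T/F follows the table shown earlier; writing null as an empty cell is an assumption made for this example. Re-encoding a few columns this way does not by itself make a file schema compliant.

```python
import csv
import io
import json

# Hypothetical excerpt of a records-oriented JSON export (one object per row).
json_text = '[{"sequence_id": "PG9", "locus": "IGH", "productive": true, "np2": null}]'


def to_tsv_cell(value):
    """Encode a JSON value as a TSV cell: booleans as T/F, null as empty."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "T" if value else "F"
    return str(value)


def records_to_tsv(records):
    """Write a list of row dictionaries as tab-separated text."""
    fieldnames = list(records[0])
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fieldnames)
    for record in records:
        writer.writerow([to_tsv_cell(record.get(name)) for name in fieldnames])
    return buffer.getvalue()


tsv_text = records_to_tsv(json.loads(json_text))
print(tsv_text)
```

In practice it is simpler to read the file back into an AirrTable and call to_airr(), as described in the sections above and below.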

Reading Files

To read in an AIRR file, we have to create an AirrTable object.

Reading an AIRR.tsv

You can read an official AIRR .tsv file using the AirrTable.read_airr() method, or read it with pandas and cast the result to an AirrTable object.

import pandas as pd

from sadie.airr import AirrTable

# path to the AIRR file written earlier
pg9_path = "PG9 AIRR.tsv.gz"

# use the AirrTable method to read the AIRR .tsv into an AirrTable object
airr_table = AirrTable.read_airr(pg9_path)
print(type(airr_table), isinstance(airr_table, AirrTable))

# or use pandas read_csv method
airr_table_from_pandas = AirrTable(pd.read_csv(pg9_path, sep="\t"))
print(type(airr_table_from_pandas), isinstance(airr_table_from_pandas, AirrTable))

Outputs:

<class 'sadie.airr.airrtable.airrtable.AirrTable'> True
<class 'sadie.airr.airrtable.airrtable.AirrTable'> True

Both approaches produce equal AIRR tables.
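
The example above reads a gzip-compressed file directly from its path. As a library-independent illustration of how a reader can pick the right decompressor from the file suffix, here is a small standard-library sketch. The helper `open_airr_text` and the three-column row are hypothetical and exist only for this example; this is not how SADIE or pandas implement it internally.

```python
import bz2
import csv
import gzip
import tempfile
from pathlib import Path


def open_airr_text(path, mode="rt"):
    """Open a plain, gzip, or bzip2 file in text mode based on its suffix."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, mode, newline="")
    if path.suffix == ".bz2":
        return bz2.open(path, mode, newline="")
    return open(path, mode, newline="")


header = ["sequence_id", "locus", "productive"]
row = {"sequence_id": "PG9", "locus": "IGH", "productive": "T"}

# Write the same row to all three file types, then read each one back.
with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for name in ("PG9 AIRR.tsv", "PG9 AIRR.tsv.gz", "PG9 AIRR.tsv.bz2"):
        target = Path(tmp) / name
        with open_airr_text(target, "wt") as handle:
            writer = csv.DictWriter(handle, fieldnames=header, delimiter="\t")
            writer.writeheader()
            writer.writerow(row)
        with open_airr_text(target, "rt") as handle:
            results[name] = list(csv.DictReader(handle, delimiter="\t"))

print(all(rows == [row] for rows in results.values()))
```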

Reading other file formats

Any other file format that pandas can read can be loaded by passing the resulting DataFrame to AirrTable.

import pandas as pd

from sadie.airr import AirrTable

# read an airr table from a csv file
airr_table_1 = AirrTable(pd.read_csv("PG9 AIRR.csv"))

# read from a json file
airr_table_2 = AirrTable(pd.read_json("PG9 AIRR.json", orient="records"))

# read from an excel file
airr_table_3 = AirrTable(pd.read_excel("PG9 AIRR.xlsx"))

# read from a parquet file
airr_table_4 = AirrTable(pd.read_parquet("PG9 AIRR.parquet"))

# read from a feather file
airr_table_5 = AirrTable(pd.read_feather("PG9 AIRR.feather"))