AIRR Annotation

Annotation is the bedrock of all immunoinformatics workflows. It is the process of identifying CDRs and framework regions, levels of somatic mutation, locus use, productive rearrangements, and other features that describe a B cell receptor or T cell receptor (BCR/TCR). But how can we compare the output of one data pipeline with that of another? What if two pipelines describe a repertoire, or even a single BCR/TCR, with different fields and datatypes? Fear not! The AIRR community to the rescue!

"AIRR Data Representations are versioned specifications that consist of a file format and a well-defined schema[...] The schema defines the data model, field names, data types, and encodings for AIRR standard objects. Strict typing enables interoperability and data sharing between different AIRR-seq analysis tools and repositories[...]"

The AIRR Standards 1.3 documentation

SADIE leverages the AIRR standard to provide a standardized data representation for BCRs. All fields and values are described in the AIRR Rearrangement schema standard.

Single Sequence Annotation

# import pandas for dataframe handling
import pandas as pd

# import the SADIE Airr module
from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# output object types
print(type(airr_table))
print(isinstance(airr_table, pd.DataFrame))

The output will contain <class 'sadie.airr.airrtable.airrtable.AirrTable'>, showing that the result is an instance of the AirrTable class, followed by True, because an AirrTable is also a pandas DataFrame.

Info

Running an AIRR method generates an AIRR table object. The AIRR table is a subclass of a pandas DataFrame, so it can be used with any pandas method. Pandas is the workhorse of the SADIE library, so we highly encourage some rudimentary knowledge of pandas to get the most out of SADIE.
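As a rough illustration of what "any pandas method" means in practice, the sketch below applies ordinary column selection and boolean filtering. It uses a plain pandas.DataFrame as a stand-in for an AirrTable so that it runs without SADIE installed (an assumption made for illustration). The column names and PG9 values come from the example output later on this page; the second row is invented.

```python
import pandas as pd

# Stand-in for an AirrTable: a plain DataFrame with a few AIRR columns.
# The PG9 row uses values from this page; the "made_up" row is hypothetical.
table = pd.DataFrame(
    {
        "sequence_id": ["PG9", "made_up"],
        "locus": ["IGH", "IGK"],
        "v_call": ["IGHV3-33*05", "IGKV1-39*01"],
        "productive": [True, False],
        "junction_aa_length": [32, 11],
    }
)

# Column selection works as with any DataFrame
summary = table[["sequence_id", "v_call", "junction_aa_length"]]

# Boolean filtering: keep productive heavy chains only
productive_heavy = table[(table["locus"] == "IGH") & table["productive"]]

print(summary)
print(productive_heavy["sequence_id"].tolist())
```

The same expressions should apply unchanged to the airr_table returned by run_single, since it carries the same column names.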

Writing Files

AIRR Rearrangement File

To output an AIRR file, we can use the AirrTable.to_airr() method.

# import the SADIE Airr module
from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# write airr table to a tsv file
airr_table.to_airr("PG9 AIRR.tsv")

# or compress your airr table into a gzip or bzip2 file
airr_table.to_airr("PG9 AIRR.tsv.gz")
airr_table.to_airr("PG9 AIRR.tsv.bz2")

The generated file, PG9 AIRR.tsv, is a tab-separated table that resembles the following:

sequence_id	sequence	species	locus	stop_codon	vj_in_frame	v_frameshift	productive	rev_comp	complete_vdj	v_call_top	v_call	d_call_top	d_call	j_call_top	j_call	sequence_alignment	germline_alignment	sequence_alignment_aa	germline_alignment_aa	v_alignment_start	v_alignment_end	d_alignment_start	d_alignment_end	j_alignment_start	j_alignment_end	v_sequence_alignment	v_sequence_alignment_aa	v_germline_alignment	v_germline_alignment_aa	d_sequence_alignment	d_sequence_alignment_aa	d_germline_alignment	d_germline_alignment_aa	j_sequence_alignment	j_sequence_alignment_aa	j_germline_alignment	j_germline_alignment_aa	fwr1	fwr1_aa	cdr1	cdr1_aa	fwr2	fwr2_aa	cdr2	cdr2_aa	fwr3	fwr3_aa	fwr4	fwr4_aa	cdr3	cdr3_aa	junction	junction_length	junction_aa	junction_aa_length	v_score	d_score	j_score	v_cigar	d_cigar	j_cigar	v_support	d_support	j_support	v_identity	d_identity	j_identity	v_sequence_start	v_sequence_end	v_germline_start	v_germline_end	d_sequence_start	d_sequence_end	d_germline_start	d_germline_end	j_sequence_start	j_sequence_end	j_germline_start	j_germline_end	fwr1_start	fwr1_end	cdr1_start	cdr1_end	fwr2_start	fwr2_end	cdr2_start	cdr2_end	fwr3_start	fwr3_end	fwr4_start	fwr4_end	cdr3_start	cdr3_end	np1	np1_length	np2	np2_length	liable	vdj_nt	vdj_aa	v_mutation	v_mutation_aa	d_mutation	d_mutation_aa	j_mutation	j_mutation_aa	v_penalty	d_penalty	j_penalty
PG9	CAGCGATTAGTGGAG...	human	IGH	F	T	F	T	F	F	IGHV3-33*05	IGHV3-33*05	IGHD3-3*01	IGHD3-3*01	IGHJ6*03	IGHJ6*03	CAGCGATTAGTGGAG...	GTGCAGCTGGTGGAG...	QRLVESGGGVVQPGS...	VQLVESGGGVVQPGR...	1	293	328	355	356	408	CAGCGATTAGTGGAG...	QRLVESGGGVVQPGS...	GTGCAGCTGGTGGAG...	VQLVESGGGVVQPGR...	TATTACGATTTCTAT...	YYDFYDGYY	TATTACGATTTTTGG...	YYDFWSGYY	ACTACCACTATATGG...	YHYMDVWGKGTTVTV...	ACTACTACTACATGG...	YYYMDVWGKGTTVTV...	CAGCGATTAGTGGAG...	QRLVESGGGVVQPGS...	GGATTCGACTTCAGT...	GFDFSRQG	ATGCACTGGGTCCGC...	MHWVRQAPGQGLEWV...	ATTAAATATGATGGA...	IKYDGSEK	TATCATGCTGACTCC...	YHADSVWGRLSISRD...	TGGGGCAAAGGGACC...	WGKGTTVTVSS	GTGAGAGAGGCTGGT...	VREAGGPDYRNGYNY...	TGTGTGAGAGAGGCT...	96	CVREAGGPDYRNGYN...	32	335.2	30.11	83.4	3N293M115S	327S1N28M53S2N	355S9N53M	4.83e-94	9.579e-05	3.428e-20	86	82.1	88.7	1	293	4	296	328	355	2	29	356	408	10	62	1	72	73	96	97	147	148	171	172	285	376	408	286	375	GGCTGGTGGGCCCGA...	34	nan	0	False	CAGCGATTAGTGGAG...	QRLVESGGGVVQPGS...	14	19.5876	17.875	22.2222	11.3125	5.88235	-1	-1	-2

This .tsv file is a Rearrangement Schema compliant AIRR table. These files have certain specifications, including a .tsv file suffix. Since they are AIRR compliant, they can be used by other AIRR compliant software. For instance, we could use the output .tsv in any module in the Immcantation portal.
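Since the file is plain tab-separated text, it can also be read without SADIE. The following is a minimal sketch using only Python's standard library. It assumes booleans are written as T/F, as in the row above, and it works on a trimmed-down, in-memory copy of the PG9 row rather than the full file. The helper name read_rearrangements is hypothetical.

```python
import csv
import io

# A trimmed-down AIRR TSV: only a handful of the columns shown above,
# with the PG9 values copied from the example row.
airr_tsv = (
    "sequence_id\tlocus\tproductive\tv_call\tj_call\tjunction_aa_length\n"
    "PG9\tIGH\tT\tIGHV3-33*05\tIGHJ6*03\t32\n"
)


def read_rearrangements(handle):
    """Yield one dict per rearrangement, decoding the T/F flag and the integer."""
    for row in csv.DictReader(handle, delimiter="\t"):
        row["productive"] = row["productive"] == "T"
        row["junction_aa_length"] = int(row["junction_aa_length"])
        yield row


records = list(read_rearrangements(io.StringIO(airr_tsv)))
print(records[0]["v_call"], records[0]["productive"])
```

To read a real file, the io.StringIO object would be replaced by an open file handle, for example open("PG9 AIRR.tsv", newline="").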

Other Output Formats

While the .tsv AIRR table is the recognized standard for AIRR, you can also write to any other format that pandas supports.

from sadie.airr import Airr

# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"

# setup API object
airr_api = Airr("human")

# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)

# write airr table to a csv
airr_table.to_csv("PG9 AIRR.csv")

# write to a json file
airr_table.to_json("PG9 AIRR.json", orient="records")

# write to a browser friendly html file
airr_table.to_html("PG9 AIRR.html")

# write to an excel file
airr_table.to_excel("PG9 AIRR.xlsx")

# write to a parquet file that is read by spark
airr_table.to_parquet("PG9 AIRR.parquet")

# write to a feather file that has rapid IO
airr_table.to_feather("PG9 AIRR.feather")

Attention

Because AirrTable is a subclass of pandas.DataFrame, you can use any pandas IO method to write to a file format of your choosing. Note, however, that these are not official Rearrangement Schema compliant AIRR tables. They can only be read by software that supports those file types, or read back in by SADIE, and they probably will not work in other software that supports the AIRR standard. These formats are nonetheless extremely useful for much larger files.

Reading Files

To read in an AIRR file, we have to create an AirrTable object.

Reading an AIRR.tsv

You can read an official AIRR .tsv file using the AirrTable.read_airr() method, or by reading it with pandas and casting the result to an AirrTable object.

import pandas as pd

from sadie.airr import AirrTable

# use the AirrTable method to read an AIRR .tsv file into an AirrTable object
pg9_path = "PG9 AIRR.tsv.gz"

airr_table = AirrTable.read_airr(pg9_path)
print(type(airr_table), isinstance(airr_table, AirrTable))

# or use pandas read_csv method
airr_table_from_pandas = AirrTable(pd.read_csv(pg9_path, sep="\t"))
print(type(airr_table_from_pandas), isinstance(airr_table_from_pandas, AirrTable))

Outputs:

<class 'sadie.airr.airrtable.airrtable.AirrTable'> True
<class 'sadie.airr.airrtable.airrtable.AirrTable'> True
True # The airr tables are equal
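The compressed file can be inspected with standard tools too. The sketch below assumes that the .tsv.gz output is ordinary gzip-compressed, tab-separated text. It writes a hypothetical three-column miniature of the PG9 table to a temporary directory and reads it back with the standard library's gzip and csv modules, so it does not touch any real file.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Hypothetical miniature standing in for "PG9 AIRR.tsv.gz"
rows = [
    ["sequence_id", "locus", "v_call"],
    ["PG9", "IGH", "IGHV3-33*05"],
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "PG9 AIRR.tsv.gz"

    # Write tab-separated text through gzip ("wt" opens in text mode)
    with gzip.open(path, "wt", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)

    # Read it back the same way ("rt" opens in text mode)
    with gzip.open(path, "rt", newline="") as handle:
        loaded = list(csv.DictReader(handle, delimiter="\t"))

print(loaded)
```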

Reading other file formats

Any other file format that pandas can read can be loaded by passing the resulting DataFrame to AirrTable.

import pandas as pd

from sadie.airr import AirrTable

# read an airr table from a csv
airr_table_1 = AirrTable(pd.read_csv("PG9 AIRR.csv"))

# read from a json file
airr_table_2 = AirrTable(pd.read_json("PG9 AIRR.json", orient="records"))

# read from an excel file
airr_table_3 = AirrTable(pd.read_excel("PG9 AIRR.xlsx"))

# read from a parquet file, a format that is also read by spark
airr_table_4 = AirrTable(pd.read_parquet("PG9 AIRR.parquet"))

# read from a feather file that has rapid IO
airr_table_5 = AirrTable(pd.read_feather("PG9 AIRR.feather"))
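One detail worth keeping in mind when round-tripping through CSV: by default, pandas' to_csv writes the row index as an unnamed first column, so a plain read_csv brings it back as an extra data column. The sketch below shows one way to handle this with index_col=0. It uses a plain DataFrame as a stand-in for an AirrTable and an in-memory buffer instead of a file (both assumptions made so the example runs without SADIE). How AirrTable itself treats an extra column is not covered here.

```python
import io

import pandas as pd

# Plain DataFrame standing in for an AirrTable (values from the PG9 example)
table = pd.DataFrame(
    {"sequence_id": ["PG9"], "locus": ["IGH"], "junction_aa_length": [32]}
)

# to_csv writes the row index as an unnamed first column by default
buffer = io.StringIO()
table.to_csv(buffer)
buffer.seek(0)

# index_col=0 turns that first column back into the index on the way in
round_trip = pd.read_csv(buffer, index_col=0)

print(list(round_trip.columns))
```

Passing index=False to to_csv when writing is an alternative that avoids the extra column altogether.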