AIRR Annotation¶
Annotation is the bedrock of all immunoinformatics workflows. It is the process of identifying CDRs and framework regions, levels of somatic mutation, locus use, productive rearrangements, and other features that describe a B cell receptor or T cell receptor (BCR/TCR). But how can we compare the output of one annotation pipeline with that of another when each describes a repertoire, or even a single BCR/TCR, with different fields and data types? Fear not! The AIRR community to the rescue!
"AIRR Data Representations are versioned specifications that consist of a file format and a well-defined schema[...] The schema defines the data model, field names, data types, and encodings for AIRR standard objects. Strict typing enables interoperability and data sharing between different AIRR-seq analysis tools and repositories[...]"
SADIE leverages the AIRR standard to provide a standardized data representation for BCRs. All fields and their values are described in the AIRR Rearrangement schema standard.
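To make the idea of a typed schema concrete, here is a minimal sketch using only the Python standard library. It lists the fields that, to the best of our reading, the Rearrangement schema marks as required, and decodes the T/F boolean encoding used in AIRR TSV files. The helper names missing_required and decode_bool are illustrative assumptions; they are not part of SADIE or of the AIRR reference library, so check the specification before relying on the field list.

```python
# Illustrative only: a tiny subset of the checks that a shared schema enables.
# REQUIRED_FIELDS reflects our reading of the Rearrangement schema; verify it
# against the published specification.
REQUIRED_FIELDS = (
    "sequence_id", "sequence", "rev_comp", "productive",
    "v_call", "d_call", "j_call",
    "sequence_alignment", "germline_alignment",
    "junction", "junction_aa",
    "v_cigar", "d_cigar", "j_cigar",
)


def missing_required(row):
    """Return the required field names that are absent from a row (a dict)."""
    return [name for name in REQUIRED_FIELDS if name not in row]


def decode_bool(value):
    """Decode an AIRR TSV boolean cell: 'T' -> True, 'F' -> False, '' -> None."""
    mapping = {"T": True, "F": False, "": None}
    if value not in mapping:
        raise ValueError(f"not an AIRR TSV boolean: {value!r}")
    return mapping[value]


# A deliberately partial row, using values from the PG9 example table further down
row = {"sequence_id": "PG9", "productive": "T", "rev_comp": "F", "v_call": "IGHV3-33*05"}

print(decode_bool(row["productive"]))  # True
print(missing_required(row)[:3])       # ['sequence', 'd_call', 'j_call']
```

Because every compliant tool agrees on these names and encodings, a check like this works on any pipeline's output, not only SADIE's.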
Single Sequence Annotation¶
# import pandas for dataframe handling
import pandas as pd
# use the SADIE Airr module
from sadie.airr import Airr
# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"
# setup API object
airr_api = Airr("human")
# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)
# output object types
print(type(airr_table))
print(isinstance(airr_table, pd.DataFrame))
The first line of output is <class 'sadie.airr.airrtable.airrtable.AirrTable'>, showing that the result is an instance of the AirrTable class. The second line is True, because an AirrTable is also a pandas DataFrame.
Info
Running an AIRR method generates an AIRR table object. The AIRR table is a subclass of a pandas dataframe and thus can be used with any pandas method. Pandas is the workhorse of the SADIE library, so we highly encourage some rudimentary knowledge of pandas to get the most out of SADIE.
Writing Files¶
AIRR Rearrangement File¶
To output an AIRR file, we can use the AirrTable.to_airr() method.
# import the SADIE Airr module
from sadie.airr import Airr
# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"
# setup API object
airr_api = Airr("human")
# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)
# write airr table to a tsv file
airr_table.to_airr("PG9 AIRR.tsv")
# or compress your airr table into a gzip or bzip2 file
airr_table.to_airr("PG9 AIRR.tsv.gz")
airr_table.to_airr("PG9 AIRR.tsv.bz2")
The generated tsv file PG9 AIRR.tsv will be a tabular data file that resembles the following:
sequence_id | sequence | species | locus | stop_codon | vj_in_frame | v_frameshift | productive | rev_comp | complete_vdj | v_call_top | v_call | d_call_top | d_call | j_call_top | j_call | sequence_alignment | germline_alignment | sequence_alignment_aa | germline_alignment_aa | v_alignment_start | v_alignment_end | d_alignment_start | d_alignment_end | j_alignment_start | j_alignment_end | v_sequence_alignment | v_sequence_alignment_aa | v_germline_alignment | v_germline_alignment_aa | d_sequence_alignment | d_sequence_alignment_aa | d_germline_alignment | d_germline_alignment_aa | j_sequence_alignment | j_sequence_alignment_aa | j_germline_alignment | j_germline_alignment_aa | fwr1 | fwr1_aa | cdr1 | cdr1_aa | fwr2 | fwr2_aa | cdr2 | cdr2_aa | fwr3 | fwr3_aa | fwr4 | fwr4_aa | cdr3 | cdr3_aa | junction | junction_length | junction_aa | junction_aa_length | v_score | d_score | j_score | v_cigar | d_cigar | j_cigar | v_support | d_support | j_support | v_identity | d_identity | j_identity | v_sequence_start | v_sequence_end | v_germline_start | v_germline_end | d_sequence_start | d_sequence_end | d_germline_start | d_germline_end | j_sequence_start | j_sequence_end | j_germline_start | j_germline_end | fwr1_start | fwr1_end | cdr1_start | cdr1_end | fwr2_start | fwr2_end | cdr2_start | cdr2_end | fwr3_start | fwr3_end | fwr4_start | fwr4_end | cdr3_start | cdr3_end | np1 | np1_length | np2 | np2_length | liable | vdj_nt | vdj_aa | v_mutation | v_mutation_aa | d_mutation | d_mutation_aa | j_mutation | j_mutation_aa | v_penalty | d_penalty | j_penalty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PG9 | CAGCGATTAGTGGAG... | human | IGH | F | T | F | T | F | F | IGHV3-33*05 | IGHV3-33*05 | IGHD3-3*01 | IGHD3-3*01 | IGHJ6*03 | IGHJ6*03 | CAGCGATTAGTGGAG... | GTGCAGCTGGTGGAG... | QRLVESGGGVVQPGS... | VQLVESGGGVVQPGR... | 1 | 293 | 328 | 355 | 356 | 408 | CAGCGATTAGTGGAG... | QRLVESGGGVVQPGS... | GTGCAGCTGGTGGAG... | VQLVESGGGVVQPGR... | TATTACGATTTCTAT... | YYDFYDGYY | TATTACGATTTTTGG... | YYDFWSGYY | ACTACCACTATATGG... | YHYMDVWGKGTTVTV... | ACTACTACTACATGG... | YYYMDVWGKGTTVTV... | CAGCGATTAGTGGAG... | QRLVESGGGVVQPGS... | GGATTCGACTTCAGT... | GFDFSRQG | ATGCACTGGGTCCGC... | MHWVRQAPGQGLEWV... | ATTAAATATGATGGA... | IKYDGSEK | TATCATGCTGACTCC... | YHADSVWGRLSISRD... | TGGGGCAAAGGGACC... | WGKGTTVTVSS | GTGAGAGAGGCTGGT... | VREAGGPDYRNGYNY... | TGTGTGAGAGAGGCT... | 96 | CVREAGGPDYRNGYN... | 32 | 335.2 | 30.11 | 83.4 | 3N293M115S | 327S1N28M53S2N | 355S9N53M | 4.83e-94 | 9.579e-05 | 3.428e-20 | 86 | 82.1 | 88.7 | 1 | 293 | 4 | 296 | 328 | 355 | 2 | 29 | 356 | 408 | 10 | 62 | 1 | 72 | 73 | 96 | 97 | 147 | 148 | 171 | 172 | 285 | 376 | 408 | 286 | 375 | GGCTGGTGGGCCCGA... | 34 | nan | 0 | False | CAGCGATTAGTGGAG... | QRLVESGGGVVQPGS... | 14 | 19.5876 | 17.875 | 22.2222 | 11.3125 | 5.88235 | -1 | -1 | -2 |
This .tsv file is a Rearrangement Schema compliant AIRR table. These files have certain specifications, including a .tsv file suffix. Since they are AIRR compliant, they can be used by other AIRR compliant software. For instance, we could use the output .tsv in any module in the Immcantation portal.
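Because the columns follow a shared convention, they can be inspected without SADIE at all. As a small sketch, the v_cigar, d_cigar, and j_cigar values from the PG9 row above can be parsed with the standard library. The function names are hypothetical, and the sets of query-consuming and reference-consuming operations follow the usual SAM CIGAR conventions, which we assume apply here.

```python
import re

# Operations that consume positions in the query (the input sequence) or in
# the reference (the germline gene), following SAM CIGAR conventions.
QUERY_OPS = set("MIS=X")
REFERENCE_OPS = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '3N293M115S' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def consumed(cigar, ops):
    """Sum the lengths of the operations in `cigar` that belong to `ops`."""
    return sum(n for n, op in parse_cigar(cigar) if op in ops)


# CIGAR strings copied from the PG9 row above
cigars = {"v": "3N293M115S", "d": "327S1N28M53S2N", "j": "355S9N53M"}

for segment, cigar in cigars.items():
    print(segment, parse_cigar(cigar), consumed(cigar, QUERY_OPS))
```

Each of the three strings accounts for 408 query positions, which agrees with the j_sequence_end value of 408 in the table. For the V gene, the reference-consuming operations sum to 296, matching v_germline_end.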
Other Output Formats¶
While the .tsv AIRR table is the recognized standard for AIRR, you can also output to any other format that pandas supports.
from sadie.airr import Airr
# define a single sequence
pg9_seq = "CAGCGATTAGTGGAGTCTGGGGGAGGCGTGGTCCAGCCTGGGTCGTCCCTGAGACTCTCCTGTGCAGCGTCCGGATTCGACTTCAGTAGACAAGGCATGCACTGGGTCCGCCAGGCTCCAGGCCAGGGGCTGGAGTGGGTGGCATTTATTAAATATGATGGAAGTGAGAAATATCATGCTGACTCCGTATGGGGCCGACTCAGCATCTCCAGAGACAATTCCAAGGATACGCTTTATCTCCAAATGAATAGCCTGAGAGTCGAGGACACGGCTACATATTTTTGTGTGAGAGAGGCTGGTGGGCCCGACTACCGTAATGGGTACAACTATTACGATTTCTATGATGGTTATTATAACTACCACTATATGGACGTCTGGGGCAAAGGGACCACGGTCACCGTCTCGAGC"
# setup API object
airr_api = Airr("human")
# run sequence and return airr table with sequence_id and sequence
airr_table = airr_api.run_single("PG9", pg9_seq)
# write airr table to a csv
airr_table.to_csv("PG9 AIRR.csv")
# write to a json file
airr_table.to_json("PG9 AIRR.json", orient="records")
# write to a browser friendly html file
airr_table.to_html("PG9 AIRR.html")
# write to an excel file
airr_table.to_excel("PG9 AIRR.xlsx")
# write to a parquet file that is read by spark
airr_table.to_parquet("PG9 AIRR.parquet")
# write to a feather file that has rapid IO
airr_table.to_feather("PG9 AIRR.feather")
Attention
Because AirrTable is a subclass of pandas.DataFrame, you can use any pandas IO method to write to a file of your choosing. Note, however, that these are not official Rearrangement Schema compliant AIRR tables. They can be read only by software that supports those file types, or read back in by SADIE, and will probably not work in other software that supports the AIRR standard. These formats are nonetheless extremely useful for much larger files.
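If a non-compliant export later needs to go to AIRR-aware software, it has to be turned back into tab-separated text with the AIRR encodings. The sketch below shows the idea with the standard library only. The record is a hypothetical stand-in for the contents of a JSON export, the helper names are assumptions, and a fully compliant file would also need every required column, which this sketch does not check.

```python
import csv
import io
import json


def encode_value(value):
    """Encode a Python value for an AIRR TSV cell: booleans as T/F, None as empty."""
    if value is True:
        return "T"
    if value is False:
        return "F"
    if value is None:
        return ""
    return str(value)


def records_to_tsv(records):
    """Serialize a list of dicts (JSON 'records' orientation) as tab-separated text."""
    fieldnames = list(records[0])
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({key: encode_value(record.get(key)) for key in fieldnames})
    return buffer.getvalue()


# Hypothetical record standing in for one entry of a JSON export
records = json.loads(
    '[{"sequence_id": "PG9", "locus": "IGH", "productive": true, "d_score": 30.11}]'
)
print(records_to_tsv(records))
```

In practice it is simpler to read the file back into an AirrTable and call to_airr(), as described earlier on this page. The sketch only illustrates what distinguishes the compliant format from a generic export.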
Reading Files¶
To read in an AIRR file, we have to create an AirrTable object.
Reading an AIRR.tsv¶
You can read an official AIRR.tsv using the AirrTable.read_airr() method, or by reading it with pandas and casting the result to an AirrTable object.
import pandas as pd
from sadie.airr import AirrTable
# use AirrTable method to convert AirrTable.tsv to an AirrTable object
pg9_path = "PG9 AIRR.tsv.gz"
airr_table = AirrTable.read_airr(pg9_path)
print(type(airr_table), isinstance(airr_table, AirrTable))
# or use pandas read_csv method
airr_table_from_pandas = AirrTable(pd.read_csv(pg9_path, sep="\t"))
print(type(airr_table_from_pandas), isinstance(airr_table_from_pandas, AirrTable))
Outputs:
<class 'sadie.airr.airrtable.airrtable.AirrTable'> True
<class 'sadie.airr.airrtable.airrtable.AirrTable'> True
True # The airr tables are equal
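For a quick look at a plain or compressed AIRR file without pandas or SADIE, the standard library is enough. This is a minimal sketch under the assumption that the compression can be inferred from the file suffix. The functions open_airr_text and read_rows are hypothetical helpers, and the demo writes a throwaway file rather than using real SADIE output.

```python
import bz2
import csv
import gzip
import os
import tempfile


def open_airr_text(path):
    """Open a .tsv, .tsv.gz, or .tsv.bz2 file in text mode based on its suffix."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def read_rows(path):
    """Return the rows of a tab-separated file as a list of dicts."""
    with open_airr_text(path) as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demonstrate on a throwaway gzip file with three columns
content = "sequence_id\tlocus\tproductive\nPG9\tIGH\tT\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        handle.write(content)
    rows = read_rows(demo_path)

print(rows)  # [{'sequence_id': 'PG9', 'locus': 'IGH', 'productive': 'T'}]
```

Every value comes back as a string here. Converting columns to their schema types is the work that an AirrTable or an AIRR-aware reader does for you.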
Reading other file formats¶
Any other file format that pandas IO can read can be loaded with pandas and the resulting DataFrame passed to AirrTable.
import pandas as pd
from sadie.airr import AirrTable
# read from a csv file
airr_table_1 = AirrTable(pd.read_csv("PG9 AIRR.csv"))
# read from a json file
airr_table_2 = AirrTable(pd.read_json("PG9 AIRR.json", orient="records"))
# read from an excel file
airr_table_3 = AirrTable(pd.read_excel("PG9 AIRR.xlsx"))
# read from a parquet file, a format that spark can also read
airr_table_4 = AirrTable(pd.read_parquet("PG9 AIRR.parquet"))
# read from a feather file that has rapid IO
airr_table_5 = AirrTable(pd.read_feather("PG9 AIRR.feather"))