Clustering¶

Clustering is handled using an input of AirrTables. Consider the following example.

import pandas as pd

from sadie.airr.airrtable import LinkedAirrTable
from sadie.cluster import Cluster

# read in the bNAb data
df = pd.read_feather("hiv_bnabs.feather")

# this will create a linked heavy and light airr
# table where the unique name is called the cellid
linked_airr_table = LinkedAirrTable(df, key_column="cellid")


# setup api
cluster_api = Cluster(
    linked_airr_table,
    linkage="single",
    groupby=None,
    lookup=["cdr1_aa_heavy", "cdr2_aa_heavy", "cdr3_aa_heavy", "cdr1_aa_light", "cdr2_aa_light", "cdr3_aa_light"],
    pad_somatic=True,
)

# run the clustering with distance cutoff of 5
clustered_table = cluster_api.cluster(5)
# a column named cluster has been added to the airr table

# print out clusters but first sort by biggest cluster
gb = pd.DataFrame(clustered_table.set_index("cellid")).groupby(["cluster"])
for cluster in gb.size().sort_values(ascending=False).index:
    members = clustered_table.query(f"cluster=={cluster}")["cellid"].to_list()
    print("cluster name:", cluster, "contains", members, end="\n\n")

will print out:

cluster name: 35 contains ['PCT64-18B', 'PCT64-18C', 'PCT64-18D', 'PCT64-18F', 'PCT64-24F', 'PCT64-24G', 'PCT64-24H', 'PCT64-35B', 'PCT64-35C', 'PCT64-35D', 'PCT64-35E', 'PCT64-35F', 'PCT64-35G', 'PCT64-35H', 'PCT64-35I', 'PCT64-35K', 'PCT64-35N', 'PCT64-35O', 'PCT64-35S']

cluster name: 15 contains ['VRC26.08', 'VRC26.09', 'VRC26.20', 'VRC26.21', 'VRC26.22', 'VRC26.26', 'VRC26.27', 'VRC26.28', 'VRC26.29', 'VRC26.30', 'VRC26.31']

cluster name: 0 contains ['DH270.1', 'DH270.2', 'DH270.3', 'DH270.4', 'DH270.5', 'DH270.6', 'DH270.IA1', 'DH270.IA2', 'DH270.IA3', 'DH270.IA4']

cluster name: 1 contains ['VRC38.02', 'VRC38.03', 'VRC38.04', 'VRC38.05', 'VRC38.08', 'VRC38.09', 'VRC38.10', 'VRC38.11']

cluster name: 34 contains ['10J4', '10M6', '13I10', '2N5', '35O22', '4O20', '7B9', '7K3']

cluster name: 2 contains ['PGT151', 'PGT152', 'PGT153', 'PGT154', 'PGT155', 'PGT156', 'PGT157', 'PGT158']

...

So what happened here?

We read in a LinkedAirrTable, which has the heavy and light table paired together. This is also a Pandas DataFrame, so we can use any Pandas method.
Instantiate a Cluster object that contains the following parameters.

Parameter	Description
`linkage`	The hierarchical clustering linkage method. single, average or complete linkage
`groupby`	Do we pre-group by any fields. e.g. v_call_heavy will only cluster things that are the same v_call_heavy.
`lookup`	What fields to take the Levenshtein distance to make a cluster distance?
`pad_somatic`	Any somatic mutations that are present in both sequences will be subtracted from the total distance. This is useful for somatic hypermutation analysis.

Run cluseter_api.cluster(distace) where distance is a distance cutoff in hierarchical clustering. This will return an airrtable with the a field called cluster.

A complete example¶

You will probably run this in context of a larger analysis where you use airrtables to create an Airr Table.

from sadie.airr import Airr
from sadie.airr.methods import run_mutational_analysis
from sadie.cluster import Cluster

fasta_file = "catnap.fasta"
airr_api = Airr("human", adaptable=True)
airr_table = airr_api.run_fasta(fasta_file)
airr_table_mutational = run_mutational_analysis(airr_table, "kabat", run_multiproc=False)

cluster_api = Cluster(airr_table_mutational, lookup=["cdr1_aa", "cdr2_aa", "cdr3_aa"], pad_somatic=True)

airr_table_with_cluster = cluster_api.cluster(5)
airr_table_with_cluster.sort_values("cluster")
print(airr_table_with_cluster.sort_values("cluster")[["sequence_id", "cluster"]].head(5))

	sequence_id	cluster
45	PCT64-24E_MF565875_PCT64-24E-HC_anti-HIV_immunoglobulin_heavy_chain_variable_region	0
43	PCT64-24A_MF565871_PCT64-24A-HC_anti-HIV_immunoglobulin_heavy_chain_variable_region	0
44	PCT64-24B_MF565872_PCT64-24B-HC_anti-HIV_immunoglobulin_heavy_chain_variable_region	0
58	PCT64-35M_MF565891_PCT64-35M-HC_anti-HIV_immunoglobulin_heavy_chain_variable_region	0
8	CH27_patent_20160244510_CH27_heavy_chain	1

...

In the example above, we did the following:

Read in a fasta file.
Annotated a fasta file
Ran mutational analysis on the Airr Table
Clustered the Airr Table
Printed out the cluster name and sequence id