API

class tcr_deep_insight.SPECIES(value)[source]

An enumeration.

HUMAN = 'human'
MOUSE = 'mouse'
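
As a rough illustration of how a value-based enumeration like this behaves, the sketch below defines a stand-in enum with the same two members. The class name `SpeciesSketch` is hypothetical and is not the library class; only the member names and values are taken from the listing above.

```python
from enum import Enum


class SpeciesSketch(Enum):
    """Stand-in for illustration only; not tcr_deep_insight.SPECIES."""
    HUMAN = "human"
    MOUSE = "mouse"


# Looking a member up by its value returns the member itself.
human = SpeciesSketch("human")
is_same_member = human is SpeciesSketch.HUMAN

# An unknown value raises ValueError, which gives cheap input validation.
try:
    SpeciesSketch("rat")
    rejected = False
except ValueError:
    rejected = True
```
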
class tcr_deep_insight.TCR(cdr3a: str | None = None, cdr3b: str | None = None, trav: str | None = None, trbv: str | None = None, traj: str | None = None, trbj: str | None = None, individual: str | None = None, species: Literal['human', 'mouse'] = 'human')[source]

TCR class to store TCR information and provide utility functions.

Parameters:
  • cdr3a – CDR3 alpha sequence

  • cdr3b – CDR3 beta sequence

  • trav – TRAV gene

  • trbv – TRBV gene

  • traj – TRAJ gene

  • trbj – TRBJ gene

  • individual – Individual identifier

  • species – Species identifier

property cdr3a
property cdr3b
classmethod deserialize(string)[source]
classmethod from_string(string: str)[source]
property individual
serialize()[source]
property species
to_string()[source]
to_tcr_string()[source]
property traj
property trav
property trbj
property trbv

Preprocessing

tcr_deep_insight.pp.update_anndata(gex_adata: AnnData, gex_embedding_key: str = 'X_gex', tcr_embedding_key: str = 'X_tcr', joint_embedding_key: str = 'X_gex_tcr') None[source]

Update the adata with the embedding keys

Note

TCR information should be included in gex_adata.obs. This method modifies gex_adata in place and adds the following columns to .obs: tcr, CDR3a, CDR3b, TRAV, TRAJ, TRBV, TRBJ.

Parameters:
  • gex_adata – AnnData object

  • gex_embedding_key – embedding key for gex

  • tcr_embedding_key – embedding key for tcr

  • joint_embedding_key – embedding key for joint

tcr_deep_insight.pp.unique_tcr_by_individual(gex_adata: ~anndata._core.anndata.AnnData, embedding_key: str | ~typing.Iterable[str] = 'X_gex', label_key: str | None = None, additional_label_keys: ~typing.Iterable[str] = None, aggregate_func: ~typing.Callable = <function majority_vote>) AnnData[source]

Identify unique TCRs per individual and aggregate the GEX embedding by TCR. A unique TCR is defined by the combination of TRAV, TRAJ, TRBV, TRBJ, CDR3α, CDR3β and individual. The aggregated GEX embedding is added to tcr_adata.obsm[gex_embedding_key].

Note

“individual” should be in gex_adata.obs.columns.

Parameters:
  • gex_adata – AnnData object of gene expression data

  • embedding_key – Key(s) in adata.obsm where GEX embedding is stored. Default: ‘X_gex’

  • label_key – Key in adata.obs where TCR type labels are stored. Default: None according to the signature above. If a key such as ‘cell_type’ is given, it should be included in adata.obs.columns

  • additional_label_keys – Additional keys in adata.obs where TCR type labels are stored. Default: None

  • aggregate_func – Function to aggregate labels. Default: majority_vote

Returns:

TCR adata
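
The grouping described above can be sketched with plain Python records. This is not the library's implementation: the function names, the record layout, and the choice to average the embedding are assumptions made for illustration, and the gene names and sequences are toy values.

```python
from collections import Counter, defaultdict

# Fields that define a unique TCR, per the description above.
TCR_KEY_FIELDS = ("TRAV", "TRAJ", "TRBV", "TRBJ", "CDR3a", "CDR3b", "individual")


def majority_vote_sketch(labels):
    """Return the most frequent label (hypothetical stand-in for majority_vote)."""
    return Counter(labels).most_common(1)[0][0]


def unique_tcrs_sketch(cells, label_key="cell_type"):
    """Collapse per-cell records into one record per unique TCR and individual.

    Each cell is a dict holding TCR_KEY_FIELDS, a label, and an "embedding"
    list. Averaging the embedding is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for cell in cells:
        key = tuple(cell[field] for field in TCR_KEY_FIELDS)
        groups[key].append(cell)

    result = []
    for key, members in groups.items():
        dim = len(members[0]["embedding"])
        mean_embedding = [
            sum(m["embedding"][i] for m in members) / len(members)
            for i in range(dim)
        ]
        record = dict(zip(TCR_KEY_FIELDS, key))
        record[label_key] = majority_vote_sketch([m[label_key] for m in members])
        record["number_of_cells"] = len(members)
        record["embedding"] = mean_embedding
        result.append(record)
    return result


# Toy input: three cells of one clonotype in individual P1, one cell in P2.
shared = dict(TRAV="TRAV12-2", TRAJ="TRAJ42", TRBV="TRBV6-5", TRBJ="TRBJ2-1",
              CDR3a="CAVNGGSQGNLIF", CDR3b="CASSYSGGNEQFF")
cells = [
    dict(shared, individual="P1", cell_type="Tem", embedding=[1.0, 0.0]),
    dict(shared, individual="P1", cell_type="Tem", embedding=[3.0, 2.0]),
    dict(shared, individual="P1", cell_type="Tn", embedding=[2.0, 1.0]),
    dict(shared, individual="P2", cell_type="Tn", embedding=[0.0, 0.0]),
]
unique = unique_tcrs_sketch(cells)
```

The same receptor seen in two individuals stays as two records, because the individual is part of the key.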

Data

tcr_deep_insight.data.human_gex_reference_v2()[source]

Load the human gex reference v2. If the dataset is not found, it will be downloaded from Zenodo.

tcr_deep_insight.data.human_tcr_reference_v2()[source]

Load the human tcr reference v2. Can be generated from the human gex reference v2 via tdi.pp.unique_tcr_by_individual. If the dataset is not found, it will be downloaded from Zenodo.

tcr_deep_insight.data.mouse_gex_reference_v1()[source]

Load the mouse gex reference v1.

tcr_deep_insight.data.mouse_tcr_reference_v1()[source]

Load the mouse tcr reference v1. Can be generated from the mouse gex reference v1 via tdi.pp.unique_tcr_by_individual.

Tool

tcr_deep_insight.tl.add_tcr_pseudosequence_to_dataframe(df: DataFrame, species: bool = 'human') None[source]

Add TCR pseudosequence to dataframe

Parameters:

df – dataframe

Note:
This method modifies df in place and adds the following column:
  • tcr_pseudosequence

tcr_deep_insight.tl.add_pmhc_pseudosequence_to_dataframe(df: DataFrame) None[source]

Add PMHC pseudosequence to dataframe

Parameters:

df – dataframe

Note:
This method modifies df in place and adds the following columns:
  • hla_pseudosequence

  • pmhc_pseudosequence

tcr_deep_insight.tl.tcr_adata_to_datasets(adata: AnnData, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer | None = None, show_progress: bool = False) Dataset[source]

Convert adata to tcr datasets

Parameters:
  • adata – AnnData

  • tokenizer – tokenizer

Returns:

tcr datasets

tcr_deep_insight.tl.tcr_dataframe_to_datasets(df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, show_progress: bool = False) Dataset[source]

Convert dataframe to tcr datasets

Parameters:
  • df – dataframe

  • tokenizer – tokenizer

Returns:

tcr datasets

tcr_deep_insight.tl.to_embedding_tcr_only(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, tcr_dataset: Dataset, k: str = 'hidden_states', device: str = 'cuda', n_per_batch: int = 64, show_progress: bool = False) ndarray[source]

Get embedding from model

Parameters:
  • model – nn.Module. The TCR model.

  • tcr_dataset – datasets.arrow_dataset.Dataset. evaluation datasets.

  • k – str. ‘hidden_states’ or ‘last_hidden_state’.

  • device – str. ‘cuda’ or ‘cpu’. If ‘cuda’, use GPU. If ‘cpu’, use CPU.

  • n_per_batch – int. Number of samples per batch.

  • show_progress – bool. If True, show progress bar.

Returns:

embedding

tcr_deep_insight.tl.to_embedding_tcr_only_from_pandas(model: TRabModelingBertForVJCDR3 | TRabModelingBertForPseudoSequence, df: DataFrame, tokenizer: TRabTokenizerForVJCDR3 | Tokenizer, device: str, n_per_batch: int = 64, pooling: TCR_BERT_POOLING = 'mean')[source]
tcr_deep_insight.tl.get_pretrained_tcr_embedding(tcr_adata: AnnData, bert_config: Mapping[str, Any], checkpoint_path: str | PathLike, encoding: TCR_BERT_ENCODING = 'vjcdr3', pooling: TCR_BERT_POOLING = 'mean', species: Literal['human', 'mouse'] = 'human', pca_path: str | None = None, use_pca: bool = True, use_kernel_pca: bool = False, use_faiss_pca: bool = True, pca_n_components: int = 50, device: str = 'cuda:0', n_per_batch: int = 256)[source]

Get TCR embedding from pretrained BERT model

Note

This method modifies tcr_adata in place. It adds the following fields to tcr_adata.obsm:
  • X_tcr: TCR embedding

  • X_tcr_pca: PCA of TCR embedding

Parameters:
  • tcr_adata – AnnData object containing TCR data

  • bert_config – BERT config

  • checkpoint_path – Path to pretrained BERT model.

  • encoding – Encoding type of tcr sequence.

  • pooling – Pooling method for tcr representation.

  • species – Species.

  • pca_path – Path to PCA model, if previously saved.

  • use_pca – Whether to use PCA. Default: True

  • use_kernel_pca – Whether to use Kernel PCA. High memory required for large dataset.

  • use_faiss_pca – Whether to use Faiss PCA instead of scikit-learn.

  • pca_n_components – Number of PCA components.

  • device – Device for Faiss PCA computation.

  • n_per_batch – Number of samples per batch in getting TCR embeddings.

Clustering

tcr_deep_insight.tl.cluster_tcr(tcr_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, layer_norm: bool = True, max_distance: float = 4.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') TDIResult[source]

Cluster TCRs by joint TCR-GEX embedding. All TCRs will be used as cluster anchors.

Parameters:
  • tcr_adata – AnnData object containing TCR data

  • label_key – Key of the label to cluster. Should be in tcr_adata.obs.columns

  • use_gpu – Whether to use GPU.

  • gpu – GPU ID if use_gpu.

  • pure_label – Whether to constrain all TCRs in a cluster to have the same label.

  • pure_criteria – Purity criterion. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criterion. If pure_label is False, this function is used to determine whether a cluster is pure.

  • same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.

  • same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.

  • same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.

  • same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.

  • layer_norm – Whether to use LayerNorm on tcr embedding. Default: True

  • max_distance – Maximum TrGx distance. Default: 4.

  • max_cluster_size – Maximum cluster size for dTCR clusters.

  • use_gex – Whether to use GEX embedding for clustering.

  • filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.

  • nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size

  • calculate_perm_test – Whether to calculate p-values from permutation test. Default: True

  • species – Species name.

  • n_jobs – Number of threads for parallel processing.

  • faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS

Returns:

TDIResult containing clustered TCRs.

Note

The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
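
To make the roles of max_distance, max_cluster_size and pure_criteria more concrete, here is a minimal sketch of anchor-based clustering on toy vectors. It assumes plain Euclidean distance, brute-force search and a fraction-based purity rule. The function names and the `min_fraction` threshold are hypothetical, and the library's actual search (Faiss) and default criterion may differ.

```python
import math


def pure_criteria_sketch(labels, label, min_fraction=0.8):
    """Hypothetical purity rule: at least min_fraction of labels equal `label`."""
    if not labels:
        return False
    return sum(1 for x in labels if x == label) / len(labels) >= min_fraction


def cluster_around_anchors(embeddings, labels, max_distance=4.0, max_cluster_size=40):
    """Treat every point as an anchor and gather its neighbours, nearest first."""
    clusters = []
    for anchor, anchor_vec in enumerate(embeddings):
        neighbours = sorted(
            (math.dist(anchor_vec, vec), idx)
            for idx, vec in enumerate(embeddings)
        )
        # Keep neighbours within the distance cutoff, then cap the cluster size.
        members = [idx for d, idx in neighbours if d <= max_distance][:max_cluster_size]
        member_labels = [labels[i] for i in members]
        clusters.append({
            "anchor": anchor,
            "members": members,
            "is_pure": pure_criteria_sketch(member_labels, labels[anchor]),
        })
    return clusters


embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
labels = ["flu", "flu", "covid", "covid"]
clusters = cluster_around_anchors(embeddings, labels)
small_clusters = cluster_around_anchors(embeddings, labels, max_cluster_size=2)
```

Any callable that accepts a list of labels and a single label, as described for pure_criteria, could take the place of `pure_criteria_sketch` here.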

tcr_deep_insight.tl.cluster_tcr_from_reference(tcr_adata: ~anndata._core.anndata.AnnData, tcr_reference_adata: ~anndata._core.anndata.AnnData, label_key: str | None = None, include_hla_keys: ~typing.Iterable[str] | None = None, use_gpu: bool = False, gpu: int = 0, layer_norm: bool = True, pure_label: bool = True, pure_criteria: ~typing.Callable = <function default_pure_criteria>, same_trav: bool = False, same_trbv: bool = False, same_cdr3a_length: bool = False, same_cdr3b_length: bool = False, max_distance: float = 3.0, max_cluster_size: int = 40, use_gex: bool = True, filter_intersection_fraction: float = 0.7, nk: int = -1, calculate_perm_test: bool = True, n_jobs: int = 1, species: ~tcr_deep_insight.utils._definitions.SPECIES = 'human', faiss_index_backend: ~tcr_deep_insight.tool._constants.FAISS_INDEX_BACKEND = 'kmeans') TDIResult[source]

Cluster TCRs from reference. Only TCRs in the query dataset will be used as cluster anchors.

Parameters:
  • tcr_adata – AnnData object containing TCR data

  • tcr_reference_adata – AnnData object containing reference TCR data

  • label_key – Key of the label to cluster. Should be in tcr_adata.obs

  • use_gpu – Whether to use GPU.

  • gpu – GPU ID if use_gpu.

  • pure_label – Whether to constrain all TCRs in a cluster to have the same label.

  • pure_criteria – Purity criterion. A function that takes two arguments, a list of labels and a label, and checks whether the list satisfies the criterion. If pure_label is False, this function is used to determine whether a cluster is pure.

  • same_trav – Whether to constrain all TCRs in a cluster to have the same TRAV gene.

  • same_trbv – Whether to constrain all TCRs in a cluster to have the same TRBV gene.

  • same_cdr3a_length – Whether to constrain all TCRs in a cluster to have the same CDR3a length.

  • same_cdr3b_length – Whether to constrain all TCRs in a cluster to have the same CDR3b length.

  • layer_norm – Whether to use LayerNorm on tcr embedding. Default: True

  • max_distance – Maximum TrGx distance. Default: 3.

  • max_cluster_size – Maximum cluster size for dTCR clusters.

  • use_gex – Whether to use GEX embedding for clustering.

  • filter_intersection_fraction – Filter intersection fraction in pruning clusters that contain overlapping TCRs.

  • nk – Number of nearest neighbors for background comparison. Default: -1, which means background neighbors equal to cluster size

  • calculate_perm_test – Whether to calculate the Morisita-Horn permutation test for TCR clusters.

  • species – Species name.

  • n_jobs – Number of threads for parallel processing. Default: 1

  • faiss_index_backend – Faiss index backend. Default: FAISS_INDEX_BACKEND.KMEANS

Returns:

TDIResult object containing clustered TCRs.

Note

The gpu parameter indicates GPU to use for clustering. If gpu is 0, CPU is used.
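
The difference from cluster_tcr is which TCRs act as anchors. The sketch below illustrates one plausible reading on toy vectors: only query points seed clusters, while neighbours are drawn from the query and the reference together. Searching the combined pool is an assumption of this sketch, as are the function name and the brute-force Euclidean search.

```python
import math


def cluster_query_against_reference(query, reference, max_distance=3.0,
                                    max_cluster_size=40):
    """Use only query points as anchors; search neighbours in query + reference."""
    pool = [("query", i, vec) for i, vec in enumerate(query)]
    pool += [("reference", i, vec) for i, vec in enumerate(reference)]

    clusters = []
    for anchor_index, anchor_vec in enumerate(query):
        hits = sorted(
            (math.dist(anchor_vec, vec), source, idx)
            for source, idx, vec in pool
        )
        # Keep hits within the cutoff, nearest first, capped at the size limit.
        members = [(s, i) for d, s, i in hits if d <= max_distance][:max_cluster_size]
        clusters.append({"anchor": ("query", anchor_index), "members": members})
    return clusters


query = [[0.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 2.5], [5.0, 5.0]]
clusters = cluster_query_against_reference(query, reference)
```

With one query point there is exactly one cluster, however many reference points exist.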

tcr_deep_insight.tl.inject_labels_for_tcr_cluster_adata(reference_data: ~anndata._core.anndata.AnnData | ~pandas.core.frame.DataFrame, tcr_cluster_adata: ~anndata._core.anndata.AnnData, label_key: str, map_function: ~typing.Callable = <function majority_vote>)[source]

Inject labels for tcr_cluster_adata based on reference_data

Parameters:
  • reference_data – sc.AnnData or pandas.DataFrame. Reference object containing labels

  • tcr_cluster_adata – sc.AnnData

  • label_key – str. Key of the label to use for clustering in reference_data.obs.columns

  • map_function – Callable. Default: function that returns the most frequent label.

Note:
This method modifies tcr_cluster_adata in place and adds the following column:
  • label_key: The most frequent label in the cluster

class tcr_deep_insight.tl.TDIResult(_data: AnnData, _tcr_df: DataFrame | None = None, _tcr_adata: AnnData | None = None, _gex_adata: AnnData | None = None, _cluster_label: str | None = None, faiss_index: IndexFlatL2 | None = None, low_memory: bool = False)[source]

Bases: object

A class to store the TDI clustering result

Parameters:
  • _data – the clustering result

  • _tcr_df – the tcr dataframe

  • _tcr_adata – the tcr anndata

  • _gex_adata – the gex anndata

  • _cluster_label – the cluster label

  • faiss_index – the faiss index

  • low_memory – whether to use low memory mode

property D
property I
calculate_cluster_additional_information()[source]
property cluster_label
property data
get_tcrs_by_cluster_index(cluster_index: int, _n_after: int = 0) List[str][source]

Get the tcrs for a specific cluster

Parameters:
  • cluster_index – the cluster index

  • _n_after – the number of tcrs to skip

get_tcrs_for_cluster(label: Mapping[str, str] | Mapping[str, List[str]] | None = None, rank: int = 0, rank_by: Literal['convergence', 'disease_association'] = 'convergence', min_unique_tcr_number: int = 4, min_individual_number: int = 2, min_cell_number: int = 10, min_tcr_convergence_score: float | None = None, min_disease_association_score: float | None = None, return_background_tcrs: bool = False, additional_label_key_values: Dict[str, List[str]] | None = None)[source]

Get the tcrs for a specific cluster

Parameters:
  • label – the cluster label

  • rank – the rank of the tcrs to return

  • rank_by – the metric to rank the tcrs

  • min_unique_tcr_number – the minimum number of unique tcrs in the cluster

  • min_individual_number – the minimum number of individuals in the cluster

  • min_cell_number – the minimum number of cells in the cluster

  • min_tcr_convergence_score – the minimum convergence score

  • min_disease_association_score – the minimum disease specificity score

  • return_background_tcrs – whether to return other tcrs in the cluster

  • additional_label_key_values – additional label key values to filter the cluster

Returns:

a dictionary containing the tcrs and their metadata
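
The filtering and ranking parameters above can be pictured with a small stand-alone sketch over a list of cluster summaries. The dictionary field names, the function name, and the choice to rank higher scores first are assumptions for illustration. They are not the TDIResult internals.

```python
def select_cluster_sketch(clusters, rank=0, rank_by="convergence",
                          min_unique_tcr_number=4, min_individual_number=2,
                          min_cell_number=10):
    """Filter clusters by size thresholds, sort by a score, return the rank-th."""
    # Hypothetical field names for the two ranking metrics.
    score_field = {
        "convergence": "tcr_convergence_score",
        "disease_association": "disease_association_score",
    }[rank_by]
    eligible = [
        c for c in clusters
        if c["unique_tcr_number"] >= min_unique_tcr_number
        and c["individual_number"] >= min_individual_number
        and c["cell_number"] >= min_cell_number
    ]
    ranked = sorted(eligible, key=lambda c: c[score_field], reverse=True)
    return ranked[rank] if rank < len(ranked) else None


toy_clusters = [
    {"name": "A", "unique_tcr_number": 5, "individual_number": 3, "cell_number": 20,
     "tcr_convergence_score": 0.4, "disease_association_score": 0.9},
    {"name": "B", "unique_tcr_number": 8, "individual_number": 2, "cell_number": 15,
     "tcr_convergence_score": 0.7, "disease_association_score": 0.2},
    # Cluster C scores highest but fails the unique-TCR and individual thresholds.
    {"name": "C", "unique_tcr_number": 2, "individual_number": 1, "cell_number": 50,
     "tcr_convergence_score": 0.95, "disease_association_score": 0.95},
]
top_by_convergence = select_cluster_sketch(toy_clusters, rank=0)
second_by_convergence = select_cluster_sketch(toy_clusters, rank=1)
top_by_disease = select_cluster_sketch(toy_clusters, rank_by="disease_association")
```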

property gex_adata
classmethod load_from_disk(save_path: str, tcr_data_path: str | None = None, gex_adata_path: str | None = None)[source]

Load the cluster result from disk

Parameters:
  • save_path – the path to load the cluster result

  • tcr_data_path – the path to load the tcr data

  • gex_adata_path – the path to load the gex data

save_to_disk(save_path, save_cluster_result_as_csv=True, save_tcr_data=True, save_gex_data=True)[source]

Save the cluster result to disk

Parameters:
  • save_path – the path to save the cluster result

  • save_cluster_result_as_csv – whether to save the cluster result as csv files

  • save_tcr_data – whether to save the tcr data

  • save_gex_data – whether to save the gex data

select(indices)[source]
property tcr_adata
property tcr_df
to_pandas_dataframe_cluster_index()[source]
to_pandas_dataframe_tcr(rank_by: Literal['convergence', 'disease_association', 'individual', 'unique_tcr'] = 'convergence', return_background_tcrs: bool = False)[source]

Convert the cluster result to a pandas dataframe.

Parameters:
  • rank_by – the metric to rank the tcrs

  • return_background_tcrs – whether to return background tcrs for each cluster

Returns:

a pandas dataframe containing the cluster result

Clustering options

class tcr_deep_insight.tl.FAISS_INDEX_BACKEND(value)[source]

An enumeration.

FLAT = 'flat'
KMEANS = 'kmeans'

Models

tcr_deep_insight.model.modeling_bert.get_human_config(bert_type: Literal['tiny', 'small', 'base'] = 'small', vocab_size: int | None = None, alibi_starting_size: int = 512) BertConfig[source]

Get the configuration for the human TCR BERT

Parameters:
  • bert_type – The size of the BERT model. Must be one of ‘tiny’, ‘small’, or ‘base’

  • vocab_size – The size of the vocabulary

  • alibi_starting_size – The size of the input sequence

Returns:

The configuration for the human TCR BERT

class tcr_deep_insight.model.TRabModelingBertForPseudoSequence(bert_config: BertConfig, pooling: Callable | Literal['cls', 'mean', 'max', 'cdr3a', 'cdr3b', 'weighted'] = 'mean', pooling_weight: Tensor | None = None, labels_number: int = 1, use_triton: bool = False, device='cuda')[source]

Bases: Module

forward(*, input_ids: Tensor, attention_mask: Tensor, labels: Tensor | None = None, output_hidden_states=True)[source]

Forward pass of the model

Parameters:
  • input_ids – Input ids

  • attention_mask – Attention mask

  • labels – Labels

  • output_hidden_states – Whether to return hidden states

Returns:

Output of the model

class tcr_deep_insight.model.TRabModelingBertForVJCDR3(bert_config: BertConfig, pooling: Literal['cls', 'mean', 'max', 'pool', 'trb', 'tra', 'weighted'] = 'mean', pooling_cls_position: int = 0, pooling_weight=(0.1, 0.9), labels_number: int = 1, device='cuda')[source]

Bases: ModuleBase

forward(*, input_ids: Tensor, attention_mask: Tensor, labels: Tensor, token_type_ids: Tensor, output_hidden_states=True)[source]

Forward pass of the model

Parameters:
  • input_ids – Input ids

  • attention_mask – Attention mask

  • labels – Labels

  • token_type_ids – Token type ids

Returns:

Output of the model

tcr_deep_insight.model.GEXModelingVAE

alias of scAtlasVAE

Model options

class tcr_deep_insight.model.TCR_BERT_ENCODING(value)[source]

An enumeration.

CDR123 = 'cdr123'
VJCDR3 = 'vjcdr3'
class tcr_deep_insight.model.TCR_BERT_POOLING(value)[source]

An enumeration.

CLS = 'cls'
MAX = 'max'
MEAN = 'mean'
POOL = 'pool'
SUM = 'sum'
TRA = 'tra'
TRB = 'trb'
WEIGHTED = 'weighted'
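
For readers unfamiliar with pooling, the sketch below shows the conventional meaning of four of these options on a toy sequence of per-token vectors. It is an assumption that the library follows these conventions. The function name is hypothetical, the CLS token is assumed to sit at position 0, and the remaining options ('pool', 'tra', 'trb', 'weighted') depend on model internals and are not covered.

```python
def pool_sketch(hidden_states, attention_mask, pooling="mean"):
    """Pool per-token vectors into one vector, ignoring masked (padding) tokens."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    dim = len(hidden_states[0])
    if pooling == "cls":
        # Assumes the CLS token is the first token of the sequence.
        return list(hidden_states[0])
    if pooling == "mean":
        return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]
    if pooling == "max":
        return [max(v[i] for v in kept) for i in range(dim)]
    if pooling == "sum":
        return [sum(v[i] for v in kept) for i in range(dim)]
    raise ValueError(f"unsupported pooling in this sketch: {pooling}")


hidden_states = [[1.0, 4.0], [3.0, 0.0], [9.0, 9.0]]  # last token is padding
attention_mask = [1, 1, 0]

mean_vec = pool_sketch(hidden_states, attention_mask, "mean")
max_vec = pool_sketch(hidden_states, attention_mask, "max")
sum_vec = pool_sketch(hidden_states, attention_mask, "sum")
cls_vec = pool_sketch(hidden_states, attention_mask, "cls")
```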

Training Utilities

class tcr_deep_insight.model.TRabModelingBertForVJCDR3Trainer(model: TRabModelingBertForVJCDR3, collator: TRabCollatorForVJCDR3, train_dataset: Dataset | None = None, test_dataset: Dataset | None = None, optimizers: Tuple[Optimizer, LambdaLR] = (None, None), loss_weight: Mapping[str, float] = {'reconstruction_loss': 1}, device: str = 'cuda')[source]

Bases: TrainerBase

evaluate(*, n_per_batch: int = 10, max_train_sequence: int | None = None, max_test_sequence: int | None = None, show_progress: bool = False)[source]

Main evaluation entry point

fit(*, max_epoch: int, max_train_sequence: int = 0, n_per_batch: int = 10, shuffle: bool = False, balance_label: bool = False, label_weight: Tensor | None = None, show_progress: bool = False, early_stopping: bool = False)[source]

Main training entry point

class tcr_deep_insight.model.TRabTokenizerForVJCDR3(*, tra_max_length: int, trb_max_length: int, pad_token: str | None = None, unk_token: str | None = None, mask_token: str | None = None, cls_token: str | None = None, sep_token: str | None = None, species: Literal['human', 'mouse'] = 'human', **kwargs)[source]

Bases: AminoAcidTokenizer

Tokenizer for TRA and TRB sequence. Encode V,J genes into tokens, and CDR3 into amino acids

convert_ids_to_tokens(ids: Tensor)[source]
convert_tokens_to_ids(sequence: List[Tuple[str]] | Tuple[str], alpha_vj: List[Tuple[str]] | Tuple[str] | None = None, beta_vj: List[Tuple[str]] | Tuple[str] | None = None)[source]
to_dataset(df: DataFrame | None = None, ids: Iterable[str] | None = None, alpha_chains: Iterable[str] | None = None, beta_chains: Iterable[str] | None = None, alpha_v_genes: Iterable[str] | None = None, alpha_j_genes: Iterable[str] | None = None, beta_v_genes: Iterable[str] | None = None, beta_j_genes: Iterable[str] | None = None, pairing: Iterable[int] | None = None, split: bool = False)[source]
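
The idea of encoding V and J genes as single tokens and the CDR3 as one token per amino acid can be sketched as follows. Everything here is hypothetical: the toy vocabulary, the special tokens, the V, CDR3, J token order and the function name are assumptions for illustration. None of it is the tokenizer's real vocabulary or layout.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SPECIAL_TOKENS = ["[PAD]", "[UNK]"]
TOY_GENES = ["TRAV12-2", "TRAJ42", "TRBV6-5", "TRBJ2-1"]  # toy subset only

# One id per special token, amino acid, and gene name.
vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS) + TOY_GENES)}


def encode_chain_sketch(v_gene, cdr3, j_gene, max_length):
    """Encode one chain as [V gene] + CDR3 residues + [J gene], padded to max_length."""
    tokens = [v_gene] + list(cdr3) + [j_gene]
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens][:max_length]
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    ids = ids + [vocab["[PAD]"]] * n_pad
    return ids, attention_mask


ids, mask = encode_chain_sketch("TRAV12-2", "CAVF", "TRAJ42", max_length=8)
unknown_ids, _ = encode_chain_sketch("TRAV99", "CAVF", "TRAJ42", max_length=8)
```

Padding to a fixed length mirrors the tra_max_length and trb_max_length arguments, which suggest that each chain is padded or truncated separately.
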
class tcr_deep_insight.model.TRabCollatorForVJCDR3(tra_max_length: int, trb_max_length: int, mask_token_id: int = 22, mlm_probability: float = 0.1, mask_trb_probability: float = 0.5, species: Literal['human', 'mouse'] = 'human')[source]

Bases: object

Plotting

tcr_deep_insight.pl.set_plotting_params(dpi: int, fontsize: int = 12, fontfamily: str = 'Arial', linewidth: float = 0.5)[source]

Set default plotting parameters for matplotlib.

Parameters:
  • dpi – dpi for saving figures

  • fontsize – default fontsize

  • fontfamily – default fontfamily

  • linewidth – default linewidth

tcr_deep_insight.pl.create_fig(figsize=(8, 4)) Tuple[Figure, Axes][source]

Create a figure with a single axis.

Parameters:

figsize – figure size. Default: (8, 4)

Returns:

matplotlib.figure.Figure, matplotlib.axes.Axes

tcr_deep_insight.pl.create_subplots(nrow, ncol, figsize=(8, 8), gridspec_kw={}) Tuple[Figure, ndarray][source]

Create a figure with multiple axes.

Parameters:
  • nrow – number of rows

  • ncol – number of columns

  • figsize – figure size. Default: (8, 8)

  • gridspec_kw – gridspec_kw. Default: {}

Returns:

matplotlib.figure.Figure, numpy.ndarray of matplotlib.axes.Axes

tcr_deep_insight.pl.plot_cdr3_sequence(sequences: List[str], alignment: bool = False, labels: List[str] | None = None, labels_palette: Mapping[str, Any] | None = None, labels_postfix: str | None = None, labels_postfix_palette: Mapping[str, Any] | None = None, ax: Axes | None = None) Tuple[Figure, Axes][source]

Plot CDR3 sequences.

Parameters:
  • sequences – a list of CDR3 sequences

  • alignment – whether to align the sequences. Default: False

  • labels – a list of labels. Default: None

  • labels_palette – a dictionary of labels and colors. Default: None

  • ax – matplotlib.axes.Axes. Default: None

Returns:

matplotlib.figure.Figure, matplotlib.axes.Axes

tcr_deep_insight.pl.plot_selected_tcrs(tcr_cluster_result: TDIResult, color: str, tcrs: List[str], tcrs_background: List[str], palette: dict | None = None)[source]

Plot the tcrs on the umap of the gex data, with the TCRs as a pie chart and logo plot

Parameters:
  • tcr_cluster_result – TDIResult

  • color – str

  • tcrs – list

  • palette – dict (optional)

Returns:

fig, ax

Note

You should have mafft installed in your system to use this function

Additional Utilities