10x pipeline¶

After you have a validated merged dataframe in from validation, you can begin the 10X pipeline which consists of demultiplexing, vdj, and feature counting.

full pipeline

CellRanger 7.1.0

The cellranger version in this pipeline is 7.1.0

Flow¶

:material-console-line: Command Line Usage :material-api: Python

The following will analyze the flow path and output a dataframe that will be used in analysis. It computes each count of each gate that are useful in frequency analysis.

$ g00x g003 pipeline flow -o g003/G003/output/flow /path/to/flow

import pandas as pd
from g00x.data import Data
from g00x.flow.flow import parse_flow_data
from pathlib import Path

data = ctx.obj["data"]
folder = 'path/to/flow'
flow_data = parse_flow_data(data, folder)
out = Path(out)
output_feather = Path(out.parent / (out.stem + ".feather"))
output_csv = Path(out.parent / (out.stem + ".csv"))

Here is what the flow dataframe will look like.

	run_purpose	run_date	sort_id	ptid	group	weeks	visit_id	probe_set	sample_type	sort_software_dv	sort_file_type	sample_tube	gate	phenotype	value_type	extention	file_path	file_subset	value	branch	easy_name	notes	sort_pool	hashtag
119	PreS	2022-08-25	S6C	G003831	2	-5	V091	eODGT8	PBMC	DV	Summary	T1	P11	IgD+/Antigen++	count	.csv	['g003/G003/sorting/G003/Prescreens/Prescreen_RunDate220825_UploadDate221021/PopulationSummaryFilesFromDV/PreS_220825_S6C_G003831_V091_eODGT8_PBMC_DV_Summary_T1_a.csv']	['a']	50	IgD+	antigen_pos_igd_pos_b_cells	nan
2461	PreS	2022-11-04	S6C	G003136	2	8	V200	eODGT8	PBMC	DV	Summary	T1	P13	IgD+/KO-	count	.csv	['g003/G003/sorting/G003/Prescreens/Prescreen_RunDate221104_UploadDate221129/PopulationSummaryFilesFromDV/PreS_221104_S6C_G003136_V200_eODGT8_PBMC_DV_Summary_T1_a.csv']	['a']	190682	IgD+		nan
2754	PreS	2022-11-04	S6C	G003947	2	8	V200	Cg28v2	PBMC	DV	Summary	T1	P31	IgG-IgM-/IgA+/KO-	count	.csv	['g003/G003/sorting/G003/Prescreens/Prescreen_RunDate221104_UploadDate221129/PopulationSummaryFilesFromDV/PreS_221104_S6C_G003947_V200_Cg28v2_PBMC_DV_Summary_T1_a.csv']	['a']	10498	IgA+		nan
4485	Sort	2022-10-25	S6C	G003831	2	4	V160	eODGT8	PBMC	DV	Summary	T1	P27	IgG-IgM-IgD-	count	.csv	['g003/G003/sorting/G003/Sorts/Sort_RunDate221025_UploadDate221101/ClinicalSamples/PopulationSummaryFilesFromDV/Sort_221025_S6C_G003831_V160_eODGT8_PBMC_HT02_DV_Summary_T1_P03_a.csv']	['a']	100696			nan	P03	HT02
2343	PreS	2022-11-04	S6C	G003136	2	-5	V091	eODGT8	PBMC	DV	Summary	T1	P12	IgD+/Antigen++/KO-	count	.csv	['g003/G003/sorting/G003/Prescreens/Prescreen_RunDate221104_UploadDate221129/PopulationSummaryFilesFromDV/PreS_221104_S6C_G003136_V091_eODGT8_PBMC_DV_Summary_T1_a.csv']	['a']	47	IgD+	epitope_pos_igd_pos_b_cells_rev	nan

Demultiplexing¶

The input to the demultiplexing part of the pipeline will be the sequencing and flow file paths

:material-console-line: Command Line Usage :material-api: Python

$ g00x g003 pipeline demultiplex -o g003/G003/output/demultiplex -f /path/to/flow -s /path/to/sequencing

import pandas as pd
from g00x.data import Data
from g00x.sequencing.tenX import run_demultiplex

flow_path = "path/to/flow"
sequencing_path = "path/to/sequencing"
data = Data()
merged_dataframe: pd.DataFrame = merge_flow_and_sequencing(data, flow_path, sequencing_path)  # type: ignore
demultiplex_df = run_demultiplex(data, merged_dataframe, out, overwrite)
demultiplex_df.to_csv("demultiplex.csv")
demultiplex_df.to_csv("demultiplex.feather")

The demulitplex algorithm will add the following fields in demultiplex output

Column	Definition
vdj_run_dir	The full path to the vdj run directory, e.g. the Illumina directory
cso_run_dir	The full path to the cso run directory, e.g. the Illumina directory
vdj_sample_name	The unique vdj sample name given to each row
cso_sample_name	The unique cso sample name given to each row
vdj_fastq_dir	The full path to the vdj fastq directory
cso_fastq_dir	The full path to the cso fastq directory

An example demultiplexing output dataframe is found below.

	ptid	group	weeks	visit_id	probe_set	sample_type	run_date	sort_pool	hashtag	run_dir_path	pool_number	sorted_date	vdj_sequencing_replicate	cso_sequencing_replicate	vdj_lirary_replicate	cso_library_replicate	bio_replicate	vdj_index	feature_index	vdj_run_id	cso_run_id	vdj_run_dir_path	cso_run_dir_path	vdj_fastq_dir	vdj_sample_name	cso_fastq_dir	cso_sample_name
0	G003516	1	-5	V091	eODGT8	PBMC	2022-09-27	P01	HT01	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002	P01	2022-09-27	0	0	0	0	0	SI-TT-D6	SI-TN-D6	221006_VH00497_31_AAAVKCLHV	221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/working_directory/demultiplexed/29cc0e71cb9200226957921707138c5c/outs/fastq_path	vdj-SI-TT-D6	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/working_directory/demultiplexed/29cc0e71cb9200226957921707138c5c/outs/fastq_path	cso-SI-TN-D6
1	G003516	1	4	V160	eODGT8	PBMC	2022-09-27	P01	HT02	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002	P01	2022-09-27	0	0	0	0	0	SI-TT-D6	SI-TN-D6	221006_VH00497_31_AAAVKCLHV	221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/221006_VH00497_31_AAAVKCLHV	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/working_directory/demultiplexed/29cc0e71cb9200226957921707138c5c/outs/fastq_path	vdj-SI-TT-D6	/mnt/fsx/workspace/jwillis/repos/G00x/g003/G003/sequencing/G003/run0002/working_directory/demultiplexed/29cc0e71cb9200226957921707138c5c/outs/fastq_path	cso-SI-TN-D6

VDJ¶

The output of demultiplexing pipeline will be used as input, see the pipeline

:material-console-line: Command Line Usage :material-api: Python

The following will run the VDJ pipeline from the demultiplex dataframe and output vdj.feather inside of the output folder.

Run it from demultiplexed dataframe
$ g00x g003 pipeline vdj -o g003/G003/output/vdj -d output/demultiplexed.feather

You can run the same with the following python code.

from g00x.sequencing.tenX import run_vdj
from g00x.data import Data

data = Data()
demultiplex_dataframe = pd.read_feather(demultiplex_dataframe_path)
run_vdj(data, demultiplex_dataframe, out)

The output vdj dataframe will only contain one additional field

Column	Definition
vdj_output	The full path to the vdj output folder

CSO¶

This CSO pipeline will run the cellranger count part and output a feature matrix. It also uses the demultiplex.feather as input.

:material-console-line: Command Line Usage :material-api: Python

The following will run the CSO pipeline from the demultiplex dataframe and output cso.feather inside of the output folder.

Run it from demultiplexed dataframe
$ g00x g003 pipeline cso -o g003/G003/output/cso -d output/demultiplexed.feather

You can run the same with the following python code.

from g00x.sequencing.tenX import run_cso
from g00x.data import Data

data = Data()
demultiplex_dataframe = pd.read_feather(demultiplex_dataframe_path)
run_cso(data, demultiplex_dataframe, out)

The output cso dataframe will only contain one additional field

Column	Definition
cso_output	The full path to the cso output folder

AIRR¶

The output of the VDJ and CSO can now be combined to get a final sequencing dataframe. This is the final sequencing dataframe that will be used for the analysis.

The AIRR protocol does the following.

It will...

Run SADIE AIRR on the VDJ contigs to get a formalized AIRR dataframe.
Analyze the feature barcodes and assign cellids to the correct participant.
Add the PubIDs to the AIRR dataframe.
Run mutational analysis
Run iGL assignment to all sequences
Determine if sequence is VRC01 class
Add mutational sets (find what mutations are VRC01-like)
Cluster sequences
Determine isotype

:material-console-line: Command Line Usage :material-api: Python

The following will run the AIRR pipeline from the VDJ and CSO dataframes and output airr.feather inside of the output folder.

Run it from VDJ and CSO dataframes
$ g00x g003 pipeline airr -o g003/G003/output/airr -v output/vdj.feather -c output/cso.feather

You can run the same with the following python code.

from g00x.data import Data
from g00x.sequencing.airr import run_airr

vdj_out = 'output/vdj.feather'
cso_out = 'output/cso.feather'
data = Data()
vdj_dataframe = pd.read_feather(vdj_out)
cso_dataframe = pd.read_feather(cso_out)
output_df = run_airr(data, vdj_dataframe, cso_dataframe, out, overwrite)

An output dataframe will take the following: