Skip to content

Validation

Validation

The most important part of this clinical trial, other than its working, is that everything is validated. That means that the data is validated, the code is validated, and the results are validated. This is a crucial part of the process and should not be skipped.

FACS/Sorting

These are instructions on validating the file structure containing the FACS/Sorting data before uploading to Box.

#Run the validator for flow from the command line
$ g00x g002 validate flow my_path/to/box/G002/

If you got your data from AWS, your command line would look like the following

#Run the validator for flow given the AWS structure
$ g00x g002 validate g002/G002/sorting/G002
   from g00x.flow.flow import parse_flow_data
   from g00x.data import Data

   # data
data = Data()

   # parse flow data into a dataframe
validated_structure = parse_flow_data(data,"g002/G002/sorting/G002") # this returns a dataframe

File structure

Your folder structure on Box should look like this.

# If you used the above commands to sync the data. The folder structure will look like this

g002/G002/sorting
└── G002
    ├── Prescreens
       ├── Prescreen_RunDate220825_UploadDate221021
       ├── Prescreen_RunDate220826_UploadDate221021
    ....
    └── Sorts
        ├── Sort_RunDate220927_UploadDate221013
        ├── Sort_RunDate220928_UploadDate221014
        ├── Sort_RunDate220929_UploadDate221014
        ├── Sort_RunDate220930_UploadDate221014
    ...

Full file structure

Download full architecture via Madhu Prabhakaran

Sequencing files

To validate the sequencing files, they must be merged with the flow data. Thus, you will need both sequencing and flow files. This is because most of the metadata is housed in the flow files.

The folder for sequencing should look like the following.

G002
├── run0002
│   ├── 221006_VH00497_31_AAAVKCLHV
│   └── sample_manifest.csv
├── run0003
│   ├── 221019_VH00497_32_AAANGGVM5
│   └── sample_manifest.csv
└── run0004
 ├── 221101_VL00414_3_AACFYLCM5
 ├── 221103_VL00414_4_AAATJGYM5
 └── sample_manifest.csv

where a sample_manifest.csv looks like the following.

pool_number sorted_date vdj_sequencing_replicate cso_sequencing_replicate vdj_lirary_replicate cso_library_replicate bio_replicate vdj_index feature_index vdj_run_id cso_run_id
0 1 220927 0 0 0 0 0 SI-TT-D6 SI-TN-D6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
1 1 220928 0 0 0 0 0 SI-TT-E6 SI-TN-E6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
2 1 220929 0 0 0 0 0 SI-TT-F6 SI-TN-F6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
3 1 220930 0 0 0 0 0 SI-TT-G6 SI-TN-G6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
4 2 220930 0 0 0 0 0 SI-TT-H6 SI-TN-H6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV

To run the merge use the following.

$ g00x g002 validate merge -f /path/to/flow -s /path/to/sequencing -o merge.csv
---> 100%

You can use the following command to merge the files if you have the AWS Structure.

$ g00x g002 validate merge -s ./g002/G002/sequencing/G002 -f ./g002/G002/sorting/G002 -o merge.csv
---> 100%
   from g00x.sequencing.merge import merge_flow_and_sequencing
   from g00x.data import Data

out = 'merge.csv'
data = Data()
df = merge_flow_and_sequencing(data,"path/to/flow", "path/to/sequencing")
df.to_csv(out)

The merged file will use the sort pool and sorting date as the fields to join together the sequencing and flow data.

merge

Merged Fields

Below are all the fields available in the merged file. You may view the merged file through the api my_file = merge_flow_and_sequencing(data,"path/to/flow", "path/to/sequencing") and then my_file.data to view the data.

Column Definition
ptid The participant id, e.g. G002XXXX
group The trial group, 1-4
weeks The number of weeks post-vaccine
visit_id The visit id which corresponds to a weeks post vaccine
probe_set Which probe was the sample sorted with, eOD or Core
sample_type PBMC or GC
run_date :material-merge: The date of the sort
sort_pool :material-merge: Which pool was the sample put into
hashtag The hashtag oligo which will be used to demultiplex
run_dir_path Where the sequencing data path is located
pool_number :material-merge: The pool number is the same as the sort pool
sorted_date :material-merge: The date of the sort
vdj_sequencing_replicate How many times has the sample been sequenced for VDJ, index starts at zero
cso_sequencing_replicate How many times has the sample been sequenced for the feature library
vdj_lirary_replicate How many times has the same library been prepared for the VDJ
cso_library_replicate How many times has the same library been prepared for the feature
bio_replicate :material-bio: A biological replicate (brand new samples)
vdj_index The Illumina index to demultiplex the VDJ library
feature_index The Illumina index to demultiplex the feature library
vdj_run_id The Illumina assigned run id for the VDJ library
cso_run_id The Illumina assigned run id for the feature library

An example of five entries for the merged file is below.

ptid group weeks visit_id probe_set sample_type run_date sort_pool hashtag run_dir_path pool_number sorted_date vdj_sequencing_replicate cso_sequencing_replicate vdj_lirary_replicate cso_library_replicate bio_replicate vdj_index feature_index vdj_run_id cso_run_id
0 G002516 1 -5 V091 eODGT8 PBMC 2022-09-27 P01 HT01 /mnt/fsx/workspace/jwillis/repos/G00x/g002/G002/sequencing/G002/run0002 P01 2022-09-27 0 0 0 0 0 SI-TT-D6 SI-TN-D6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
1 G002516 1 4 V160 eODGT8 PBMC 2022-09-27 P01 HT02 /mnt/fsx/workspace/jwillis/repos/G00x/g002/G002/sequencing/G002/run0002 P01 2022-09-27 0 0 0 0 0 SI-TT-D6 SI-TN-D6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
2 G002516 1 8 V200 eODGT8 PBMC 2022-09-27 P01 HT03 /mnt/fsx/workspace/jwillis/repos/G00x/g002/G002/sequencing/G002/run0002 P01 2022-09-27 0 0 0 0 0 SI-TT-D6 SI-TN-D6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
3 G002855 2 -5 V091 eODGT8 PBMC 2022-09-28 P01 HT06 /mnt/fsx/workspace/jwillis/repos/G00x/g002/G002/sequencing/G002/run0002 P01 2022-09-28 0 0 0 0 0 SI-TT-E6 SI-TN-E6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV
4 G002855 2 4 V160 eODGT8 PBMC 2022-09-28 P01 HT07 /mnt/fsx/workspace/jwillis/repos/G00x/g002/G002/sequencing/G002/run0002 P01 2022-09-28 0 0 0 0 0 SI-TT-E6 SI-TN-E6 221006_VH00497_31_AAAVKCLHV 221006_VH00497_31_AAAVKCLHV

A downloadable merge file can be found here