Learn sample subannotations in peppy

This vignette shows how and why to use the subsample table functionality of the peppy package.

Problem/Goal

The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.

Solutions

Example 1: basic sample subannotation table

This example demonstrates how the sample subannotation functionality is used. Two samples (frog_1 and frog_2) have multiple input files that need merging, while one sample (frog_3) does not. Therefore, frog_3 takes its file value from the sample_table.csv file, while frog_1 and frog_2 get their files from the several rows listed for them in the subsample_table.csv file.

This example is made up of these components:

  • Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable1/"
project_config = examples_dir + "project_config.yaml"
%cat $project_config
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/example_results

  • Sample table:
sample_table = examples_dir + "sample_table.csv"
%cat $sample_table | column -t -s, | cat
sample_name  protocol       file
frog_1       anySampleType  multi
frog_2       anySampleType  multi
frog_3       anySampleType  multi

  • Subsample table:
subsample_table = examples_dir + "subsample_table.csv"
%cat $subsample_table | column -t -s, | cat
column: line too long
sample_name  subsample_name  file
frog_1       sub_a           data/frog1a_data.txt
frog_1       sub_b           data/frog1b_data.txt
frog_1       sub_c           data/frog1c_data.txt
frog_2       sub_a           data/frog2a_data.txt

Let's load the project config, create the Project object, and check whether multiple files are present:

from peppy import Project
p = Project(project_config)
samples = p.sample_table
samples
             file                                               protocol       sample_name  subsample_name
sample_name
frog_1       [data/frog1a_data.txt, data/frog1b_data.txt, d...  anySampleType  frog_1       [sub_a, sub_b, sub_c]
frog_2       [data/frog2a_data.txt, data/frog2b_data.txt]       anySampleType  frog_2       [sub_a, sub_b]
frog_3       multi                                              anySampleType  frog_3       NaN
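
The merge seen in this output can be imitated with the standard library alone. The block below is a minimal sketch of the idea, not peppy's implementation: `merge_subsamples` is a hypothetical helper, the CSV contents are typed in from the tables above, and the frog_2 / sub_b row is taken from the loaded output because the `column` listing of the subsample table does not show it.

```python
import csv
import io
from collections import defaultdict

# Contents copied from the tables above (assumption: plain comma-separated files).
SAMPLE_CSV = """sample_name,protocol,file
frog_1,anySampleType,multi
frog_2,anySampleType,multi
frog_3,anySampleType,multi
"""

# The frog_2/sub_b row is inferred from the loaded Project output.
SUBSAMPLE_CSV = """sample_name,subsample_name,file
frog_1,sub_a,data/frog1a_data.txt
frog_1,sub_b,data/frog1b_data.txt
frog_1,sub_c,data/frog1c_data.txt
frog_2,sub_a,data/frog2a_data.txt
frog_2,sub_b,data/frog2b_data.txt
"""


def merge_subsamples(sample_csv, subsample_csv):
    """Hypothetical helper: collapse subsample rows into per-sample lists."""
    samples = {
        row["sample_name"]: dict(row)
        for row in csv.DictReader(io.StringIO(sample_csv))
    }
    # sample_name -> attribute -> list of values, in subsample-table order
    collected = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(subsample_csv)):
        for key, value in row.items():
            if key != "sample_name":
                collected[row["sample_name"]][key].append(value)
    # Values gathered from the subsample table replace the sample-table values.
    for name, attrs in collected.items():
        samples[name].update(attrs)
    return samples


merged = merge_subsamples(SAMPLE_CSV, SUBSAMPLE_CSV)
for name, attrs in merged.items():
    print(name, attrs["file"])
```

In this sketch, samples that never appear in the subsample table (frog_3 here) keep whatever the sample table says, which mirrors the `multi` value shown for frog_3 above.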

Example 2: subannotations and derived attributes

This example uses a subsample_table.csv file together with a derived attribute to point to files. It is a rather complex example. Note that the file_id column must be included in the sample_table.csv file and left blank; it is then populated from subsample_table.csv for only some of the samples (frog_1 and frog_2) and stays empty for the samples that are not merged.

This example is made up of these components:

  • Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable2/"
project_config = examples_dir + "project_config.yaml"
%cat $project_config
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: [../pipeline/pipeline_interface.yaml]

sample_modifiers:
  derive:
    attributes: [file]
    sources:
      local_files: "../data/{identifier}{file_id}_data.txt"
      local_files_unmerged: "../data/{identifier}_data.txt"

  • Sample table:
sample_table = examples_dir + "sample_table.csv"
%cat $sample_table | column -t -s, | cat
column: line too long
sample_name  protocol       identifier  file
frog_1       anySampleType  frog1       local_files
frog_2       anySampleType  frog2       local_files
frog_3       anySampleType  frog3       local_files_unmerged

  • Subsample table:
subsample_table = examples_dir + "subsample_table.csv"
%cat $subsample_table | column -t -s, | cat
column: line too long
sample_name  file_id  subsample_name
frog_1       a        a
frog_1       b        b
frog_1       c        c
frog_2       a        a

Let's load the project config, create the Project object, and check whether multiple files are present:

p = Project(project_config)
samples = p.sample_table
samples
             file                                               file_id    identifier  protocol       sample_name  subsample_name
sample_name
frog_1       [../data/frog1a_data.txt, ../data/frog1b_data....  [a, b, c]  frog1       anySampleType  frog_1       [a, b, c]
frog_2       [../data/frog2a_data.txt, ../data/frog2b_data....  [a, b]     frog2       anySampleType  frog_2       [a, b]
frog_3       ../data/frog3_data.txt                             NaN        frog3       anySampleType  frog_3       NaN
frog_4       ../data/frog4_data.txt                             NaN        frog4       anySampleType  frog_4       NaN
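
How the derived `file` values above arise can be sketched with ordinary string formatting. This is an illustration under assumptions, not peppy's code: `derive_file` is a hypothetical helper, the two templates are copied from the project config, and the sample dictionaries stand in for rows after the subsample table has been merged.

```python
# Templates copied from the `sources` section of the project config above.
SOURCES = {
    "local_files": "../data/{identifier}{file_id}_data.txt",
    "local_files_unmerged": "../data/{identifier}_data.txt",
}


def derive_file(sample, sources=SOURCES):
    """Hypothetical helper: replace the source key in `file` with real path(s)."""
    template = sources[sample["file"]]
    file_ids = sample.get("file_id")
    if isinstance(file_ids, list):
        # One path per subsample: the template is filled once per file_id.
        return [
            template.format(identifier=sample["identifier"], file_id=file_id)
            for file_id in file_ids
        ]
    # No subsamples: fill the template once (an absent file_id becomes "").
    return template.format(identifier=sample["identifier"], file_id=file_ids or "")


# Stand-ins for merged samples (attribute values taken from the tables above).
frog_1 = {"identifier": "frog1", "file": "local_files", "file_id": ["a", "b", "c"]}
frog_3 = {"identifier": "frog3", "file": "local_files_unmerged"}

frog_1_files = derive_file(frog_1)
frog_3_file = derive_file(frog_3)
print(frog_1_files)
print(frog_3_file)
```

The sketch shows why the merged samples need the `local_files` source, whose template includes `{file_id}`, while the unmerged ones use `local_files_unmerged`, whose template does not.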

Example 3: subannotations and expansion characters

This example gives exactly the same results as Example 2, but uses a wildcard for frog_2 instead of listing it in the subsample_table.csv file. Since a wildcard and a subannotation can't be used for the same sample, frog_2 has to point to a second data source class (local_files_unmerged) whose template contains an asterisk (*). The outcome is the same (the file columns match).

This example is made up of these components:

  • Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable3/"
# need to cd to the example dir so that the glob works as expected
%cd $examples_dir 
project_config = "project_config.yaml"
%cat $project_config
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: [../pipeline/pipeline_interface.yaml]

sample_modifiers:
  derive:
    attributes: [file]
    sources:
      local_files: "../data/{identifier}{file_id}_data.txt"
      local_files_unmerged: "../data/{identifier}*_data.txt"


  • Sample table:
%cat sample_table.csv | column -t -s, | cat
sample_name  protocol       identifier  file                  file_id
frog_1       anySampleType  frog1       local_files
frog_2       anySampleType  frog2       local_files_unmerged
frog_3       anySampleType  frog3       local_files_unmerged
frog_4       anySampleType  frog4       local_files_unmerged

  • Subsample table:
%cat subsample_table.csv | column -t -s, | cat
sample_name  file_id
frog_1       a
frog_1       b
frog_1       c

Let's load the project config, create the Project object, and check whether multiple files are present:

p = Project(project_config)
samples = p.sample_table
samples
             file                                               file_id    identifier  protocol       sample_name  subsample_name
sample_name
frog_1       [../data/frog1a_data.txt, ../data/frog1b_data....  [a, b, c]  frog1       anySampleType  frog_1       [0, 1, 2]
frog_2       [../data/frog2_data.txt, ../data/frog2a_data.t...  NaN        frog2       anySampleType  frog_2       NaN
frog_3       ../data/frog3_data.txt                             NaN        frog3       anySampleType  frog_3       NaN
frog_4       ../data/frog4_data.txt                             NaN        frog4       anySampleType  frog_4       NaN
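
The wildcard behaviour above can be reproduced with Python's `glob` module. The block below is a rough sketch rather than peppy's implementation: `expand_wildcard` is a hypothetical helper, and the files are created in a temporary directory with assumed names (`frog2_data.txt`, `frog2a_data.txt`, `frog2b_data.txt`, `frog3_data.txt`) so the example does not depend on the repository layout.

```python
import glob
import os
import tempfile

# File-name part of the local_files_unmerged template from the config above.
TEMPLATE = "{identifier}*_data.txt"


def expand_wildcard(pattern):
    """Hypothetical helper: expand a glob pattern into matching path(s)."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        return pattern  # nothing matched: leave the pattern untouched
    # A single match stays a plain string; several matches become a list.
    return matches[0] if len(matches) == 1 else matches


with tempfile.TemporaryDirectory() as data_dir:
    # Assumed file names, created empty just for this illustration.
    for name in ("frog2_data.txt", "frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt"):
        open(os.path.join(data_dir, name), "w").close()

    frog_2 = expand_wildcard(os.path.join(data_dir, TEMPLATE.format(identifier="frog2")))
    frog_3 = expand_wildcard(os.path.join(data_dir, TEMPLATE.format(identifier="frog3")))

frog_2_names = [os.path.basename(path) for path in frog_2]
frog_3_name = os.path.basename(frog_3)
print(frog_2_names)
print(frog_3_name)
```

Because glob patterns are resolved relative to the current working directory, a relative template such as `../data/{identifier}*_data.txt` only matches when the process runs from the expected directory, which is why the notebook changes into the example directory first.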