Introduction to seeker

The main function in the seeker package is, well, seeker(). Currently seeker() is targeted at processing RNA-seq data. The main input is a list of parameters specifying which steps of RNA-seq data processing to perform and how to perform them. Depending on the parameters, seeker() will call other functions in the package.

A convenient way to construct the list of parameters is to make a yaml file and read it into R using yaml::read_yaml(). A template yaml file is reproduced below and available at system.file('extdata', 'params_template.yml', package = 'seeker').

study: '' # [string]
metadata:
  run: TRUE # [logical]
  bioproject: '' # [string]
  # include # [named list or NULL]
    # colname # [string]
    # values # [vector]
  # exclude # [named list or NULL]
    # colname # [string]
    # values # [vector]
fetch:
  run: TRUE # [logical]
  # overwrite # [logical or NULL]
  # ascpCmd # [string or NULL]
  # ascpArgs # [character vector or NULL]
  # ascpPrefix # [string or NULL]
trimgalore:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
fastqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
salmon:
  run: TRUE # [logical]
  indexDir: '' # [string]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
multiqc:
  run: TRUE # [logical]
  # cmd # [string or NULL]
  # args # [character vector or NULL]
tximport:
  run: TRUE # [logical]
  tx2gene:
    # [named list or NULL]
    dataset: 'mmusculus_gene_ensembl' # [string]
    version: 104 # [number; latest version is 104 as of Oct 2021]
  countsFromAbundance: '' # [string]
  # ignoreTxVersion # [logical or NULL]

A convenient way to run seeker() is then to use a script such as the one reproduced below, which is also available at system.file('extdata', 'run_seeker.R', package = 'seeker').

doParallel::registerDoParallel()

cArgs = commandArgs(TRUE)
yamlPath = cArgs[1L]
parentDir = cArgs[2L]

params = yaml::read_yaml(yamlPath)
seeker::seeker(params, parentDir)

If you copy the script to your current working directory, you can run it using something like

Rscript run_seeker.R <path/to/study>.yml <path/to/parent/directory>

A fancier option, which saves stdout and stderr to a log file, would be something like

study="<study>" && \
parentDir="<path/to/parent/directory>" && \
mkdir -p "${parentDir}/${study}" && \
Rscript run_seeker.R "<path/to>/${study}.yml" "${parentDir}" &> \
  "${parentDir}/${study}/progress.log"

This option assumes that the name of the yaml file (minus the file extension) is identical to the value of study within the yaml file, a convention we highly recommend.
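One way to enforce this naming convention before starting a long run is sketched below. The helpers check_study_name and run_study are hypothetical and not part of seeker. The sketch assumes the yaml file has a top-level line of the form study: 'name', as in the template, and that run_seeker.R is in the current working directory. It uses > file 2>&1 rather than &> so that it also works in shells other than bash.

```shell
# Hypothetical helpers (not part of seeker), written for POSIX sh.

# Succeed if the yaml file name (minus .yml) equals its top-level study value.
# Assumes a line of the form: study: 'name' # optional comment
check_study_name() {
  yamlPath="$1"
  fileStudy="$(basename "${yamlPath}" .yml)"
  # Keep the first top-level "study:" line, then drop the key, any
  # trailing comment, trailing whitespace, and the surrounding quotes.
  yamlStudy="$(sed -n 's/^study:[[:space:]]*//p' "${yamlPath}" | head -n 1 \
    | sed -e 's/#.*$//' -e 's/[[:space:]]*$//' | tr -d "\"'")"
  if [ "${fileStudy}" = "${yamlStudy}" ]; then
    return 0
  fi
  echo "study mismatch: file '${fileStudy}', yaml '${yamlStudy}'" >&2
  return 1
}

# Check the name, then run the script and save stdout and stderr to
# <parentDir>/<study>/progress.log.
run_study() {
  yamlPath="$1"
  parentDir="$2"
  check_study_name "${yamlPath}" || return 1
  study="$(basename "${yamlPath}" .yml)"
  mkdir -p "${parentDir}/${study}"
  Rscript run_seeker.R "${yamlPath}" "${parentDir}" \
    > "${parentDir}/${study}/progress.log" 2>&1
}
```

After sourcing these definitions, a call such as run_study "<path/to>/<study>.yml" "<path/to/parent/directory>" would stop early with a message on stderr if the names differ, and otherwise behave like the logging command above.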

Jake Hughey

2021-10-27