badirater
badirater.Rmd
badirater
?badirater
is an R package developed to make genomic studies easier using family turnover rate estimations with BadiRate
. It enables genome wide studies along phylogenetic trees from within R, including data preparation, help for execution and downstream processing.
BadiRate
is a versatile program which enables family turnover rate analysis, including gene family expansion and collapsion detections, however it’s structure and style makes it difficult to work with in large-scale studies.
Standard method includes at least one run per family/orthogroup (og). This will be expanded by comparison of different branch models and technical replicates. Formerly is necessary to infer model estimation accuracy, later is to avoid local maxima detection instead of global one. If we consider only 3 replications with 3 different models on ~10.000 gene families, it is clear that we should run BadiRate ~90.000 times. This can be a reason for the inexperienced users to leave the program and look for similar, but less versatile ones.
badirater
enables to automate the whole process, including data preprocessing, batch run preparation, data collection and tidying and downstream calculations and processes.
badirater
uses reguler UNIX features, e.g. awk
, hence it is required to be used on UNIX based systems. Since BadiRate is a perl program and we expect batch processing, it is recommended to process on an available cluster computer with Torque/PBS or SLURM.
There are just two input files are required:
orthofinder
(Orthogroups.GeneCount.csv) or a similarly formatted file.Also you have to consider the branch models you would like to test. Regularly we run at least three different models: a global rate, a free rate and a species-specific rate model:
First let’s arrange our working directory as follows:
- badirate_analysis/
|- Orthogroups.GeneCount.csv
|- SpeciesTree.Ultrametric.tree
Since we would like to run BadiRate by gene family/orthogroup, we should split the files into individual components. All the new files will be stored in one directory.
library(badirater)
setwd("../test/")
split_orthogroups("../test/obp_sub.12sp.tsv", max_count = 40, og_path = "../test/badirate_orthogroups")
This created the following:
- badirate_analysis/
|- badirate_orthogroups/
|- OGxxxxxx.txt
|- ...
|- OGyyyyyy.txt
|- Orthogroups.GeneCount.csv
|- SpeciesTree.Ultrametric.tree
Now we have all the family files prepared, we can set up the experiment for BadiRate. Create the branch models object you would like to select, and set up the script files.
perl BadiRate.pl –treefile examples/droso.6sp.tamura.nwk –sizefile examples/obp_all.12sp.tsv –anc
bd_path = "/Users/gergo/Downloads/badirate-1.35/"
branch_numbers(tree = "../test/droso.12sp.tamura.nwk",
og = "../test/obp_sub.12sp.tsv",
badirate_path = bd_path,
plot = TRUE,
tree_file = "../test/branch_ids.tree")
Now we see the branch numbering and can describe branch models. Important, that each branch model should have an exactly 2-letter long unique name. In this case, we prepare 4 models, one global rate and one free rate model and 2 branch specific ones (sp
and fm
), both with one group and background.
branch_models = c(
"gr" = "GR",
"sp" = "22->21"
)
As we decided our models and every files are ready, we can prepare the running scripts. Don’t forget to save the output of the prepare_badirate()
for future use. If you forgot, still you can rerun with create_scripts = "none"
option.
setup_table <- prepare_badirate(og_path = "./badirate_orthogroups/",
tree = "./droso.12sp.tamura.nwk",
branch_models = branch_models,
rate_model = "GD",
out_dir = "./raw_outputs",
script_dir = "./scripts",
replicates = 2,
ancestral = TRUE,
outlier = TRUE,
seed = 20180808,
start_value = 1,
pbs_q = "small",
badirate_path = bd_path,
create_scripts = "pbs")
setup_table
Don’t forget to save the setup_table
object!
readr::write_csv(setup_table, "./setup_table.csv")
Now our working directory should look like this:
- badirate_analysis/
|- raw_outputs/
|- scripts/
|- badi_gr_script.pbs
|- badi_fr_script.pbs
|- badi_sp_script.pbs
|- badi_fm_script.pbs
|- badirate_orthogroups/
|- OGxxxxxx.txt
|- ...
|- OGyyyyyy.txt
|- setup_table.csv
|- Orthogroups.GeneCount.csv
|- SpeciesTree.Ultrametric.tree
There are several ways you can run the created jobs in the ./scripts
directory. If your system has available PBS or SLURM, you could create ready-to-run scripts with preapre_badirate()
, otherwise you can use those scripts as a starting point and run in your command line directly 1.
In badirater
there are some built in function to be able to submit and monitor PBS and SLURM jobs without leaving R.
After all the jobs finished (may take a while), it is time to collect the important data from each files in the ./raw_outputs
directory.
- badirate_analysis/
|- raw_outputs/
|- gr/
|- OGxxxxxx.txt.gr01.bd
|- ...
|- fr/
|- OGxxxxxx.txt.fr01.bd
|- ...
|- ...
|- scripts/
|- badi_gr_script.pbs
|- badi_fr_script.pbs
|- badi_sp_script.pbs
|- badi_fm_script.pbs
|- badirate_orthogroups/
|- OGxxxxxx.txt
|- ...
|- OGyyyyyy.txt
|- setup_table.csv
|- Orthogroups.GeneCount.csv
|- SpeciesTree.Ultrametric.tree
Use the bd_collect()
function for automatic collection. It may take a while, depending on the volume of the experiment.
This will collect 4 files in each branch model - replication group:
nnXX.branch_code.txt
: Branch decodingsnnXX.likelihood.txt
: Likelihood values for each families/orthogroupsnnXX.parameters.txt
: Calculated turnover ratesnnXX.gains.txt
: Minimum gain and loss information per branchsetup_table <- readr::read_csv("../test/setup_table.csv")
bd_collect(setup_table, out_dir = "../test/results")
- badirate_analysis/
|- results/
|- grXX.branch_code.txt
|- grXX.likelihood.txt
|- grXX.parameters.txt
|- grXX.gains.txt
|- ...
|- raw_outputs/
|- gr/
|- OGxxxxxx.txt.gr01.bd
|- ...
|- fr/
|- OGxxxxxx.txt.fr01.bd
|- ...
|- ...
|- scripts/
|- badi_gr_script.pbs
|- badi_fr_script.pbs
|- badi_sp_script.pbs
|- badi_fm_script.pbs
|- badirate_orthogroups/
|- OGxxxxxx.txt
|- ...
|- OGyyyyyy.txt
|- setup_table.csv
|- Orthogroups.GeneCount.csv
|- SpeciesTree.Ultrametric.tree
dir.create("./analysis")
Here we use AIC (Akaike Information Criteria) to select the best replicates per model per family and wAIC (weighted AIC) to select the best models per orthogroup and assest the model significance.
best_models <- bd_model_select(setup_table = setup_table,
results_dir = "./results")
readr::write_csv(best_models, "./analysis/best_models.csv")
best_models
The output is tibble with one line per family, with the selected best model, its AIC and wAIC number, and its wAIC ratio to the second best model (it is accepted as significant if wAIC ratio is greater then 2.7, which means log(2.7) = 0.99325 is the probability that we selected a more valid model). This is corresponding to the signif
column.
We can parse gain/loss events for each orthogroup from the best models.
gains <- bd_extract_gain(best_models = best_models, results_dir = "./results")
gains
Now we have most of the information on significant gain/loss events.
This document was generated on the following system:
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocStyle_2.4.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.18 rstudioapi_0.7 knitr_1.20 xml2_1.1.1
## [5] magrittr_1.5 roxygen2_6.1.0 MASS_7.3-47 R6_2.2.2
## [9] rlang_0.2.1 stringr_1.3.1 tools_3.4.1 htmltools_0.3.6
## [13] commonmark_1.5 yaml_2.2.0 rprojroot_1.2 digest_0.6.15
## [17] assertthat_0.2.0 pkgdown_1.1.0 crayon_1.3.4 fs_1.2.5
## [21] memoise_1.1.0 evaluate_0.11 rmarkdown_1.6 stringi_1.2.4
## [25] compiler_3.4.1 desc_1.2.0 backports_1.1.2
Too complex or too many jobs can run for days. Be careful to save everything if you leave the computer.↩