Phylobayes MPI on ACCESS

1.8c
Phylogenetic reconstruction using infinite mixtures - run on XSEDE
Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart

Nicolas Lartillot, Thomas Lepage, and Samuel Blanquart. 2009. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25(17): 2286–2288.
Lartillot, N., Rodrigue, N., Stubbs, D., Richer, J. 2013. PhyloBayes MPI: Phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Systematic Biology.

Phylogeny / Alignment
phylobayes_xsede

invocation1
	perl "PBMultiRun_expanse.bash"
	0

dna_conf1 3 scheduler.conf
	perl $datatype ne "PROTEIN" && $num_sites < 2000
	perl
		"jobtype=mpi\\n" .
		"workflow_type=mpi_complex\\n" .
		"nodes=1\\n" .
		"cpus-per-task=1\\n" .
		"mem=92G\\n" .
		"node_exclusive=0\\n" .
		"mpi_processes=48\\n"

dna_conf2 3 scheduler.conf
	perl $datatype ne "PROTEIN" && $num_sites > 1999
	perl
		"jobtype=mpi\\n" .
		"workflow_type=mpi_complex\\n" .
		"nodes=1\\n" .
		"mem=123G\\n" .
		"cpus-per-task=1\\n" .
		"node_exclusive=0\\n" .
		"mpi_processes=64\\n"

prot_conf1 3 scheduler.conf
	perl $datatype eq "PROTEIN" && $num_sites < 1000
	perl
		"jobtype=mpi\\n" .
		"workflow_type=mpi_complex\\n" .
		"cpus-per-task=1\\n" .
		"nodes=1\\n" .
		"mem=123G\\n" .
		"node_exclusive=0\\n" .
		"mpi_processes=64\\n"

prot_conf2 3 scheduler.conf
	perl $datatype eq "PROTEIN" && $num_sites > 999
	perl
		"jobtype=mpi\\n" .
		"workflow_type=mpi_complex\\n" .
		"cpus-per-task=1\\n" .
		"mem=186G\\n" .
		"nodes=1\\n" .
		"node_exclusive=0\\n" .
		"mpi_processes=96\\n"

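The four scheduler.conf variants above differ only in memory and MPI process count, chosen from the data type and the number of sites. The selection can be sketched in Python as follows. This is an illustration only: the function names `scheduler_settings` and `core_hours` are hypothetical and are not part of the gateway, and the thresholds are copied from the Perl conditions above.

```python
def scheduler_settings(datatype, num_sites):
    """Return the scheduler.conf values that the four Perl conditions select.

    Hypothetical helper for illustration; the real selection is done by the
    Perl preconditions in the tool definition.
    """
    is_protein = datatype == "PROTEIN"
    if not is_protein and num_sites < 2000:      # dna_conf1
        mem, procs = "92G", 48
    elif not is_protein:                         # dna_conf2 (num_sites > 1999)
        mem, procs = "123G", 64
    elif num_sites < 1000:                       # prot_conf1
        mem, procs = "123G", 64
    else:                                        # prot_conf2 (num_sites > 999)
        mem, procs = "186G", 96
    return {
        "jobtype": "mpi",
        "workflow_type": "mpi_complex",
        "nodes": 1,
        "cpus-per-task": 1,
        "mem": mem,
        "node_exclusive": 0,
        "mpi_processes": procs,
    }


def core_hours(datatype, num_sites, runtime_hours):
    """Core hours consumed if the job runs for the full configured time."""
    return scheduler_settings(datatype, num_sites)["mpi_processes"] * runtime_hours


if __name__ == "__main__":
    print(scheduler_settings("DNA", 1500))
    print(core_hours("PROTEIN", 1200, 2.0))
```

Note that any data type other than "PROTEIN" (including "unk") follows the DNA thresholds, because the Perl conditions test only `$datatype ne "PROTEIN"`.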
infile
	Input Data (relaxed phylip format)
	perl "-d infile.phy"
	41
	infile.phy

ALL_FILES
	*

runtime 1 scheduler.conf
	Maximum Hours to Run (click here for help setting this correctly)
	0.5
	The maximum hours to run must not exceed 168
		perl $runtime > 168.0
	Please enter a positive number for the maximum runtime
		perl $runtime < 0
	Please specify a maximum runtime
		perl !defined $runtime
	perl
		"remove_mv2_param=1\\n" .
		"runhours=$value\\n"

The job will run on 48 cores as configured. If it runs for the entire configured time, it will consume 48 x $runtime core (cpu) hours
	perl $datatype ne "PROTEIN" && $num_sites < 2000
The job will run on 64 cores as configured. If it runs for the entire configured time, it will consume 64 x $runtime core (cpu) hours
	perl $datatype ne "PROTEIN" && $num_sites > 1999
The job will run on 64 cores as configured. If it runs for the entire configured time, it will consume 64 x $runtime core (cpu) hours
	perl $datatype eq "PROTEIN" && $num_sites < 1000
The job will run on 96 cores as configured. If it runs for the entire configured time, it will consume 96 x $runtime core (cpu) hours
	perl $datatype eq "PROTEIN" && $num_sites > 999

Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue. Once you are sure the configuration is correct, you can increase the time. Jobs longer than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may run sooner than jobs configured for the full 168 hours.

num_sites
	Number of sites in the data set
	Please specify the number of sites (columns) in your data set.
		perl !defined $num_sites
	You can make a quick run and find this value in the stderr file.

datatype
	My data set contains
	DNA DNA
	PROTEIN Amino acids
	unk other
	Please specify the type of data.
		perl !defined $datatype

bryan_script
	Convergence Parameters

set_seed 2
	Enter seed value
	perl defined $set_seed ? "--seed $value":""
	Base random seed to use. Each of the N chains will add its id to that. Setting a seed value allows the user to reproduce specific runs.

set_chains 4
	Please specify the number of separate chains to run
	perl "-N 2"

checkinterval 6
	Number of seconds to wait between checks for convergence.
	perl "--CHECKINTERVAL $value"
	600
	Please specify the number of seconds to wait between checks for convergence.
		perl !defined $checkinterval

burninval 8
	Number of cycles to exclude from convergence checks.
	perl "--BURNIN $value"
	500
	Please specify the number of cycles to exclude from convergence checks.
		perl !defined $burninval

acceptdiffval 10
	Maximum "maxdiff" value to accept.
	perl "--ACCEPTDIFF $value"
	0.1
	Please specify the maximum "maxdiff" value to accept. The manual says 0.3 is "acceptable" and 0.1 is "very good."
		perl !defined $acceptdiffval

acceptsizeval 12
	Minimum Effective Size
	perl "--ACCEPTSIZE $value"
	50
	Please specify the minimum number of effective samples required before convergence. Default = 50
		perl !defined $acceptsizeval

giveupval 14
	Minimum number of cycles before giving up if "maxdiff" is still 1 (Analysis is stuck)
	perl "--GIVEUP $value"
	10000
	Please specify the minimum number of cycles before giving up if "maxdiff" is still 1 (Analysis is stuck). Default = 10000
		perl !defined $giveupval

end_of_wrapper 40
	perl ""

bryan_script
	Convergence Parameters

constant_sitesr 42
	Remove Constant Sites (-dc)
	perl ($value) ? " -dc":""
	1

starting_tree 44
	Specify a Newick Starting Tree (-tree)
	treefile.phy
	The -t (starting tree) option forces the chain to start from the specified tree.

starting_tree2 44
	perl defined $starting_tree
	perl "-t treefile.phy"
	The -t (starting tree) option forces the chain to start from the specified tree.

constraint_tree 46
	Specify a fixed topology (in the selected file) (-T)
	constrfile.phy
	This option forces the chain to run under a fixed topology (as specified in the given file). In other words, the chain only samples from the posterior distribution over all other parameters (branch lengths, alpha parameter, etc.), conditional on the specified topology. This should be a bifurcating tree (see Input Data Format section of the PB MPI manual).
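
The convergence parameters listed earlier in this section (set_seed through giveupval) each contribute one fragment to the wrapper's command line. A minimal Python sketch of that assembly follows. It assumes the fragments are simply joined in the order the parameters are listed, and the helper name `convergence_args` is hypothetical. The flag names and default values are the ones given in the parameter definitions.

```python
# Defaults taken from the parameter definitions (600, 500, 0.1, 50, 10000).
DEFAULTS = {
    "CHECKINTERVAL": 600,   # seconds between convergence checks
    "BURNIN": 500,          # cycles excluded from convergence checks
    "ACCEPTDIFF": 0.1,      # maximum accepted "maxdiff"
    "ACCEPTSIZE": 50,       # minimum effective size
    "GIVEUP": 10000,        # cycles before giving up while "maxdiff" is still 1
}


def convergence_args(seed=None, **overrides):
    """Build the wrapper arguments for the convergence parameters.

    Hypothetical helper for illustration; it mirrors the Perl format
    fragments but is not part of the gateway or of PhyloBayes.
    """
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown convergence option(s): {sorted(unknown)}")
    values = {**DEFAULTS, **overrides}

    args = []
    if seed is not None:             # "--seed" is emitted only when a seed is set
        args += ["--seed", str(seed)]
    args += ["-N", "2"]              # the set_chains fragment is the fixed string "-N 2"
    for name in DEFAULTS:            # dicts keep insertion order
        args += [f"--{name}", str(values[name])]
    return args


if __name__ == "__main__":
    print(" ".join(convergence_args(seed=42, ACCEPTDIFF=0.3)))
```

Leaving the seed unset drops the `--seed` fragment entirely, which matches the `defined $set_seed ? ... : ""` test in the set_seed definition.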

constraint_tree2 46
	perl defined $constraint_tree
	perl "-T constrfile.phy"
	This option forces the chain to run under a fixed topology (as specified in the given file). In other words, the chain only samples from the posterior distribution over all other parameters (branch lengths, alpha parameter, etc.), conditional on the specified topology. This should be a bifurcating tree (see Input Data Format section of the PB MPI manual).

saving_freq 50
	Save every (-x)
	perl "-x $value $stop_at"
	1
	Please specify the saving frequency
		perl !$saving_freq
	This option specifies the saving frequency and (optionally) the number of points after which the chain should stop. If this number is not specified, the chain runs "forever". By definition, -x 1 corresponds to the default saving frequency. In some cases, samples may be strongly correlated; if disk space or access is limiting, it then makes sense to save points less frequently, say 10 times less often, by using the -x 10 option.

stop_at 51
	Number of points after which the chain should stop (-1 = forever)
	-1
	Please specify the number of points after which a chain should stop
		perl !$stop_at
	This option specifies the number of points after which the chain should stop. If this number is not specified, the chain runs "forever".

save_detmodelconf 54
	Save detailed model configuration for each point (-s)
	perl ($value) ? "-s":"-S"
	Using the -s option may cause very large output files to be produced.
		perl $save_detmodelconf
	By default, the sampler only saves the trees explored during MCMC in the treelist file (and the summary statistics in the trace file). This is enough for computing the consensus tree but insufficient for estimating the continuous parameters of the model (e.g. site-specific equilibrium frequency profiles) or for conducting posterior predictive tests or cross-validation analyses. For this, you should save the detailed model configuration for each point visited during the run using this -s option.

evolutionary_models
	Evolutionary Models

num_gammacats 58
	Number of Categories for the discrete gamma distribution (-dgam)
	perl (defined $value) ? "-dgam $value":""
	This specifies n categories for the discrete gamma distribution. Setting n = 1 amounts to a model without across-site variation in substitution rate.

exchange_rates 64
	Exchange Rates
	perl "$value"
	-gtr GTR
	-poisson Poisson
	-lg LG
	-wag WAG
	-jtt JTT
	-mtrev MTREV
	-mtzoa MTZOA
	-mtart MTART
	-rr Custom
	-poisson
	This selects the exchangeabilities (relative exchange rates) of the substitution model. Choose Custom (-rr) to supply them from a file.

custom_exch_file
	Select the Custom Exchangeabilities File
	perl $exchange_rates eq "-rr"
	65
	perl "exchange_rate.txt"
	exchange_rate.txt
	Please select a Custom Exchange Rate File
		perl $exchange_rates eq "-rr" && !defined $custom_exch_file
	This option allows a custom exchange rate file. Exchangeabilities are fixed to the values given in the specified file. The file should be formatted as follows:
		[ALPHABET]
		rr1_2 rr1_3 ... rr1_20
		rr2_3 rr2_4 ... rr2_20
		...
		rr18_19 rr18_20
		rr19_20
	You have to specify the order in which amino acids should be considered on the first line ([ALPHABET]), with letters separated by spaces or tabs. This header should then be followed by the exchangeabilities in the order specified (spaces, tabs or returns are equivalent: only the order matters).

profile_mixture 60
	Profile Mixture
	-dp CAT (Dirichlet Process)
	ncatn Mixture of N components
	catfix Pre-defined profiles
	custom Custom Profile
	-dp "-cat"
	ncatn "-ncat $categories"
	catfix "-catfix $predefined_profiles"
	custom "catfix userprofiles.txt"
	-dp
	Fixing the number of components of the mixture most often results in a poor mixing of the MCMC
		perl $profile_mixture eq "ncatn"
	-dp (or -cat) activates the Dirichlet process. Mixture of N Components (-ncat n) specifies a mixture of n components; the number of components is fixed whereas the weights and profiles are treated as random variables. Fixing the number of components of the mixture most often results in a poor mixing of the MCMC. The Dirichlet process usually has a much better mixing behavior.

categories
	Number of Fixed Components (-ncat N)
	perl $profile_mixture eq "ncatn"
	Please enter the number of fixed components
		perl $profile_mixture eq "ncatn" && !defined $categories
	This specifies the number n of components in the mixture (-ncat n). The number of components is fixed, whereas the weights and profiles are treated as random variables.

predefined_profiles
	Choose a Predefined Profile
	perl $profile_mixture eq "catfix"
	WLSR5 WLSR5
	C20 C20
	C30 C30
	C40 C40
	C50 C50
	C60 C60
	Please select a pre-defined profile
		perl $profile_mixture eq "catfix" && !defined $value
	The selected value specifies a mixture of a set of pre-defined profiles (the weights are re-estimated). predef can be one of the following keywords: C20, C30, C40, C50, C60, which correspond to empirical profile mixture models (Quang et al., 2008); or WLSR5, which corresponds to the model of Wang et al. (2008). Note that this latter model actually defines 4 empirical profiles, which are then combined with a fifth component made of the empirical frequencies of the dataset.

custom_profile
	Select the Profile of Equilibrium Frequencies File
	perl $profile_mixture eq "custom"
	userprofiles.txt
	Please select a Profile of Equilibrium Frequencies File
		perl $profile_mixture eq "custom"

mutsel_model 70
	Use Codon Alignments Only
	perl ($value)? "-mutsel":""
	0
	This option activates the mutation-selection model as described in Rodrigue et al. (2010) (codon alignments only).

mtvert_codons 72
	Use Vertebrate Mitochondrial Genetic Code
	perl ($value)? "-mtvert":""
	0
	This option selects the vertebrate mitochondrial genetic code (codon alignments only).
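
The custom exchangeabilities file layout described under custom_exch_file (an alphabet line followed by the upper triangle of rates, row by row) can be checked before upload with a short script. The Python sketch below is an assumption-laden illustration: `parse_exchangeabilities` is a hypothetical helper, not part of PhyloBayes or the gateway, and it only checks the token count and order described in the help text. A real amino acid file has 20 letters and therefore 190 rates. The demo uses a 4-letter alphabet only to stay short.

```python
from itertools import combinations


def parse_exchangeabilities(text):
    """Parse an exchangeabilities file into a {(a, b): rate} dictionary.

    Hypothetical helper for illustration. The first line lists the alphabet
    (letters separated by spaces or tabs). The remaining tokens are the rates
    rr1_2, rr1_3, ..., rr(n-1)_n. Line breaks among them carry no meaning.
    """
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty file")

    alphabet = lines[0].split()
    n = len(alphabet)
    if len(set(alphabet)) != n:
        raise ValueError("alphabet contains repeated letters")

    # Spaces, tabs and returns are equivalent: flatten everything after line 1.
    values = [float(tok) for line in lines[1:] for tok in line.split()]
    expected = n * (n - 1) // 2
    if len(values) != expected:
        raise ValueError(f"expected {expected} rates for {n} letters, got {len(values)}")

    # combinations() yields (1,2), (1,3), ..., (1,n), (2,3), ... in file order.
    return dict(zip(combinations(alphabet, 2), values))


if __name__ == "__main__":
    demo = "A C G T\n1.0 2.0 3.0\n4.0 5.0\n6.0\n"
    for pair, rate in parse_exchangeabilities(demo).items():
        print(pair, rate)
```

Because only the order of the values matters, the same check accepts a file with all rates on a single line.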