<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pise PUBLIC "pise2.dtd" "pise2.dtd" >
<pise>
	<head>
		<title>Noisy on XSEDE</title>
		<version>1.5.12</version>
		<description>Identify homoplastic characters in multiple sequence alignments - run on XSEDE</description>
		<authors>Christoph Flamm, Sonja J Prohaska, Guido Fritzsch, Peter F Stadler</authors>
		<reference>
			Andreas W. M. Dress, Christoph Flamm, Guido Fritzsch, Stefan Grünewald, Matthias Kruspe, Sonja J. Prohaska, Peter F. Stadler. Noisy: identification of problematic columns in multiple sequence alignments. Algorithms Mol Biol, 3:7 (2008). doi:10.1186/1748-7188-3-7
		</reference>
		<reference>
Stefan Grünewald, Kristoffer Forslund, Andreas W. M. Dress, Vincent Moulton. QNet: An Agglomerative Method for the Construction of Phylogenetic Networks from Weighted Quartets. Mol Biol Evol, 24(2):532-538 (2007). doi:10.1093/molbev/msl180
		</reference>
		<reference>
Bryant, David and Moulton, Vincent (2004) Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Mol. Biol. Evol. 21:255-265
		</reference>
		<category>Phylogeny / Alignment</category>
	</head>
	
	<command>noisy_xsede</command>

<parameters>
<!-- original version by mamiller, around 5/30/2020 -->
<!-- 

SYNOPSIS
noisy [-*cutoff FLOAT] [-*distance STRING] [-*help] [-*matrix FILE] [-*missing STRING] [-*nogap] [-*noconstant] [-*ordering STRING] [-*reorder] [-*shuffles INT] [-*silent] [-*smooth INT] [-*seqtype CHAR] [-*verbose]

 
DESCRIPTION
noisy 
In a first phase, the rows of the input multiple sequence alignment (MSA) in multi fasta format are
reordered to conform to a circular ordering. For this purpose noisy includes the corresponding subset of
routines from David Bryant and Vincent Moulton's NeighborNet and Stefan Gruenewald's QNet packages. 

Subsequently, a reliability score for each column of the reordered MSA is calculated. Essentially, the 
number of character state alterations in an alignment column is counted and compared to the observed 
count in random shufflings of the column. The uniform pseudo-random number generator Mersenne Twister 
is used to generate the random shufflings of alignment columns.

noisy exports a PostScript file, visualizing the quality of the columns of the reordered input MSA, the 
reliability score of all columns of the reordered input MSA as xy-data and a modified alignment in which
columns with a reliability smaller than a cutoff value (set via option -*cutoff) are removed. The 
program noisy is written in ISO C++. The source code is available from

http://www.bioinf.uni-leipzig.de/Software/noisy/.  

OPTIONS

-*cutoff FLOAT
    Set the lower bound of the reliability score for an alignment column to FLOAT. Columns with a score below FLOAT are removed from the output alignment. The name of the output MSA is constructed from the base name of the input MSA by appending the suffix _out.fas 
-*distance HAMMING|GTR
    Set distance calculation of NeighborNet to HAMMING or GTR 
-h, -*help
    Display usage information. 
-*matrix FILE
    Read the distance matrix used by NeighborNet to generate the cyclic order from FILE instead of letting NeighborNet calculate the distance matrix by one of the methods given to option -*distance. 
-*missing STRING
    Each character of STRING is treated as missing data and is removed from a column before changes between character states are calculated. 
-*nogap
    Add the gap symbol to the set of missing characters. 
-*noconstant
    Suppress constant columns in the output MSA. 
-*ordering nnet|qnet|rand[,INT]|all|INT(,INT)*
    Set the method to calculate the cyclic order to one of the two major methods: NeighborNet (the default) or QNet.

    With rand, the reliability score is calculated for a random sample of all possible orderings of 
    the TAXA. The size of the random sample (default is 1000) can be set by adding an integer after 
    a comma to rand, e.g. rand,42. (All orderings with a smaller reliability than cutoff are written 
    to a text file with the suffix "_best.gr".)

    If all is used, then the reliability score is calculated for all possible permutations of the TAXA 
    (Note that for more than 8 TAXA this can become rather time consuming!).

    Keep in mind that the qnet algorithm is O(n^4) both in time and memory requirements where n is the 
    number of taxa in the input alignment. This limits the number of taxa to around 120 for all 
    practical purposes. (Note: the current implemented maximum number of taxa is 338 which requires 
    about 30GB of memory!)

    Finally, a particular cyclic ordering can be specified by a comma-separated list of TAXA indices in 
    the range [0, NumberOfTAXA[ (no spaces are allowed), e.g. 3,0,4,1,2 as ordering for the 5 TAXA in the
    input MSA. 
    
-r, -*reorder
    Reorder MSA only. The reliability score is not calculated. The reordered MSA is printed to stdout. 
-*shuffles INT
    Perform INT random shufflings per column of the MSA. 
-s, -*silent
    Suppress the printing of progress information to stderr. 
-*smooth INT
    Calculate a running average over the reliability score of INT columns and use these smoothed values to remove unreliable columns from the MSA. 
-*seqtype D|P|R
    Set sequence type of input MSA to DNA (the default), Protein, or RNA. This information is used by NeighborNet during distance matrix calculation. 
-v, -*verbose
    Increase the verbosity level.

 
REFERENCES

Matsumoto, Makoto (1998) Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation 8(1):3-30

Bryant, David and Moulton, Vincent (2004) Neighbor-Net: An Agglomerative Method for the Construction of Phylogenetic Networks. Mol. Biol. Evol. 21:255-265

Gruenewald Stefan and Forslund Kristoffer and Dress Andreas WM and Moulton Vincent (2007) QNet: an agglomerative method for the construction of phylogenetic networks from weighted quartets. Mol Biol Evol, 24:532-538.

If you use this program in your work you might want to cite:

Dress, Andreas WM and Flamm, Christoph and Fritzsch, Guido and Gruenewald, Stefan and Kruspe, Matthias and Prohaska, Sonja J and Stadler Peter F (2008) Identification of Homoplastic Characters in Multiple Sequence Alignments. Alg Mol Biol, 3:7  
VERSION
This man page documents version 1.5.12 of noisy.  
AUTHORS
  

 -->
	
<!--  submission on Expanse: the invocation line and any needed thread specification  -->
					<parameter ismandatory="1" ishidden="1" type="String">
						<name>noisy_comet</name>
						<attributes>
							<format>
								<language>perl</language>		
								<code>"<![CDATA[/expanse/projects/ngbt/opt/expanse/noisy/Noisy-1.5.12/bin/noisy]]>"</code>
							</format>
							<group>0</group>
						</attributes>
					</parameter>							
					
<!-- this section defines the file scheduler.conf that accompanies the command line to the TG resource. It instructs the machine how to run the job.  --> 
	
<!-- 1 shared node and 1 thread for all jobs  -->									
				<parameter type="String" ishidden="1" >
					<name>number_nodes</name>
					<attributes>
						<group>2</group>
						<paramfile>scheduler.conf</paramfile>
						<format>
							<language>perl</language>
							<code>
				 "threads_per_process=1\\n" .
				 "node_exclusive=0\\n" .
				 "mem=15G\\n" .
				 "nodes=1\\n"
							</code>
						</format>
					</attributes>
				</parameter>
<!-- end number of nodes  -->	

<!-- 1 shared node and 1 thread  									
				<parameter type="String" ishidden="1" >
					<name>number_nodes2</name>
					<attributes>
						<precond>
							<language>perl</language>
							<code>$more_memory</code>
						</precond>
						<group>2</group>
						<paramfile>scheduler.conf</paramfile>
						<format>
							<language>perl</language>
							<code>
									"nodes=1\\n" .
									"node_exclusive=1\n" .
									"threads_per_process=1\\n"
							</code>
						</format>
					</attributes>
				</parameter>-->
<!-- end number of nodes  -->	

<!-- input file specification -->
<!-- the input file to be operated on ends the command line -->
		<parameter issimple="1" ismandatory="1" isinput="1" type="Sequence">
			<name>infile</name>
			<attributes>
				<prompt>Input File (AFA format)</prompt>
				<format>
					<language>perl</language>
					<code>"input.afa"</code>
				</format>
				<group>3</group>
<!-- this file designator is meant to come at the end of the command string; its group is currently set to 3 -->
				<filenames>input.afa</filenames>
			</attributes>
		</parameter>
	
		<parameter ishidden="1" type="Results">
			<name>all_results</name>
			<attributes>
				<filenames>*</filenames>
			</attributes>
		</parameter>
		
<!-- This section provides visible queries that help configure the interface  -->

<!-- this sets the run time -->
				<parameter type="Float" issimple="1" ismandatory="1">
					<name>runtime</name> 
					<attributes>
						<group>1</group>
						<prompt>Maximum Hours to Run (up to 168 hours)</prompt>
						<paramfile>scheduler.conf</paramfile>
						<vdef>
							<value>0.5</value>
						</vdef>
						<ctrls>
							<ctrl>
								<message>The maximum hours to run must be 168 or less</message>
								<language>perl</language>
								<code>$runtime &gt; 168.0</code>
							</ctrl>
							<ctrl>
								<message>The maximum hours to run must be at least 0.05</message>
								<language>perl</language>
								<code>$runtime &lt; 0.05</code>
							</ctrl>
						</ctrls>
						<format>
							<language>perl</language>
							<code>"runhours=$value\\n"</code>
						</format>
						<!-- provide feedback on number of cpu hrs to be consumed; all runs are the same, but this must be keyed to a visible param, so here we make it conditional on a non-zero run time. -->
						<warns>
						 <warn>
								<message>The job will run on 1 processor as configured. If it runs for the entire configured time, it will consume $runtime cpu hours</message>
								<language>perl</language>
								<code>$runtime != 0</code>
							</warn>
						</warns>
						<comment>
<value>Estimate the maximum time your job will need to run. We recommend testing initially with a run of 0.5 h or less, because jobs set for 0.5 h or less dependably run immediately in the "debug" queue. 
Once you are sure the configuration is correct, you can then increase the time. Jobs set for more than 0.5 h are submitted to the "normal" queue, where jobs configured for one or a few hours may
	run sooner than jobs configured for the full 168 hours. 
</value>
						</comment>
					</attributes>
				</parameter>

<!-- -*cutoff FLOAT
    Set the lower bound of the reliability score for an alignment column to FLOAT. Columns with a score below FLOAT are removed from the output alignment. The name of the output MSA is constructed from the base name of the input MSA by adding the post fix _out.fas 
 -->

				<parameter issimple="0"  type="Float">
					<name>specify_cutoff</name>
					<attributes>
						<group>5</group>
						<prompt>Set the lower bound of the reliability score</prompt>
						<format>
							<language>perl</language>
							<code>defined $value ? "--cutoff $value":""</code>
						</format>
						<comment>
							<value>Set the lower bound of the reliability score for an alignment column to FLOAT. Columns with a score below FLOAT are removed from the output alignment. The name of the output MSA is constructed from the base name of the input MSA by appending the suffix _out.fas.</value>
						</comment>
					</attributes>
				</parameter>	
			
<!-- -*distance HAMMING|GTR   Set distance calculation of NeighborNet to HAMMING or GTR -->
			<parameter type="Excl">
					<name>specify_distcalc</name>
					<attributes>
						<prompt>Set distance calculation of NeighborNet (--distance)</prompt>
						<vlist>
							<value>HAMMING</value>
							<label>HAMMING</label>
							<value>GTR</value>
							<label>GTR</label>
						</vlist>
						<vdef>
							<value>HAMMING</value>
						</vdef>
						<format>
							<language>perl</language>
							<code>" --distance $value "</code>
						</format>
						<group>6</group>
						<comment>
							<value>Set distance calculation of NeighborNet to HAMMING or GTR</value>
						</comment>
					</attributes>
				</parameter>

<!-- -*matrix FILE
    Read distance matrix used by NeighborNet to generate the cyclic order from 
    FILE instead of letting NeighborNet calculate the distance matrix by one 
    of the methods given to option -*distance.  -->				
				<parameter type="InFile">
					<name>distance_matrix</name>
					<attributes>
						<prompt>Select distance matrix file for NeighborNet (--matrix)</prompt>
						<precond>
							<language>perl</language>
							<code>!defined $specify_distcalc</code>
						</precond>
						<format>
							<language>perl</language>
							<code> defined $value ? " --matrix distance_matrix.txt":""</code>
						</format>
						<group>7</group>
						<ctrls>
							<ctrl>
								<message>The matrix file is not compatible with using the --distance option</message>
								<language>perl</language>
								<code>defined $distance_matrix &amp;&amp; defined $specify_distcalc</code>
							</ctrl>
						</ctrls>
						<filenames>distance_matrix.txt</filenames>
						<comment>
						<value>Read the distance matrix used by NeighborNet to generate the cyclic order from 
    FILE instead of letting NeighborNet calculate the distance matrix by one 
    of the methods given to option --distance.
						</value>
						</comment>
					</attributes>
				</parameter>

<!-- -*missing STRING
    Each character of STRING is treated as missing data and is removed from a column before changes between character states are calculated. -->
				<parameter issimple="0"  type="String">
					<name>specify_string</name>
					<attributes>
						<prompt>Treat these character(s) as missing data (--missing)</prompt>
						<format>
							<language>perl</language>
							<code>defined $value ? "--missing $value":""</code>
						</format>
						<vdef>
							<value>N</value>
						</vdef>
						<group>8</group>
						<comment>
							<value>Each character of STRING is treated as missing data and is removed from a column before changes between character states are calculated.</value>
						</comment>
					</attributes>
				</parameter>
<!--  -*nogap
    Add the gap symbol to the set of missing characters. -->
					<parameter type="Switch">
					<name>set_nogap</name>
					<attributes>
						<prompt>Add the gap symbol to the set of missing characters (--nogap)</prompt>
						<format>
							<language>perl</language>
							<code>($value) ? "--nogap":""</code>
						</format>
						<group>9</group>
						<comment>
							<value>Add the gap symbol to the set of missing characters.</value>
						</comment>
					</attributes>
				</parameter>    
    
<!-- -*noconstant
    Suppress constant columns in the output MSA. -->
    			<parameter type="Switch">
					<name>suppress_constant</name>
					<attributes>
						<prompt>Suppress constant columns in the output MSA. (--noconstant)</prompt>
						<format>
							<language>perl</language>
							<code>($value) ? "--noconstant":""</code>
						</format>
						<group>10</group>
						<comment>
							<value>Suppress constant columns in the output MSA.</value>
						</comment>						
					</attributes>
				</parameter>   
    
<!-- 
-*ordering nnet|qnet|rand[,INT]|all|INT(,INT)*
    Set the method to calculate the cyclic order to one of the two major methods: NeighborNet (the default) or QNet.

    With rand, the reliability score is calculated for a random sample of all possible orderings of 
    the TAXA. The size of the random sample (default is 1000) can be set by adding an integer after 
    a comma to rand, e.g. rand,42. (All orderings with a smaller reliability than cutoff are written 
    to a text file with the suffix "_best.gr".)

    If all is used, then the reliability score is calculated for all possible permutations of the TAXA 
    (Note that for more than 8 TAXA this can become rather time consuming!).

    Keep in mind that the qnet algorithm is O(n^4) both in time and memory requirements where n is the 
    number of taxa in the input alignment. This limits the number of taxa to around 120 for all 
    practical purposes. (Note: the current implemented maximum number of taxa is 338 which requires 
    about 30GB of memory!)

    Finally, a particular cyclic ordering can be specified by a comma-separated list of TAXA indices in 
    the range [0, NumberOfTAXA[ (no spaces are allowed), e.g. 3,0,4,1,2 as ordering for the 5 TAXA in the
    input MSA. -->
    
				<parameter type="Excl"  issimple="0">
					<name>specify_ordering</name>
					<attributes>
						<prompt>Set the method to calculate the cyclic order (--ordering)</prompt>
						<vlist>
							<value>nnet</value>
							<label>nnet</label>
							<value>qnet</value>
							<label>qnet</label>
							<value>rand</value>
							<label>rand</label>
							<value>all</value>
							<label>all</label>
							<value>INT</value>
							<label>INT</label>
						</vlist>
						<flist>
							<value>nnet</value>
							<code>"--ordering nnet"</code>
							<value>qnet</value>
							<code>"--ordering qnet"</code>
							<value>rand</value>
							<code>"--ordering rand,$specify_randint"</code>
							<value>all</value>
							<code>"--ordering all"</code>
							<value>INT</value>
							<code>"--ordering $specify_intint "</code>
						</flist>
						<group>11</group>
						<warns>
							<warn>
								<message>If the number of taxa for the all option is greater than 8, the run can become quite lengthy</message>
								<language>perl</language>
								<code>$specify_ordering eq "all" &amp;&amp; $specify_ntaxa &gt; 8 </code>
							</warn>
							<warn>
								<message>More than 120 taxa can cause you to run out of memory. Consider the more memory option</message>
								<language>perl</language>
								<code>$specify_ntaxa &gt; 120 </code>
							</warn>
							<warn>
								<message>Sorry, Noisy cannot handle more than 338 taxa</message>
								<language>perl</language>
								<code>$specify_ntaxa &gt; 338 </code>
							</warn>
						</warns>
						<comment>
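<value>Set the method used to calculate the cyclic order: NeighborNet (nnet, the default) or QNet (qnet). With rand, the reliability score is calculated for a random sample of all possible orderings of the taxa (default sample size is 1000). With all, the reliability score is calculated for all possible permutations of the taxa; for more than 8 taxa this can become rather time consuming. The qnet algorithm is O(n^4) in both time and memory, where n is the number of taxa in the input alignment, which limits the number of taxa to around 120 for all practical purposes. With INT, a particular cyclic ordering is given as a comma-separated list of taxon indices.</value>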
						</comment>
					</attributes>
				</parameter>
				
				<parameter type="Integer">
					<name>specify_randint</name>
					<attributes>
						<prompt>Specify the size of the random sample of orderings (rand)</prompt>
						<precond>
							<language>perl</language>
							<code>$specify_ordering eq "rand" </code>
						</precond>
						<group>12</group>
						<ctrls>
							<ctrl>
								<message>Please enter an integer for the size of the random sample</message>
								<language>perl</language>
								<code>$specify_ordering eq "rand" &amp;&amp; !defined $specify_randint</code>
							</ctrl>
						</ctrls>
						<comment>
						<value>With rand, the reliability score is calculated for a random sample of all possible orderings of 
    the TAXA. The size of the random sample (default is 1000) can be set by adding an integer after 
    a comma to rand, e.g. rand,42. (All orderings with a smaller reliability than cutoff are written 
    to a text file with the suffix "_best.gr".)</value>
						</comment>
					</attributes>
				</parameter>

				<parameter type="String">
					<name>specify_intint</name>
					<attributes>
						<prompt>Specify a cyclic ordering (INT)</prompt>
						<precond>
							<language>perl</language>
							<code>$specify_ordering eq "INT"</code>
						</precond>
						<group>13</group>
						<ctrls>
							<ctrl>
								<message>Please enter a comma-separated list of taxon indices for the ordering</message>
								<language>perl</language>
								<code>$specify_ordering eq "INT" &amp;&amp; !defined $specify_intint</code>
							</ctrl>
						</ctrls>
						<comment>
<value>Specify a particular cyclic ordering as a comma-separated list of TAXA indices in the range 0 to NumberOfTAXA - 1 (no spaces are allowed), e.g. 3,0,4,1,2 as ordering for the 5 TAXA in the
    input MSA.</value>
						</comment>
					</attributes>
				</parameter>				
    
<!--  -*shuffles INT
    Perform INT random shufflings per column of the MSA. -->
    			<parameter type="Integer">
					<name>specify_shuffles</name>
					<attributes>
						<prompt>Specify number of random shufflings per column of the MSA (--shuffles)</prompt>
						<format>
							<language>perl</language>
							<code>(defined $value) ? "--shuffles $value":""</code>
						</format>
						<group>14</group>
						<comment>
<value>Perform INT random shufflings per column of the MSA.</value>
						</comment>
					</attributes>
				</parameter>
 
<!--  -*smooth INT
    Calculate a running average over the reliability score of INT columns and use these smoothed values to remove unreliable columns from the MSA. -->
     			<parameter type="Integer">
					<name>specify_smoothing</name>
					<attributes>
						<prompt>Calculate a running average over the reliability score of x columns (--smooth)</prompt>
						<format>
							<language>perl</language>
							<code>(defined $value) ? "--smooth $value":""</code>
						</format>
						<group>15</group>
						<comment>
<value>Calculate a running average over the reliability score of INT columns and use these smoothed values to remove unreliable columns from the MSA.</value>
						</comment>
					</attributes>
				</parameter>   
				
<!--  -*seqtype D|P|R
    Set sequence type of input MSA to DNA (the default), Protein, or RNA. This information is used by NeighborNet during distance matrix calculation. -->
    				<parameter type="Excl"  issimple="0">
					<name>specify_datatype</name>
					<attributes>
						<prompt>Set sequence type of input MSA (--seqtype)</prompt>
						<vlist>
							<value>D</value>
							<label>DNA</label>
							<value>P</value>
							<label>Protein</label>
							<value>R</value>
							<label>RNA</label>
						</vlist>
						<vdef>
							<value>D</value>
						</vdef>
							<format>
							<language>perl</language>
							<code>"--seqtype $value"</code>
						</format>
						<group>16</group>
						<comment>
<value>Set sequence type of input MSA to DNA (the default), Protein, or RNA. This information is used by NeighborNet during distance matrix calculation.</value>
						</comment>
					</attributes>
				</parameter>
				
<!--  -v, -*verbose Increase the verbosity level. -->
    			<parameter type="Switch">
					<name>increase_verbosity</name>
					<attributes>
						<prompt>Increase the verbosity level (--verbose)</prompt>
						<format>
							<language>perl</language>
							<code>($value) ? "--verbose":""</code>
						</format>
						<group>17</group>
						<comment>
							<value>Provide more verbose output.</value>
						</comment>
					</attributes>
				</parameter>   

</parameters>
</pise>


