Frontiers in Phylogenetics Symposia
Frontiers in Phylogenetics 2014 Symposium Available to Watch On-line
The Frontiers in Phylogenetics 4th Annual Symposium is available to watch on Ustream-SI in unedited form in three parts.
Part 1) http://www.ustream.tv/recorded/52713111 Opening (Michael Braun, John Kress, Guillermo Orti), Lacey Knowles, Kevin Kocot, Ingo Ebersberger…
Part 2) http://www.ustream.tv/recorded/52716590 Ingo Ebersberger continued, Derrick Zwickl, Dave Swofford…
Part 3) http://www.ustream.tv/recorded/52720049 Dave Swofford continued, Luay Nakhleh, Bastien Boussau, round table discussion.
An edited podcast of the event will be available on iTunes at a later date, to be announced.
PREVIOUS SYMPOSIA:
Frontiers in Phylogenetics Spring Symposium
Baird Auditorium, National Museum of Natural History
Washington, DC, May 20-21, 2013
Recordings from the Third Annual Spring Symposium, Genome-scale Phylogenetics, hosted by NMNH's Frontiers in Phylogenetics Program, are now available on iTunesU for free.
You can access the recordings using this link:
https://itunes.apple.com/us/itunes-u/frontiers-in-phylogenetics/id677376797?mt=10
"Genome-Scale Phylogenetics"
Monday May 20, 2013
9:30-9:35 Opening Remarks and Logistics
Michael Braun, Frontiers in Phylogenetics Program, NMNH
9:35-9:45 Introduction and Welcome to the Smithsonian
Eva Pell, Undersecretary for Science, Smithsonian Institution
9:45-10:30 My Students Could Do My Thesis in Five Minutes; How to Cope with the Next Generation
Rob DeSalle, Sackler Institute of Comparative Genomics, AMNH
10:30-11:00 Using Whole Genomes to Resolve the Avian Tree of Life
Erich Jarvis, Duke University Medical Center
11:00-11:30 Break
11:30-12:00 Molecular Phylogenies, Genomics and the Bacterial Species Concept
Margaret Riley, University of Massachusetts Amherst
12:00-12:30 Phylotranscriptomics to Bring the Understudied Ostracoda into the Fold
Todd Oakley, University of California Santa Barbara
12:30-14:00 Lunch Break
14:00-14:30 Evolution via the Grape Vine -- Insights from Transcriptome Sequence Data
Jun Wen, Department of Botany, NMNH
14:30-15:00 Genome-scale Phylogenetics of Rapid Adaptive Radiation: RAD Sequence Data Illuminates the History of Lake Victoria Cichlids
Catherine Wagner, Eawag, Swiss Federal Institute for Aquatic Science and Technology
15:00-15:30 Break
15:30-16:00 Shotgun in the Dark or a Rifle in the Daylight? The Case for Using Single Copy Orthologous Gene Capture in Phylogenetics
Gavin Naylor, Hollings Marine Lab, College of Charleston and Medical University of South Carolina
16:00-16:30 Achieving phylogenomic nirvana: ultraconserved elements (UCEs) capture history at the species, population, and individual levels
Brant Faircloth, Department of Ecology and Evolutionary Biology, UCLA
16:30-17:00 Unsolved Challenges: Panel Discussion on Future Directions
On May 21, there will be several interactive discussion groups aimed at current issues and hurdles involved with executing phylogenetic research on the genome scale.
PREVIOUS SYMPOSIA
To access the 2011 Symposium, search 'Smithsonian nex gen' on iTunes U or click here:
https://itunes.apple.com/us/itunes-u/next-gen-sequencing/id419187633?mt=10
2012: "Sequence Alignment and Tree Estimation"
Frontiers in Phylogenetics Spring Symposium
Baird Auditorium, National Museum of Natural History
Washington, DC, Sunday May 20, 2012
8:30-9:30 Morning Beverage Service, Baird Auditorium Foyer
9:30-9:35 Opening Remarks and Logistics
Michael Braun, Frontiers in Phylogenetics Program, NMNH
9:35-9:45 Introduction and Welcome to the Smithsonian
Jonathan Coddington, Associate Director of Research and Collections, NMNH
9:45-10:30 An Overview of Multiple Sequence Alignment Methods
Kazutaka Katoh, IFReC, Osaka University, Japan; CBRC, AIST, Japan.
10:30-11:00 Phylogeny-aware Progressive Sequence Alignment
Ari Löytynoja, Institute of Biotechnology, University of Helsinki, Finland
11:00-11:30 Break
11:30-12:00 Bayesian Co-estimation of Alignment and Phylogeny
Ben Redelings, National Evolutionary Synthesis Center (NESCENT)
12:00-12:30 SATé: Simultaneous Alignment and Tree Estimation for Large Datasets
Tandy Warnow, University of Texas at Austin
12:30-14:00 Lunch Break
14:00-14:30 Phylogenomics Across the Green Plant Tree of Life
Jim Leebens-Mack, University of Georgia
14:30-15:00 Fast, Accurate Multiple Sequence/Structure Alignment Using MAFFTash
Daron Standley, Systems Immunology Lab, IFReC, Osaka University
15:00-15:30 Break
15:30-16:00 A Simple Insertion-Deletion Mixture Model for Phylogenetic Inference
Derrick Zwickl, University of Arizona
16:00-16:30 Impact of DNA Sequence Alignment on Estimates of the Avian Tree of Life
Michael Braun, National Museum of Natural History
16:30-17:00 Unsolved Challenges: Panel Discussion on Future Directions
17:00-19:00 Reception in Executive Conference Room, NMNH
ABSTRACTS
An Overview of Multiple Sequence Alignment Methods
Kazutaka Katoh, IFReC, Osaka University, Japan; CBRC, AIST, Japan
Multiple sequence alignment (MSA) is one of the oldest problems in bioinformatics, and over the last few decades it has been actively studied as a key technology in sequence analysis. The reason why MSA is important is that its quality greatly affects the success of downstream analyses, such as phylogeny inference, protein structure prediction, etc. To introduce this symposium, I review three basic MSA techniques, the progressive method, the iterative refinement method and the consistency-based method. Then, I discuss why MSA is so difficult and complicated. In my opinion, the difficulty comes from the fact that MSA is essentially a problem of evolution. We should consider how the sequences in an MSA evolved, but we cannot know the true evolutionary process behind present-day sequences. As a result, the correctness of an MSA is difficult to assess. Computer simulation is sometimes a powerful approach in such situations. However, the validity of simulation settings is always problematic and the results of simulation-based studies often depend strongly on parameter settings in both the simulation stage and the analysis stage. As an alternative approach, I show our assessment studies based on actual biological data. In agreement with recent findings by other groups, our studies reveal that different methods perform better for different purposes. I also discuss possible future directions of MSA methods for improving quality, speed and scalability.
Phylogeny-aware Progressive Sequence Alignment
Ari Löytynoja, Institute of Biotechnology, University of Helsinki, Finland
Widely-used multiple sequence alignment methods are based on progressive algorithms. These methods build the multiple alignment solution from pairwise alignments that are performed according to a guide phylogeny. A challenge of progressive approaches is that insertions cannot be distinguished from deletions in a comparison of two sequences but the two evolutionary events have very different effects on the resulting multiple alignment. We showed earlier that methods that do not differentiate the two event types make systematic errors in the alignment and these errors then bias downstream analyses. As a solution, we proposed a new phylogeny-aware algorithm that distinguishes insertions from deletions using outgroup information. PRANK, the method implementing our ideas, has been shown to perform exceptionally well in comparative analyses. However, the algorithm relies on information within the guide tree and PRANK should be used with care in phylogenetic analyses. To overcome the limitations, we are re-implementing the algorithm using graph representation of sequences. The first version of our new method, called PAGAN, is aimed for phylogeny-aware extension of existing sequence alignments with new data and has been applied to analyses of next-generation sequencing data and in metagenomic studies. The graph-based concepts are flexible and we aim to develop the new method towards co-estimation of alignment and phylogeny.
Bayesian Co-estimation of Alignment and Phylogeny
Benjamin Redelings, National Evolutionary Synthesis Center
Estimating homology between divergent sequences is problematic because of the large degree of uncertainty in the inferred multiple sequence alignments. When ignored, this uncertainty can lead to sequence-alignment errors that propagate through the inference pipeline, negatively affecting downstream estimates of phylogenetic trees, evolutionary parameters, and functional annotations. In order to obtain robust estimates in such cases, we propose to jointly estimate the alignment, phylogeny, and other parameters together in a Bayesian framework. This allows us to take advantage of information about the tree when inferring the alignment, without bias towards an external guide tree. It also allows us to estimate site properties, such as whether sites are under positive selection, without placing exaggerated confidence in a single alignment. Instead, the cloud of near-optimal alignments is considered, with each alignment weighted in proportion to its posterior probability. Initial test results based on simulated data suggest that Bayesian co-estimation is able to yield relatively accurate alignments on highly divergent sequences. The Bayesian paradigm yields not only accurate tree estimates and alignment estimates, but also yields measures of support for each clade on the tree, and for each pair of residues that may, or may not, be aligned. Future work will enable an accurate tree and alignment estimate to be obtained relatively quickly, while precise measures of support will require more computing resources.
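As an illustration of the per-pair support measure mentioned in the abstract above, the following is a minimal sketch, not the speaker's implementation. It assumes we already hold a list of alignments sampled from the posterior (for example, MCMC samples), so that counting how often two residues share a column approximates the posterior probability that they are aligned. The function names aligned_pairs and pair_support, and the toy sample data, are hypothetical.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' is a gap).
    Each pair is ((seq_i, pos_a), (seq_j, pos_b)), where the positions
    index the ungapped sequences.
    """
    next_pos = [0] * len(alignment)  # next ungapped position per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, next_pos[s]))
                next_pos[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_support(sampled_alignments):
    """Fraction of sampled alignments in which each residue pair is aligned.

    Assumes the samples are equally weighted draws from the posterior.
    """
    counts = Counter()
    for aln in sampled_alignments:
        counts.update(aligned_pairs(aln))
    n = len(sampled_alignments)
    return {pair: c / n for pair, c in counts.items()}


# Toy data (hypothetical): three sampled alignments of the same two sequences.
samples = [
    ["AC-G", "ACTG"],
    ["A-CG", "ACTG"],
    ["AC-G", "ACTG"],
]
support = pair_support(samples)
```

In this toy sample the two end residues are aligned in every draw, while the placement of the middle residue of the first sequence is uncertain, which is exactly the kind of uncertainty a single fixed alignment would hide.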
SATe: Simultaneous Alignment and Tree Estimation
Tandy Warnow, Department of Computer Sciences, The University of Texas at Austin
Molecular sequences evolve under processes that include substitutions as well as insertions and deletions (jointly called "indels"), along with other mechanisms (e.g., duplications and rearrangements). The inference of the evolutionary history of these sequences has thus been performed in two stages: the first estimates the alignment on the sequences, and the second estimates the tree given that alignment. While such methods seem to work well on relatively small datasets, these two-stage approaches can produce highly incorrect trees and alignments when applied to large datasets, or ones that evolve with many indels. Co-estimation methods based upon statistical models of evolution that include indel events have the potential to produce trees of much greater accuracy, but are computationally too intensive to use on very large datasets.
In this talk, I will present SATe (Liu et al., Science 2009, and Liu et al., Systematic Biology 2012), a method for co-estimation of alignments and trees. SATe iterates between tree estimation and alignment estimation, and uses a novel divide-and-conquer technique to estimate alignments given the current tree. Our studies on both real and simulated data show that SATe produces much more accurate trees than the current best two-phase methods, and can do so fairly efficiently. Our current research shows dramatic improvements on datasets with up to 27K sequences.
Time permitting, I will also show results of two new methods: SEPP (SATe-enabled Phylogenetic Placement) and DACTAL (divide-and-conquer trees almost without alignments). SEPP can be used to place short fragmentary sequences (as small as 60 bp) into an existing tree, and DACTAL can be used to estimate very large trees with tens of thousands of leaves without needing to estimate a multiple sequence alignment on the full dataset.
For more information on SATe, SEPP, and DACTAL, see http://www.cs.utexas.edu/users/phylo/software. SATe is joint work with Kevin Liu, Siavash Mirarab, Mark Holder, Serita Nelesen, and Randy Linder; SEPP is joint work with Siavash Mirarab and Nam Nguyen; and DACTAL is joint work with Serita Nelesen, Kevin Liu, Li-San Wang, and Randy Linder.
Phylogenomics Across the Green Plant Tree of Life
Jim Leebens-Mack, Department of Plant Biology, University of Georgia
For more than a billion years of evolutionary history, green plants (Viridiplantae) have had a profound influence on our global environment and biota. This evolutionary history is marked by exquisite innovations including multiple origins of multicellularity, colonizations of marine, freshwater and terrestrial habitats, adaptation to extreme environments, embryogenesis, the development of vascular systems, the origin of the seed and the origin of the flower. This talk will describe how we are using a small set of annotated plant genome sequences and a massive set of transcriptome data from over 1000 green plant species to develop a comprehensive understanding of plant relationships and gene family diversity. Challenges for phylogenetic analysis of these data include assessment of orthology, long branches, missing data and among-lineage variation in nucleotide frequencies. I will present a data analysis pipeline for organizing and conducting phylogenetic analyses on large transcriptome data sets and discuss strategies for diagnosing artifacts due to model misspecification in analyses of large yet incomplete data matrices. Preliminary results of supermatrix and supertree analyses of over 80 taxa show a high level of agreement among trees estimated using a variety of data partitions and inference methods. However, inferences for key regions of the green plant phylogeny – including the identity of the algal lineage most closely related to land plants, relationships among liverworts, hornworts, mosses and vascular plants, and the position of the Gnetales within the gymnosperms – vary depending on method of analysis. I will discuss strategies for elucidating the basis of this variation and developing a robust phylogenetic context for comparative analyses across the green plant phylogeny.
Fast and Accurate Multiple Sequence/Structure Alignment Using MAFFTash
Daron M. Standley, Systems Immunology Lab, Immunology Frontier Research Center (IFReC), Osaka University
It is well known that protein structure is more conserved than sequence. For example, even protein domains with undetectable sequence similarity can share the same basic fold. It is no surprise then that structural information has been utilized by a number of multiple sequence alignment (MSA) methods to improve accuracy. Our own efforts along these lines led to the development of the MAFFTash program, which combines ASH pairwise structural alignments with MAFFT MSA calculations. Structural domain decomposition followed by all-against-all domain alignment is used to identify structurally equivalent residue positions, which are then incorporated as restraints in MAFFT. To assess MAFFTash accuracy we constructed homology models based on pairwise alignments extracted from the MSA, and measured their deviations from experimentally determined structures. Using this approach, which avoids complications arising from ambiguity in MSA accuracy, we demonstrate that inclusion of structural information leads to an improvement in alignment accuracy when the identity of the input sequences is below 50%. However, the corresponding increase in run time can be substantial, as the computational complexity is proportional to N^2, where N is the number of structural domains to be aligned. To circumvent the overhead of computing structural alignments locally, we have developed a structural domain alignment portal that can be accessed remotely as a RESTful web service. Using this portal, run-times were reduced by an order of magnitude, as each structural alignment is replaced by a database query. Importantly, MAFFT users will not need to install additional third-party software to take advantage of these improvements.
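To make the quadratic cost in the abstract above concrete, here is a small sketch of why all-against-all comparison of N structural domains grows in proportion to N squared: each unordered pair is aligned once, giving N(N-1)/2 pairwise alignments. This is only a counting illustration and does not call MAFFTash or ASH; the function names and the domain labels are hypothetical.

```python
from itertools import combinations


def all_against_all(domains):
    """List every unordered pair of domains that would need a pairwise alignment."""
    return list(combinations(domains, 2))


def pair_count(n):
    """Number of unordered pairs among n domains: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical domain labels, used only to show the pairing.
pairs = all_against_all(["d1", "d2", "d3", "d4"])

# Growing N tenfold grows the number of pairwise alignments roughly a hundredfold.
for n in (10, 100, 1000):
    print(n, pair_count(n))
```

Replacing each of these pairwise computations with a lookup, as the abstract describes for the web portal, leaves the number of pairs unchanged but lowers the cost of each one.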
A Simple Insertion-Deletion Mixture Model for Phylogenetic Inference
Derrick Zwickl, University of Arizona
Multiple sequence alignment is typically the first step in the process of inferring phylogenetic trees from sequence data. Alignment compensates for sequence length variation due to insertion and deletion (indel) mutations that have occurred across lineages. These events leave traces that are shared by descendants and can be phylogenetically informative. The “gap” character state is used to represent the effects of insertion and deletion events. However, these indel events are nearly always ignored during the phylogenetic inference itself. We have modified an evolutionary model developed by Rivas and Eddy (2008) that models the insertion-deletion process in pre-aligned data matrices. This model allows the use of signal from indel events in addition to residue transitions to infer phylogenies. Our formulation of the model is nearly equivalent to that of Rivas and Eddy, but decomposes the aligned sequence matrix into two: one representing sequence transitions and one representing insertion and deletion events. This separation of the matrices is computationally more convenient, and allows for independent manipulation of the data types. We have implemented this model, termed DIMM (Dollo Indel Mixture Model), in the efficient maximum-likelihood inference software GARLI. We investigate its properties in inferring phylogenies, as well as its utility for purposes such as the detection of misaligned columns in a data matrix.
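The decomposition described in the abstract above can be pictured with a short sketch. It is one plausible encoding, not the DIMM or GARLI implementation: it assumes that the residue matrix keeps the characters and treats gaps as missing data ('?'), and that the indel matrix records presence ('1') or absence ('0') of a residue at each position. The function name decompose and both encodings are assumptions made for illustration.

```python
GAP = "-"
MISSING = "?"  # assumed coding for gaps in the residue matrix


def decompose(alignment):
    """Split an aligned matrix into a residue matrix and an indel matrix.

    `alignment` is a list of equal-length gapped strings.
    Returns (residues, indels):
      residues: same rows with gaps recoded as missing data
      indels:   '1' where a residue is present, '0' where there is a gap
    """
    residues = ["".join(MISSING if ch == GAP else ch for ch in row)
                for row in alignment]
    indels = ["".join("0" if ch == GAP else "1" for ch in row)
              for row in alignment]
    return residues, indels


# Toy alignment (hypothetical data).
aln = ["ACG-T", "A-GTT", "ACGTT"]
residues, indels = decompose(aln)
```

Keeping the two matrices separate means either one can be filtered, reweighted or analysed on its own, which is the convenience the abstract points to.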
Impact of DNA Sequence Alignment on Estimates of the Avian Tree of Life
Michael J. Braun, National Museum of Natural History, Smithsonian Institution
Recent estimates of the phylogeny of birds based on molecular sequence data (e.g., Ericson et al. 2006, Hackett et al. 2008) differ dramatically from most previous estimates and potentially provide important new insights into avian relationships, morphological and ecological adaptation, and biogeography. However, several molecular estimates depend heavily on non-coding DNA elements that vary in length and must be aligned before analysis can proceed. In phylogenetics, such alignments have traditionally been adjusted manually, introducing the possibility of observer error or bias. Recent advances in multiple sequence alignment have resulted in computer algorithms that produce much improved alignments, allowing us to probe the impact of completely automated alignments on the inferred tree. We tested six automated alignment methods on the largest available molecular dataset (Hackett et al. 2008), and found that five of them produced alignments and trees very similar to each other and to the published tree based on the manually curated alignment. The outlier was ClustalW, an older method that produced a slightly divergent tree that was nevertheless much more similar to the tree based on manual alignment than to any previous estimate. We also tested the effect of starting trees on an automated method (SATé) that iterates between alignment and tree estimation. Whether we started with very divergent trees from previous estimates or even with randomly generated trees, we always recovered trees very similar to the published tree based on the manually curated alignment.
Frontiers in Phylogenetics Spring Symposium
Baird Auditorium, National Museum of Natural History
Washington, DC, May 20-21, 2013
Recordings from the Third Annual Spring Symposium, Genome-scale Phylogenetics, hosted by NMNH's Frontiers in Phylogenetics Program are now available on iTunesU for free.
You can access the recordings using this link:
https://itunes.apple.com/us/itunes-u/frontiers-in-phylogenetics/id677376797?mt=10
"Genome-Scale Phylogenetics"
Monday May 20, 2013
9:30-9:35 Opening Remarks and Logistics
Michael Braun, Frontiers in Phylogenetics Program, NMNH
9:35-9:45 Introduction and Welcome to the Smithsonian
Eva Pell, Undersecretary for Science, Smithsonian Institution
9:45-10:30 My Students Could Do My Thesis in Five Minutes; How to Cope with the Next Generation
Rob DeSalle, Sackler Institute of Comparative Genomics, AMNH
10:30-11:00 Using Whole Genomes to Resolve the Avian Tree of LifeErich Jarvis, Duke University Medical Center
11:00-11:30 Break
11:30-12:00 Molecular Phylogenies, Genomics and the Bacterial Species Concept
Margaret Riley, University of Massachusetts Amherst
12:00-12:30 Phylotranscriptomics to Bring the Understudied Ostracoda into the Fold
Todd Oakley, University of California Santa Barbara
12:30-14:00 Lunch Break
14:00-14:30 Evolution via the Grape Vine -- Insights from Transcriptome Sequence DataJun Wen, Department of Botany, NMNH
14:30-15:00 Genome-scale Phylogenetics of Rapid Adaptive Radiation: RAD Sequence Data
Illuminates the History of Lake Victoria Cichlids
Catherine Wagner, Eawag, Swiss Federal Institute for Aquatic Science and Technology
15:00-15:30 Break
15:30-16:00 Shotgun in the Dark or a Rifle in the Daylight? The Case for Using Single Copy Orthologous Gene Capture in Phylogenetics
Gavin Naylor, Hollings Marine Lab, College of Charleston and Medical University of South Carolina
16:00-16:30 Achieving phylogenomic nirvana: ultraconserved elements (UCEs) capture history at the species, population, and individual levels
Brant Faircloth, Department of Ecology and Evolutionary Biology, UCLA
16:30-17:00 Unsolved Challenges- Panel Discussion on Future Directions
On May 21, there will be several interactive discussion groups aimed at current issues and hurdles involved with executing phylogenetic research on the genome scale.
PREVIOUS SYMPOSIA
To access the 2011 Symposium, search 'Smithsonian nex gen' on iTunes U or click here:
https://itunes.apple.com/us/itunes-u/next-gen-sequencing/id419187633?mt=10
2012: "Sequence Alignment and Tree Estimation"
Frontiers in Phylogenetics Spring Symposium
Baird Auditorium, National Museum of Natural History
Washington, DC, Sunday May 20, 2012
8:30-9:30 Morning Beverage Service, Baird Auditorium Foyer
9:30-9:35 Opening Remarks and Logistics
Michael Braun, Frontiers in Phylogenetics Program, NMNH
9:35-9:45 Introduction and Welcome to the Smithsonian
Jonathan Coddington, Associate Director of Research and Collections, NMNH
9:45-10:30 An Overview of Multiple Sequence Alignment Methods
Kazutaka Katoh, IFReC, Osaka University, Japan; CBRC, AIST, Japan.
10:30-11:00 Phylogeny-aware Progressive Sequence Alignment
Ari Löytynoja, Institute of Biotechnology, University of Helsinki, Finland
11:00-11:30 Break
11:30-12:00 Bayesian Co-estimation of Alignment and Phylogeny
Ben Redelings, National Evolutionary Synthesis Center (NESCENT)
12:00-12:30 SATé: Simultaneous Alignment and Tree Estimation for Large Datasets
Tandy Warnow, University of Texas at Austin
12:30-14:00 Lunch Break
14:00-14:30 Phylogenomics Across the Green Plant Tree of Life
Jim Leebens-Mack, University of Georgia
14:30-15:00 Fast, Accurate Multiple Sequence/Structure Alignment Using MAFFTash
Daron Standley, Systems Immunology Lab, IFReC, Osaka University
15:00-15:30 Break
15:30-16:00 A Simple Insertion-Deletion Mixture Model for Phylogenetic Inference
Derrick Zwickl, University of Arizona
16:00-16:30 Impact of DNA Sequence Alignment on Estimates of the Avian Tree of Life
Michael Braun, National Museum of Natural History
16:30-17:00 Unsolved Challenges- Panel Discussion on Future Directions
17:00-19:00 Reception in Executive Conference Room, NMNH
ABSTRACTS
An Overview of Multiple Sequence Alignment Methods
Kazutaka Katoh, IFReC, Osaka University, Japan; CBRC, AIST, JapanMultiple sequence alignment (MSA) is one of the oldest problems in bioinformatics, and over the last few decades it has been actively studied as a key technology in sequence analysis. The reason why MSA is important is that its quality greatly affects the success of downstream analyses, such as phylogeny inference, protein structure prediction, etc. To introduce this symposium, I review three basic MSA techniques, the progressive method, the iterative refinement method and the consistency-based method. Then, I discuss why MSA is so difficult and complicated. In my opinion, the difficulty comes from the fact that MSA is essentially a problem of evolution. We should consider how the sequences in an MSA evolved, but we cannot know the true evolutionary process behind present-day sequences. As a result, the correctness of an MSA is difficult to assess. Computer simulation is sometimes a powerful approach in such situations. However, the validity of simulation settings is always problematic and the results of simulation-based studies often depend strongly on parameter settings in both the simulation stage and the analysis stage. As an alternative approach, I show our assessment studies based on actual biological data. In agreement with recent findings by other groups, our studies reveal that different methods perform better for different purposes. I also discuss possible future directions of MSA methods for improving quality, speed and scalability.
Phylogeny-aware Progressive Sequence Alignment
Ari Löytynoja, Institute of Biotechnology, University of Helsinki, FinlandWidely-used multiple sequence alignment methods are based on progressive algorithms. These methods build the multiple alignment solution from pairwise alignments that are performed according to a guide phylogeny. A challenge of progressive approaches is that insertions cannot be distinguished from deletions in a comparison of two sequences but the two evolutionary events have very different effects on the resulting multiple alignment. We showed earlier that methods that do not differentiate the two event types make systematic errors in the alignment and these errors then bias downstream analyses. As a solution, we proposed a new phylogeny-aware algorithm that distinguishes insertions from deletions using outgroup information. PRANK, the method implementing our ideas, has been shown to perform exceptionally well in comparative analyses. However, the algorithm relies on information within the guide tree and PRANK should be used with care in phylogenetic analyses. To overcome the limitations, we are re-implementing the algorithm using graph representation of sequences. The first version of our new method, called PAGAN, is aimed for phylogeny-aware extension of existing sequence alignments with new data and has been applied to analyses of next-generation sequencing data and in metagenomic studies. The graph-based concepts are flexible and we aim to develop the new method towards co-estimation of alignment and phylogeny.
Bayesian Co-estimation of Alignment and Phylogeny
Benjamin Redelings, National Evolutionary Synthesis CenterEstimating homology between divergent sequences is problematic because of the large degree of uncertainty in the inferred multiple sequence alignments. When ignored, this uncertainty can lead to sequence-alignment errors that propagate through the inference pipeline, negatively affecting downstream estimates of phylogenetic trees, evolutionary parameters, and functional annotations. In order to obtain robust estimates in such cases, we propose to jointly estimate the alignment, phylogeny, and other parameters together in a Bayesian framework. This allows us to take advantage of information about the tree when inferring the alignment, without bias towards an external guide tree. It also allows us to estimate site properties, such as whether sites are under positive selection, without placing exaggerated confidence in a single alignment. Instead, the cloud of near-optimal alignments is considered, with each alignment weighted in proportion to its posterior probability. Initial test results based on simulated data suggest that Bayesian co-estimation is able to yield relatively accurate alignments on highly divergent sequences. The Bayesian paradigm yields not only accurate tree estimates and alignment estimates, but also yields measures of support for each clade on the tree, and for each pair of residues that may, or may not, be aligned. Future work will enable an accurate tree and alignment estimate to be obtained relatively quickly, while precise measures of support will require more computing resources.
SATe: Simultaneous Alignment and Tree Estimation
Tandy Warnow, Department of Computer Sciences, The University of Texas at AustinMolecular sequences evolve under processes that include substitutions, insertions, and deletions (jointly called "indels"), as well as other mechanisms (e.g., duplications and rearrangements). The inference of the evolutionary history of these sequences has thus been performed in two stages: the first estimates the alignment on the sequences, and the second estimates the tree given that alignment. While such methods seem to work well on relatively small datasets, these two-stage approaches can produce highly incorrect trees and alignments when applied to large datasets, or ones that evolve with many indels. Co-estimation methods based upon statistical models of evolution that include indel events have the potential to produce trees of much greater accuracy, but are computationally too intensive to use on very large datasets.
In this talk, I will present SATe (Liu et al., Science 2009, and Liu et al., Systematic Biology 2012), a method for co-estimation of alignments and trees. SATe iterates between tree estimation and alignment estimation, and uses a novel divide-and-conquer technique to estimate alignments given the current tree. Our studies onboth real and simulated data, shows that SATe produces much more accurate trees than the current best two-phase methods, and can do so fairly efficiently. Our current research shows dramatic improvements on datasets with up to 27K sequences.
Time permitting, I will also show results of two new methods: SEPP (SATe-enabled Phylogenetic Placement) and DACTAL (divide-and-conquer trees almost without alignments). SEPP can be used to place short fragmentary sequences (as small as 60 bp) into an existing tree,
and DACTAL can be used to estimate very large trees with tens of thousands of leaves without needing to estimate a multiple sequence alignment on the full dataset.
For more information on SATe, SEPP, and DACTAL, see http://www.cs.utexas.edu/users/phylo/software. SATe is joint work with Kevin Liu, Siavash Mirarab, Mark Holder, Serita Nelesen, and Randy Linder; SEPP is joint work with Siavash Mirarab and Nam Nguyen; and DACTAL is joint work with Serita Nelesen, Kevin Liu, Li-San Wang, and Randy Linder.
Phylogenomics Across the Green Plant Tree of LifeJim Leebens-Mack, Department of Plant Biology, University of GeorgiaFor over more than a billion years of evolutionary history, green plants (Viridiplantae) have had a profound influence on our global environment and biota. This evolutionary history is marked by exquisite innovations including multiple origins of multicellularity, colonizations of marine, freshwater and terrestrial habitats, adaptation to extreme environments, embryogenesis, the development of vascular systems, the origin of the seed and the origin of the flower. This talk will describe how we are using a small set of annotated plant genome sequences and a massive set of transcriptome data from over 1000 green plant species to develop a comprehensive understanding of plant relationships and gene family diversity. Challenges for phylogentic analysis of these data include assessment of orthology, long branches, missing data and among-linage variation in nucleotide frequencies. I will present a data analysis pipeline for organizing and conducting phylogentic analyses on large transcriptome data sets and discuss strategies for diagnosing artifacts due to model misspecification in analyses of large yet incomplete data matrices. Preliminary results of supermatrix and supertree analyses of over 80 taxa show a high level of agreement among trees estimated using a variety of data partitions and inference methods. However, inferences for key regions of the green plant phylogeny – including the identity algal lineage most closely related to land plants, relationships among liverworts, hornworts, mosses and vascular plants, and the position of the Gnetales within the gymnosperms – vary depending on method of analysis. I will discuss strategies for elucidating the basis of this variation and developing a robust phylogenetic context for comparative analyses across the green plant phylogeny.
Fast and Accurate Multiple Sequence/Structure Alignment Using MAFFTash
Daron M. Standley, Systems Immunology Lab, Immunology Frontier Research Center (IFReC), Osaka University
It is well known that protein structure is more conserved than sequence. For example, even protein domains with undetectable sequence similarity can share the same basic fold. It is no surprise then that structural information has been utilized by a number of multiple sequence alignment (MSA) methods to improve accuracy. Our own efforts along these lines led to the development of the MAFFTash program, which combines ASH pairwise structural alignments with MAFFT MSA calculations. Structural domain decomposition followed by all-against-all domain alignment is used to identify structurally equivalent residue positions, which are then incorporated as restraints in MAFFT. To assess MAFFTash accuracy we constructed homology models based on pairwise alignments extracted from the MSA, and measured their deviations from experimentally determined structures. Using this approach, which avoids complications arising from ambiguity in MSA accuracy, we demonstrate that inclusion of structural information leads to an improvement in alignment accuracy when the identity of the input sequences is below 50%. However, the corresponding increase in run time can be substantial, as the computational complexity is proportional to N², where N is the number of structural domains to be aligned. To circumvent the overhead of computing structural alignments locally, we have developed a structural domain alignment portal that can be accessed remotely as a RESTful web service. Using this portal, run times were reduced by an order of magnitude, as each structural alignment is replaced by a database query. Importantly, MAFFT users will not need to install additional third-party software to take advantage of these improvements.
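As a rough illustration of why all-against-all domain alignment scales with the square of N, the sketch below simply enumerates the unordered domain pairs that would each need a structural alignment. The domain labels and function names are hypothetical and are not part of MAFFTash or ASH; only the pair counting is shown, not any alignment.

```python
from itertools import combinations


def all_against_all(domains):
    """Return every unordered pair of domains (hypothetical helper)."""
    return list(combinations(domains, 2))


def pair_count(n):
    """Number of unordered pairs among n domains: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Hypothetical domain identifiers, used only for illustration.
domains = ["d1", "d2", "d3", "d4"]
pairs = all_against_all(domains)
print(len(pairs), "pairwise structural alignments for", len(domains), "domains")

# The pair count grows quadratically with the number of domains.
for n in (10, 100, 1000):
    print(n, "domains ->", pair_count(n), "pairs")
```

Under this counting, a tenfold increase in the number of domains gives roughly a hundredfold increase in pairwise alignments, which is consistent with the motivation given above for replacing each alignment with a database query.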
A Simple Insertion-Deletion Mixture Model for Phylogenetic Inference
Derrick Zwickl, University of Arizona
Multiple sequence alignment is typically the first step in the process of inferring phylogenetic trees from sequence data. Alignment compensates for sequence length variation due to insertion and deletion (indel) mutations that have occurred across lineages. These events leave traces that are shared by descendants and can be phylogenetically informative. The “gap” character state is used to represent the effects of insertion and deletion events. However, these indel events are nearly always ignored during the phylogenetic inference itself. We have modified an evolutionary model developed by Rivas and Eddy (2008) that models the insertion-deletion process in pre-aligned data matrices. This model allows the use of signal from indel events in addition to residue transitions to infer phylogenies. Our formulation of the model is nearly equivalent to that of Rivas and Eddy, but decomposes the aligned sequence matrix into two: one representing sequence transitions and one representing insertion and deletion events. This separation of the matrices is computationally more convenient, and allows for independent manipulation of the data types. We have implemented this model, termed DIMM (Dollo Indel Mixture Model), in the efficient maximum-likelihood inference software GARLI. We investigate its properties in inferring phylogenies, as well as its utility for purposes such as the detection of misaligned columns in a data matrix.
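The decomposition of an aligned matrix into two matrices can be pictured with a minimal sketch. The encoding below (gaps recoded as "?" in the residue matrix, and 1/0 for residue presence/absence in the indel matrix) is an assumption chosen for illustration; it is not necessarily the encoding used by DIMM or GARLI, and the taxon names and sequences are made up.

```python
def split_alignment(alignment, gap="-"):
    """Split an alignment into a residue matrix and a binary indel matrix.

    alignment: dict mapping taxon name -> aligned sequence (equal lengths).
    Returns (residues, indels), two dicts with the same keys.
    """
    residues = {}
    indels = {}
    for name, seq in alignment.items():
        # Residue matrix: keep residues, treat gaps as missing data ("?").
        residues[name] = "".join("?" if c == gap else c for c in seq)
        # Indel matrix: 1 where a residue is present, 0 where there is a gap.
        indels[name] = "".join("0" if c == gap else "1" for c in seq)
    return residues, indels


# Hypothetical toy alignment.
alignment = {
    "taxonA": "ACG-TA",
    "taxonB": "AC--TA",
    "taxonC": "ACGGTA",
}

residues, indels = split_alignment(alignment)
for name in alignment:
    print(name, residues[name], indels[name])
```

Keeping the two matrices separate in this way means each can be filtered or analyzed on its own, for example by dropping the indel matrix to recover a conventional analysis that ignores gaps.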
Impact of DNA Sequence Alignment on Estimates of the Avian Tree of Life
Michael J. Braun, National Museum of Natural History, Smithsonian Institution
Recent estimates of the phylogeny of birds based on molecular sequence data (e.g., Ericson et al. 2006, Hackett et al. 2008) differ dramatically from most previous estimates and potentially provide important new insights into avian relationships, morphological and ecological adaptation, and biogeography. However, several molecular estimates depend heavily on non-coding DNA elements that vary in length and must be aligned before analysis can proceed. In phylogenetics, such alignments have traditionally been adjusted manually, introducing the possibility of observer error or bias. Recent advances in multiple sequence alignment have resulted in computer algorithms that produce much improved alignments, allowing us to probe the impact of completely automated alignments on the inferred tree. We tested six automated alignment methods on the largest available molecular dataset (Hackett et al. 2008), and found that five of them produced alignments and trees very similar to each other and to the published tree based on the manually curated alignment. The outlier was ClustalW, an older method that produced a slightly divergent tree that was nevertheless much more similar to the tree based on manual alignment than to any previous estimate. We also tested the effect of starting trees on an automated method (SATé) that iterates between alignment and tree estimation. Whether we started with very divergent trees from previous estimates or even with randomly generated trees, we always recovered trees very similar to the published tree based on the manually curated alignment.