GCG-2017 HybSeq




Overview of what Tamara, Sangtae, and Isabel are doing

Isabel: Kew

This is part of a big focus on the plant and fungal tree of life. They've been working with 1kp to design a bait kit for all angiosperms. The kit is designed (350 genes), and they are working with MyBaits to produce it. At this point they are working on one species per order, with deeper sampling in the Magnoliaceae. Library prep is about to start: some tests have been done, but the full test has not yet begun. The first test will be one sample per plant order. Cyperaceae will be one of the first families to go at the genus level, and Isabel has passed along 120 samples. Relationship with FSU: the goal here was to create a bait set that can be run internally, hoping to come to 30-50 pounds per sample. The kit will be available as a standardized kit. They are also hoping to get cpDNA and nrDNA opportunistically. Isabel has funding to do perhaps 100+ more samples.

Pipeline: still being worked on. MarkerMiner with modifications was used to design the baits.


Also doing HybSeq in Magnoliaceae, focused on single-copy genes identified from a preliminary genome assembly and transcriptome of one species. Obtained ca. 500 single-copy genes. Probes were made with MyBaits, and samples were then sent to RapidGenomics, which costs about $100 / sample (David Tank recommended). Probes for the chloroplast genome were also included, but not for the mitochondrial genome. Probe sequences: targeted 400 mb of sequence data. Tamara's bait set targeted 700 mb, but the final baitset is ca. 2.4 million bp, because the baits are tiled (they overlap).
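To illustrate why a tiled baitset ends up larger than the sequence it targets, here is a minimal Python sketch of tiling. The bait length (120) and step (60) are assumptions chosen for illustration, not values recorded in these notes, and `tile_baits` and `tiling_depth` are hypothetical helper names.

```python
def tile_baits(target, bait_len=120, step=60):
    """Return (start, bait_sequence) pairs tiled across one target sequence.

    bait_len and step are illustrative assumptions, not values from the notes.
    """
    if len(target) <= bait_len:
        # Target shorter than one bait: use the whole target as a single bait.
        return [(0, target)]
    starts = list(range(0, len(target) - bait_len + 1, step))
    # Add one end-anchored bait so the final bases of the target are covered.
    last = len(target) - bait_len
    if starts[-1] != last:
        starts.append(last)
    return [(s, target[s:s + bait_len]) for s in starts]


def tiling_depth(targets, bait_len=120, step=60):
    """Total bait bases divided by total target bases."""
    total_target = sum(len(t) for t in targets)
    total_bait = sum(
        len(bait) for t in targets for _, bait in tile_baits(t, bait_len, step)
    )
    return total_bait / total_target


if __name__ == "__main__":
    toy_target = "ACGT" * 75  # a made-up 300 bp target
    print([start for start, _ in tile_baits(toy_target)])
    print(tiling_depth([toy_target]))
```

With a step smaller than the bait length, each target base is covered by more than one bait, so the summed bait length exceeds the target length.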


Review of what was done

Started with Sangtae's transcriptome, and used MarkerMiner to find single-copy genes by reference to Oryza sativa. 490 genes were targeted, and a kit of 20K baits was designed, tiling over the 490 genes. Sequenced samples from silica-dried, fresh, frozen, and whole-genome PCR amplified material (the amplified material worked very well; this would be worth doing for more samples). A couple of samples didn't work. We now have 72 samples in our tree and recovered essentially all genes, as well as presumably plenty of off-target cpDNA.

Also working with Euphorbia, and found a lot of chloroplast data from each sample, but that project also has many individuals sampled for many species. For those species that were sequenced for just one individual, cpDNA coverage is rather gappy.

Matrices range from 2E06 to 2E07. We are working on a pipeline to automate this process, get rid of poor loci, etc. Results will come tomorrow, but in brief they are quite good. Two analyses: RAxML on the concatenated matrix, and the ASTRAL species tree method. Both trees are pretty good, but there is a lot to do on the alignments to get rid of errors.
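One way the "get rid of poor loci" step might be automated is a simple occupancy filter. The sketch below is an assumption about what such a step could look like, not the pipeline discussed at the meeting; the thresholds, the function names, and the input layout (locus name mapped to a dict of taxon name and aligned sequence) are all hypothetical.

```python
def sequence_completeness(seq):
    """Fraction of alignment columns holding a real base (not gap, N, or ?)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    missing = sum(upper.count(symbol) for symbol in "-N?")
    return 1.0 - missing / len(seq)


def filter_loci(alignments, n_taxa, min_completeness=0.5, min_occupancy=0.75):
    """Keep loci where enough taxa have a reasonably complete sequence.

    alignments: dict of locus name -> dict of taxon name -> aligned sequence.
    Both thresholds are illustrative assumptions, not values from the notes.
    """
    kept = {}
    for locus, aln in alignments.items():
        good = sum(
            1 for seq in aln.values()
            if sequence_completeness(seq) >= min_completeness
        )
        if good / n_taxa >= min_occupancy:
            kept[locus] = aln
    return kept


if __name__ == "__main__":
    toy = {
        "locus1": {"a": "ACGT", "b": "ACGT", "c": "AC--", "d": "----"},
        "locus2": {"a": "ACGT", "b": "NNNN"},
    }
    print(sorted(filter_loci(toy, n_taxa=4)))
```

A filter like this only addresses missing data; it does not catch the alignment errors mentioned above, which would need a separate check.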

Sampling is mostly Carex, but we are also working with Isabel's samples for the next run. We would like to know whether our kit gives more resolution than the anchored phylogenomics kit and the Kew kit, which is angiosperm-wide. Outgroups are also included: Trichophorum, Schoenoplectus, and Cyperus.

Regarding design: At the outset, it was not clear that the MADS-box genes should be included. Question: might there be single-copy genes in Carex that are multiple-copy in Oryza? If so, we might be missing some single-copy genes that are unique to Carex. Eric raises the question of whether the Carex lupulina genome could be used for this, but it's not clear whether the 454 data would be sufficient. Isabel: did you use the 1kp data? Tamara: no; we got the Cyperus transcriptome data from 1kp, but we used primarily the Carex siderosticta transcriptome from Sangtae, then added in 38 loci from the Cyperus transcriptome.

In the future, you could potentially design a probeset that adds in the MADS-box genes, swapping out some of the less informative genes. We also talked about the possibility of reducing the number of targeted genes to 50 or so, allowing us to sequence more individuals at lower cost.

Broader question: will the markers work across the family?

MarkerMiner overview

De Smet et al. 2013

We map our transcriptome back to the Oryza sativa genome. Sometimes coverage is incomplete, and we allow this. Sometimes the Carex sequence differs from Oryza at the nucleotide level, perhaps dramatically. The resulting probes are then based on the Carex sequence, not the Oryza sequence. Because of the differences from Oryza, these probes are probably of somewhat limited utility beyond the family. Eric is betting that they will work fairly broadly, considering that the genome and transcriptome were from different families. Isabel looked informatically at the overlap between projects, and found about 20% overlap in genes among projects from different families. Copy number assessment is actually from Arabidopsis (analysis of De Smet et al. 2013).
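A comparison of target gene lists between projects can be sketched as a set intersection. The notes do not say how the overlap figure was computed, so the definition here (shared genes divided by the size of the smaller list) is an assumption, as are the function name and the made-up gene identifiers. It also assumes that both projects name their targets with the same reference gene IDs.

```python
def overlap_fraction(genes_a, genes_b):
    """Share of the smaller gene set that also occurs in the other set.

    This definition of overlap is an assumption made for illustration.
    """
    a, b = set(genes_a), set(genes_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


if __name__ == "__main__":
    # Made-up identifiers standing in for reference gene IDs.
    project_1 = {"g1", "g2", "g3", "g4", "g5"}
    project_2 = {"g5", "g6", "g7", "g8", "g9", "g10"}
    print(overlap_fraction(project_1, project_2))
```

Dividing by the smaller set keeps the number interpretable when one kit targets far more genes than the other; dividing by the union (Jaccard index) would be an equally reasonable choice.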

This could be improved. Tamara is now working on a different family, creating her own proteome. At the time she started this, Tamara found that MarkerMiner gave the best results. Other pipelines: Sondovac and Weitemier; the latter pipeline assumes the transcriptome and genome are from the same species. Isabel points out that most people use the MarkerMiner pipeline.

Sampling going forward, strategies for scaling up

General question: how do we go forward with this? To date, 72 tips have been sampled, representing the four major clades. For everyone, what are the best options going forward for this method? Sample one of each section? Focus on the oddities that are not stable in the barcoding tree? We don't yet know how the probe set will work within a small clade, but we expect it will do well, and we'll know soon. The data are easily combined across labs, and may be more robust in this respect than RADseq (though Andrew doubts this). In Carex, we have good support for individual smaller clades, but less support for relationships among clades.
