The marker effects are employed in another aspect of the invention for determining a genomic estimated breeding value of a bovine subject based on the genotype of said bovine subject by correlating its genotype with the effect of each individual genetic marker allele on udder health, fertility, other health and/or estimated breeding value of.
Genomic selection is a form of marker-assisted selection in which genetic markers covering the whole genome are used so that all quantitative trait loci (QTL) are in linkage disequilibrium with at least one marker. This approach has become feasible thanks to the large number of single nucleotide polymorphisms (SNPs) discovered by genome sequencing and to new methods for efficiently genotyping large numbers of SNPs. Simulation results and limited experimental results suggest that breeding values can be predicted with high accuracy using genetic markers alone, but more validation is required, especially in samples of the population different from that in which the effects of the markers were estimated. The ideal method to estimate the breeding value from genomic data is to calculate the conditional mean of the breeding value given the genotype of the animal at each QTL.
This conditional mean can only be calculated by using a prior distribution of QTL effects so this should be part of the research carried out to implement genomic selection. In practice, this method of estimating breeding values is approximated by using the marker genotypes instead of the QTL genotypes but the ideal method is likely to be approached more closely as more sequence and SNP data is obtained. Implementation of genomic selection is likely to have major implications for genetic evaluation systems and for genetic improvement programmes generally and these are discussed.
Background: Genomic selection or genome-wide selection (GS) has been highlighted as a new approach for marker-assisted selection (MAS) in recent years. GS is a form of MAS that selects favourable individuals based on genomic estimated breeding values. Previous studies have suggested the utility of GS, especially for capturing small-effect quantitative trait loci, but GS has not become a popular methodology in the field of plant breeding, possibly because there is insufficient information available on GS for practical use.
Conclusions: Statistical concepts used in GS are discussed with genetic models and variance decomposition, heritability, breeding value and linear model. Recent progress in GS studies is reviewed with a focus on empirical studies. For the practice of GS in plant breeding, several specific points are discussed, including linkage disequilibrium, features of populations and genotyped markers, and breeding scheme. Currently, GS is not perfect, but it is a potent, attractive and valuable approach for plant breeding. This method will be integrated into many practical breeding programmes in the near future with further advances and the maturing of its theory.
INTRODUCTION

Genomic selection or genome-wide selection (GS) has been highlighted as a new approach for marker-assisted selection (MAS) in recent years. GS is a form of MAS that selects favourable individuals based on genomic estimated breeding values (GEBVs). Breeding values have not been a popular index in plant breeding, although they are frequently used in animal breeding. They are defined as ‘the sum of the estimate of genetic deviation and the weighted sum of estimates of breed effects’, which are predicted using phenotypic data from family pedigrees based on the additive infinitesimal model. Several statistical approaches have been proposed for the prediction of estimated breeding values (EBVs), such as best linear unbiased prediction (BLUP) and a Bayesian framework. Furthermore, an innovative method for predicting breeding values was proposed based on genome-wide dense DNA markers, known as the GEBV.
When the idea of GEBV was proposed, it was regarded as an unrealistic approach because large-scale genotyping technologies were lacking at the time. However, it has become a feasible approach with recent advances in high-throughput genotyping platforms. The term ‘GS’ was first introduced by Haley and Visscher at the 6th World Congress on Genetics Applied to Livestock Production at Armidale, Australia in 1998 according to, although it was not used in the main text of. However, the overall MAS programme using GEBV was later referred to as GS.

The general processes of GS and traditional MAS used for quantitative traits (QTs) are shown in Fig.
The main frameworks of the two approaches are similar, in that both GS and traditional MAS consist of training and breeding phases. In the training phase, phenotypes and genome-wide (GW) genotypes are investigated in a subset of a population, i.e. the training population in GS and the mapping population in traditional MAS. Within populations, significant relationships between phenotypes and genotypes are predicted using statistical approaches. In the breeding phase, genotype data are obtained in a breeding population, before favourable individuals are selected based on the genotype data obtained. Three obvious differences between the two approaches are apparent: (1) in the training phase, quantitative trait loci (QTLs) are identified in traditional MAS, while formulae for GEBV prediction, known as GS models, are generated in GS; (2) in the breeding phase, genotype data are only required for targeted regions in traditional MAS, whereas GW genotype data are considered to be necessary in GS; (3) in the breeding phase, favourable individuals are selected based on the genotypes of markers in MAS, whereas GEBVs are used for selection in GS.
Thus, GS jointly analyses all the genetic variance of each individual by summing the marker effects to obtain the GEBV, and it is expected to address small-effect genes that cannot be captured by traditional MAS.

Schemes of genomic selection (GS) (left) and traditional MAS for the selection of quantitative traits (right). Both GS and traditional MAS contain training and breeding phases. In the training phase, quantitative trait loci (QTLs) are identified in traditional MAS, whereas formulae for genomic estimated breeding value (GEBV) prediction, i.e. GS models, are generated in GS. In the breeding phase, favourable individuals are selected based on the genotypes of the selected markers in MAS, whereas GEBVs are used for selection in GS.

Since GS was first propounded by, many reports have indicated the usability of GS for breeding for QTs. However, GS has still not become a popular methodology in the field of plant breeding. We consider that a major obstacle is the lack of sufficient knowledge of GS for practical use. Indeed, most GS studies have dealt with statistics and simulation, which are discussed in terms of formulae that are often too specialized for breeders and molecular biologists to understand. To initiate further discussions on the applicability of GS in plant breeding, our aim here is to discuss GS from a practical breeding viewpoint.
First, the statistical approaches used in GS are briefly explained to convey the essence of this approach. Second, we survey recent progress in GS studies from the areas of animal and plant science, mainly addressing those dealing with empirical data. Third, we describe several specific factors that require careful consideration before practising GS in plant breeding. Finally, we discuss future prospects for the further advancement of GS and MAS programmes overall.

STATISTICAL CONCEPTS USED IN GS

GS, traditional MAS and pedigree-based phenotypic selection (PS) methods all rely on a common selection framework, i.e. finding a causal relationship between genetic factors and target traits based on putative genetic factors underlying the phenotypic distribution (in PS) or observed marker genotypes (in GS and traditional MAS) in a training population. Before describing the statistical approaches used for GEBV prediction, we briefly review the general statistical concepts that are commonly used in PS, traditional MAS and GS.
Genetic models and variance decomposition

Heritability is a measure of the degree to which the phenotypic characteristics of a population are inherited by the next generation, and it is represented as the ratio of genetic variance to phenotypic variance. Broad-sense heritability ($H^2$) focuses on the total genetic effects, $G$, including the additive, dominance and epistatic effects, whereas narrow-sense heritability ($h^2$) counts only additive genetic effects. Therefore, for $h^2$, the genetic model ($P = G + E$) can be rewritten using the additive genetic effect, $A$. Here, $y_i$ is the phenotypic value of individual $i$, while $m_0$ is the mean phenotypic value of the population. Because $V(A)$ cannot be directly observed, $h^2$ has been conventionally estimated by a comparison of the phenotypic values of parents and their offspring. The BVs that are predicted based on an estimated heritability are known as EBVs.
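
The parent-offspring comparison mentioned above can be illustrated with a minimal sketch. It assumes one common form of that comparison, in which the slope of the regression of offspring phenotype on the mid-parent phenotype is taken as an estimate of $h^2$; the function name and the toy numbers are hypothetical and are not taken from any study discussed here.

```python
import numpy as np

def h2_from_midparent_regression(midparent, offspring):
    """Slope of the least-squares regression of offspring on mid-parent values.

    Under the usual additive assumptions this slope is used as an estimate
    of narrow-sense heritability (an assumption of this sketch).
    """
    midparent = np.asarray(midparent, dtype=float)
    offspring = np.asarray(offspring, dtype=float)
    # slope = Cov(midparent, offspring) / Var(midparent)
    cov = np.cov(midparent, offspring, ddof=1)
    return cov[0, 1] / cov[0, 0]

# Hypothetical toy data: mean phenotype of each parent pair and of its offspring.
midparent = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
offspring = np.array([12.1, 12.9, 14.0, 15.1, 15.9])

h2_hat = h2_from_midparent_regression(midparent, offspring)
print(f"estimated h2 = {h2_hat:.2f}")
```

With real data the estimate also depends on shared environment and sampling error, which this sketch ignores.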
By contrast, the phenotypic value $y_i$ and $V(A)$ in GS are estimated based on the flux of the genotype effects of GW markers. Thus, the BV predicted in GS is known as the genomic EBV (GEBV). Note that the residual effect variance $V(E')$ is ignored in BV prediction, because narrow-sense heritability is employed.

Linear model for marker effects

In many implementations of GS, the causal relationship between the phenotype and genotype is represented as a linear model or its extension, which is then used to infer the GEBV of an individual in a breeding population. Thus, the linear model is a fundamental model employed in GS.
Here, we assume there are $N$ individuals and $M$ bi-allelic markers in the training population, and we focus on one of the markers. Let $(y_i, x_{1i})$ denote the pair of the observed phenotype and the marker genotype of the $i$th individual, i.e. $(y_1, x_{11}), (y_2, x_{12}), \ldots, (y_N, x_{1N})$. In addition, let us suppose that the two genotypes are encoded by 0 and 1, respectively, and that the phenotypes of the $N$ individuals are distributed as shown in Fig. Because an individual gains additional phenotypic value, $\beta_1$, depending on its marker genotype, the phenotype can be modelled as follows:

$$y_i = \beta_0 + x_{1i}\beta_1 + \varepsilon_i \qquad (1)$$

where $\beta_0$ and $\beta_1$ are the parameters to be determined, and $\varepsilon_i$ is an error term that is usually assumed to have a normal distribution with a mean of zero. This model is a linear combination of its terms and is therefore known as a ‘linear model’; it states that the phenotypes of individuals with genotypes 0 and 1 are normally distributed around $\beta_0$ and $\beta_0 + \beta_1$, respectively. The parameters of a linear model may be determined by least-squares estimation, such that the sum of $\varepsilon_i^2$, i.e. the error function $E = \sum_i (y_i - \beta_0 - x_{1i}\beta_1)^2$, is minimized and the line is fitted to the phenotype. The linear model (1) represents the relationship between the genotype and phenotype for a single marker, but it can be extended to include all $M$ markers as follows:

$$y_i = \sum_{j=0}^{M} x_{ji}\beta_j + \varepsilon_i, \qquad x_{0i} = 1 \qquad (2)$$
Relationships between marker genotypes ($x_{1i}$: 0 and 1) and phenotypes ($y_i$) of the individuals (open circles) in a training population. If the marker genotype is correlated with the phenotype, segregation is modelled using the bold line ($y_i = \beta_0 + x_{1i}\beta_1$, where $\beta_0$ and $\beta_1$ are parameters to be determined).

Because GW genotype data are used in GS, a problem known as the ‘large $p$, small $n$ problem’ often arises when the linear model (2) is employed for GEBV prediction ($p$ and $n$ are the numbers of markers and individuals, respectively). That is, a linear model that consists of $p$ markers is too complex for predicting the BVs of $n$ individuals. It can therefore over-fit, so that the linear model only works well in the training population. To avoid over-fitting, a penalty term is introduced in the error function, i.e. $E = \sum_{i=1}^{N} \left( y_i - \sum_{j=0}^{M} x_{ji}\beta_j \right)^2 + \lambda \sum_{j=0}^{M} |\beta_j|^q$, where $\lambda$ is a parameter that controls the effect of the penalty term. Note that large values of $\beta_j$ increase the penalty and therefore work against minimization of the error function. The choices $q = 1$ and $q = 2$ are known as LASSO (least absolute shrinkage and selection operator) and ridge regression (RR in Table ), respectively.
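
The penalized error function with $q = 2$ can be sketched in a few lines. This is a minimal illustration rather than the method of any study cited here: the data are simulated, the function name is hypothetical, and, as a simplification, the phenotype and genotypes are centred so that the intercept is left out of the penalty. Note also that the matrix `X` below is laid out as individuals by markers, i.e. the transpose of the $x_{ji}$ indexing used in the text.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Minimise sum_i (y_i - sum_j X[i, j] * b_j)^2 + lam * sum_j b_j^2.

    X has one row per individual and one column per marker.
    The minimiser solves (X'X + lam * I) b = X'y.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_markers = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_markers), X.T @ y)

# Simulated toy data (hypothetical): fewer individuals than markers.
rng = np.random.default_rng(0)
n, p = 20, 100
X = rng.integers(0, 2, size=(n, p)).astype(float)   # genotypes coded 0/1
true_beta = np.zeros(p)
true_beta[:5] = 1.0                                  # five markers with real effects
y = X @ true_beta + rng.normal(0.0, 0.5, size=n)

# Centre the data so that no intercept term is needed (simplifying assumption).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

beta_small = ridge_marker_effects(Xc, yc, lam=0.1)
beta_large = ridge_marker_effects(Xc, yc, lam=100.0)

# A larger lambda shrinks the estimated effects more strongly toward zero.
print(np.linalg.norm(beta_small), np.linalg.norm(beta_large))
```

Without the penalty, $X^{\top}X$ is singular whenever $p > n$, which is one way to see why unpenalized least squares breaks down in this setting.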
Ridge regression forces all the coefficients to shrink toward zero equally, while LASSO can set several coefficients that are unrelated to the phenotype to zero. Therefore, if the phenotype is controlled by many markers with small effects, ridge regression will capture those effects, whereas LASSO will capture large effects with a small number of markers. If the coefficients of the markers are set to zero or a low value in the training phase, they are excluded from the model and their genotype information is not required during the breeding phase.

For example, ‘least-square estimation’ and ‘BLUP estimation’ for effects of markers or chromosome segments in adopt similar linear models. Here, BLUP stands for best linear unbiased prediction of a parameter. As summarizes, the methods using ridge regression assume that the effects of markers have an equal variance.
On the other hand, Bayesian methods that are known as BayesA and BayesB of can make relaxed assumptions to estimate the variances of the effects of markers separately. In a Bayesian framework, the effect of a marker is represented by the distribution of a random variable that is determined by its prior distribution according to some assumptions. Specifically, BayesA and BayesB adopt different prior distributions for the variance of the effects of markers; that of the latter is defined to allow some of the markers to have no effect on the phenotypic value. Although the simultaneous evaluation of markers and the absence of any need for marker selection are advantageous characteristics of GS, decreasing the number of markers required in the breeding phase might be preferable from an economic viewpoint.

RECENT PROGRESS IN GS STUDIES

The most important factor determining the success of GS is the accurate prediction of GEBVs. The accuracy of the predicted GEBVs is often estimated based on the correlation between the observed phenotypic values and the GEBVs. To produce accurate GEBVs, several studies have compared statistical approaches to GEBV prediction.
In addition, simulation studies have been widely used to investigate the effect of the number of QTLs, markers, individuals and other variables. These studies were reviewed recently by and, and so are not described further in this section. Instead, we focus on recent progress in GS based on empirical data to understand better the practical use of GS.
Animal science

Studies of GS are more common in the field of animal science than in plant science. The BV concept was used in animal breeding long before the emergence of GS, so the GS approach was more readily accepted by animal scientists. In addition, the lower diversity of the targeted species and the weaker effects of environmental factors during the growing stage might have contributed to the rapid introduction of GS in animal science. The first empirical GS study in animal science was reported by Legarra et al. (2008) using mice (Table ). A total of 1884 individuals were generated from eight inbred lines and genotyped using 10 946 single nucleotide polymorphism (SNP) markers, before predicting the GEBVs for four traits related to body size.
A comparison of the predictive ability and accuracy of GEBVs generated with or without SNP genotypes and polygenic effects demonstrated that GW genetic evaluation and selection provided better accuracy and predictive ability than the classical polygenic model.

The most advanced progress in GS has been observed in dairy cattle. In Table, the results of three GS studies in dairy cattle are summarized. In addition to the three reports in Table, seven empirical GS studies of dairy cattle were also reported and reviewed by,. In the three cattle studies in Table, 500–5335 individuals were used for GEBV prediction using 18 991–38 416 SNPs.
GEBVs for various QTs related to milk production, cattle body size and fertility were predicted using several different methods, where the accuracy of GEBVs ranged from 0.14 to 0.69. And reported GEBV prediction in beef cattle. Parentally identified steers and sires of 2405 Angus cattle were genotyped using 41 028 SNPs in a study by, while an admixture population consisting of Angus, Charolais and hybrid bulls was genotyped using 37 959 SNPs for 721 individuals in a study. GEBVs for traits related to daily gain and daily intake were investigated, and the estimated accuracies ranged from −0.07 to 0.48. In chickens, tested 16 traits related to eggs and chicken body size with 23 356 SNP genotypes using 2708 individuals derived from a single brown-egg layer line. The accuracy of the estimated GEBVs ranged from 0.2 to 0.7. In addition, reported GS studies on Salmonella carrier-state resistance in chickens (not shown in Table ).

The populations used in the empirical studies mentioned above were usually divided into two, i.e. training and validating populations. Training populations were used to develop GS models based on genotypic and phenotypic data, whereas the validating populations were used for investigating the GEBV accuracy by estimating the correlation between the GEBVs predicted by the GS models and the observed phenotypic values.
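
The validation step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the marker effects are taken as already estimated in a training population, the GEBV is the sum of marker effects weighted by genotype, and accuracy is the Pearson correlation with the observed phenotypes. All names and numbers are hypothetical.

```python
import numpy as np

def gebv(genotypes, marker_effects, intercept=0.0):
    """GEBV of each individual: intercept plus the sum of its marker effects."""
    genotypes = np.asarray(genotypes, dtype=float)
    marker_effects = np.asarray(marker_effects, dtype=float)
    return intercept + genotypes @ marker_effects

def accuracy(predicted, observed):
    """Pearson correlation between predicted GEBVs and observed phenotypes."""
    return float(np.corrcoef(predicted, observed)[0, 1])

# Hypothetical marker effects estimated in a training population.
effects = np.array([0.5, -0.2, 0.1])

# Hypothetical validating population: 4 individuals x 3 markers, coded 0/1.
validation_genotypes = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])
observed_phenotypes = np.array([10.7, 9.9, 10.2, 10.3])

predicted = gebv(validation_genotypes, effects)
acc = accuracy(predicted, observed_phenotypes)
print(predicted, round(acc, 3))
```

Because the correlation is unaffected by shifting or rescaling the predictions, the intercept does not change the accuracy computed this way.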
Validation is not theoretically essential for a GS scheme (Fig. ), although it is practically important to confirm the adequacy of a GS model before moving on to the breeding phase. Of the seven studies listed in Table, five considered pedigree relationships when the populations were divided into training and validating populations (;; Van Raden et al., 2010;; ). Thus, these studies reflected the entire GS process better than the others, because the breeding phases in GS were demonstrated virtually by verifying the GS models using the progeny of the training populations.

The reported studies used different materials and statistical methods for GEBV prediction, but many of them showed that the accuracy of GEBV was higher than that of traditional EBV and that it increased with a larger population size, larger numbers of genotyped SNPs, and higher heritability of the targeted traits. The details are not described here, but some of the studies compared different statistical methods for GEBV prediction. Note that the approach with the highest accuracy of GEBVs was different in each case (Table ).
The accuracy of GEBVs estimated in empirical studies fell below 0.7 (Table ), which was lower than that suggested by many simulation studies such as 0.85 in. Indicated that the distribution of QTL effects in real data is generally lower than that assumed in simulation studies. If this is true, the lower accuracy estimated from real data might be affected by a lower number of QTL effects as well as other factors, such as the non-additive effects of QTLs and environmental factors.

Plant science

Plant breeding targets a diversity of species with different reproduction systems, generation times, genome structures and utilized organs. Thus, various methods are used in conventional breeding, i.e. PS and traditional MAS, to adapt to the demands of different targeted species and breeding objectives.
Like conventional breeding, GS should be adapted to fit different types of plant species and breeding objectives.

Reports on plant species that specified ‘genomic selection’ or ‘genomewide selection’ have been published since 2007. Simulated the efficiency of GS in a cross of inbred lines, which is common in plant breeding but not in animal breeding. However, no specific plant species was considered as the targeted species in this paper. Simulation studies of specific species were first published for maize, where a comparison between GS and marker-assisted recurrent selection (MARS) was demonstrated for three cycles of selection of doubled haploid lines (DHLs).
The response of GS was 18–43% greater than that of MARS with different numbers of QTLs (20, 40 and 100). Moreover, simulation studies using maize were performed to determine the advantages of using DHLs compared with F2 populations in GS and MARS, and to develop a methodology for the rapid introgression of exotic germplasms in an adapted line of maize via GS. In addition to maize, two GS simulations were performed with the oil palm, which is an outcrossing species that requires 19 years for one cycle of PS, and with a self-pollinated crop, barley.

While these studies simulated biparental cross populations, three studies also reported GS simulation using multiple inbred lines in barley based on real genotype data obtained mainly from SNPs and diversity array technology (DArT). Compared the accuracy of four GS prediction methods that were affected by marker density, level of linkage disequilibrium (LD), QTL number, and sample size, where the level of replication in populations was generated using 42 multiple inbred lines of two-row spring barley with the genotypes of 1933 loci obtained from SNP, DArT and classical markers. They concluded that the GS prediction method with the highest accuracy changed with different levels of LD between the markers and QTLs, QTL effects, and generations of individuals.
Moreover, simulated the accuracy of GS using larger-scale data, consisting of 1325 SNPs in 863 breeding lines of barley derived from nine breeding programmes in the USA. Seven methods were used for GEBV prediction and the mean of the predictions in all methods was more accurate than predictions based on any single method under medium and high heritability. Simulated the dynamics of long-term GS using 192 breeding lines from an elite six-row spring barley programme with genotypes identified by 983 polymorphic markers. The results suggested that losing favourable alleles with weak LD with markers during selection cycles was inevitable, while placing additional weight on low-frequency favourable alleles was important for long-term GS.

Investigations of the accuracy of GEBV predictions using empirical data have been reported for maize, barley, wheat and Arabidopsis thaliana (Table ). It was first demonstrated by for maize, A. thaliana and barley. All the test populations were generated from biparental crosses where the number of test progeny and markers ranged from 119 to 415 and 69 to 1339, respectively. Arabidopsis thaliana had the highest accuracy of GEBVs, although the number of polymorphic markers used for genotyping was the lowest. This study was followed by demonstrations of GS using empirical data in maize by, and, as shown in Table.
Compared the performance of nine models using a series of experiments with DHLs derived from a single cross conducted in five environments, and suggested the need to appropriately model genotype–environment interactions and to employ an independent estimate of error. Demonstrated GS using a genetically diverse population of 300 lines bred in CIMMYT (The International Maize and Wheat Improvement Center) and 1148 SNPs, with a predicted accuracy of GEBVs ranging from 0.42 to 0.79 by ridge regression BLUP.
The largest-scale analysis of maize was performed by, which used 4699 progeny derived from 25 nested association mapping populations with genotypes for 1106 SNPs. While a common line, ‘B73’, was used as the maternal line across the 25 mapping populations, the paternal lines were all different. Interestingly, the accuracy of the predicted GEBVs was different in the 25 crosses, although the study used almost the same SNPs, targeted traits and population sizes.
GS studies using empirical data from wheat were first reported by using 1279 DArT genotypes and 599 wheat lines bred in CIMMYT. The targeted trait was grain yield and the accuracy of GEBVs predicted by reproducing kernel Hilbert space regression ranged from 0.48 to 0.61. In addition, reported empirical results for wheat using 209 (CC population) and 174 (FKQ population) progeny of DHLs of biparental crosses with 399 and 574 polymorphic genotypes, respectively.
The accuracy of GEBVs in the CC and FKQ populations ranged from 0.32 to 0.84 and 0.41 to 0.73, respectively (RR-BLUP, sample size was 96).

GS of perennial crops is considered to be more effective than GS of annual crops because of their long generation times. GEBV predictions based on empirical data were presented for Loblolly pine and eucalyptus at the IUFRO Tree Biotechnology Conference 2011 (Table). All cases used full-sib families as test populations and the number of individuals ranged from 149 to 920.
In the two studies of Loblolly pine, 3406–3938 SNP markers were used for genotyping, while 3120–3564 DArT markers were used in the study of eucalyptus. The GEBV accuracy of all studies ranged from 0.3 to 0.77.

Interestingly, the ranges of accuracies in empirical studies were higher in plant studies than in animal studies, although most plant studies employed lower numbers of genotyping markers. This might be due to the lower genetic diversity caused by a small number of parental lines and a greater bottleneck in the breeding materials.
Note that the number of markers used for woody species was higher than that used for annual plant species. Empirical plant GS studies show that GS is a potential method for plant breeding and that it can be performed with realistic sizes of populations and markers when the populations used are carefully chosen.

THE PRACTICE OF GS IN PLANT BREEDING

Linkage disequilibrium (LD)

LD has a major effect on the operability of GS, so it has to be well understood before performing GS. LD is defined as the non-random association of alleles at different loci. The intensity of LD between two loci is measured based on the frequency of alleles, using indexes such as $D$, $D'$ and $r^2$, and it ranges from completely random ($D = D' = r^2 = 0$) to complete LD ($D = 0.25$, $D' = r^2 = 1$). The LD intensity decays with greater distance between two markers. Although it is difficult to delineate, a significant LD intensity is commonly considered to be $r^2 > 0.1$.
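
The three LD indexes can be computed from haplotype and allele frequencies with the standard definitions, as in the following minimal sketch. The function name and argument names are hypothetical; the two calls reproduce the two extremes quoted above for loci whose allele frequencies are both 0.5.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two bi-allelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the alleles at the two loci.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

# Completely random association: the haplotype frequency equals the product.
random_case = ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5)

# Complete LD: allele A is always found together with allele B.
complete_case = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)

print(random_case, complete_case)
```

The value $D = 0.25$ for complete LD holds only when both allele frequencies are 0.5; $D'$ and $r^2$ are the scale-free indexes.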
In general, the distance between two markers with significant LD intensity ($r^2 > 0.1$) is found to be shorter in outcrossing species than in selfing species, although it varies with species, population structure and genome region. For example, observed marker intervals with significant LD intensity in outcrossing species are reported to be 100–150 bp in Loblolly pine, 500 bp in grape and 0.4–70 kbp in maize, whereas those in selfing species are 50 kbp in soybean, 100 kbp in rice and 250 kbp in A. thaliana (reviewed by ).

The number of markers required for GS modelling is determined based on the marker interval with a significant LD intensity in targeted populations.
In the case of Loblolly pine, the genome size exceeds 20 Gbp and the marker interval with a significant LD intensity was between 100 and 150 bp in 435 unrelated individuals. If the 435 individuals were used for GS modelling, the number of markers required would be at least 200 M (20 Gbp per 100 bp). However, significant GEBVs with 0.3–0.83 accuracy were obtained using 3406–3938 SNPs in full-sib families of Loblolly pine (Table).
This large disparity in the number of required markers is caused by the different length in the marker interval with a significant LD intensity in an unrelated mapping population and full-sib families. In other words, employing a population that originated from a few parental lines is effective in reducing the number of markers required, especially for species whose LD intensities decay rapidly among unrelated individuals (see Fig. ).
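
The marker-number estimate for Loblolly pine is simple arithmetic, sketched below. It assumes, as the text does, one marker per interval of significant LD spread evenly across the genome; the function name is hypothetical.

```python
def markers_required(genome_size_bp, ld_interval_bp):
    """Number of evenly spaced markers needed so that consecutive markers
    are no further apart than the interval with significant LD."""
    # Ceiling division on integers avoids floating-point rounding.
    return -(-genome_size_bp // ld_interval_bp)

genome_size = 20_000_000_000          # 20 Gbp, as quoted for Loblolly pine

n_at_100bp = markers_required(genome_size, 100)
n_at_150bp = markers_required(genome_size, 150)

print(n_at_100bp, n_at_150bp)
```

Even at the upper end of the quoted interval (150 bp), the estimate stays above 100 million markers, several orders of magnitude more than the few thousand SNPs used in the full-sib families.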
Variation of LD intensity in different populations of a single species. (A) Allele frequency and LD indexes ($r^2$) between marker I and others in an unrelated population. Roman numerals represent markers mapped on a linkage group with 20-cM intervals. The two allele types are represented in white and black. ‘White allele freq.’ means the frequency of white alleles for markers II–V, in each case where the marker I allele is white or black. In this example, the white allele frequencies of markers II, III, IV and V are all 0.5, while the LD indices ($r^2$) between marker I and other markers are all zero (completely random). (B) A population of clonally propagated individuals. Assume that an individual is selected from an unrelated population (outlined in blue in population ‘A’) and clonally propagated. All individuals in population ‘B’ share the same genotype. Thus, the $r^2$ values between marker I and the other markers are all 1.0 (complete LD). (C) Suppose two individuals are selected from population ‘A’ (outlined in blue and red) and RILs (recombinant inbred lines) are developed based on a cross between the two individuals. Recombination occurs during meiotic division in the F1, so the white allele frequency varies depending on the distances between marker I and other markers. LD decay is then observed in the RILs.

Relationship between training and breeding populations

In traditional MAS, a marker that is confirmed to have tight linkage with a target QTL or gene can be used as a selection marker in most breeding populations of that species.
Therefore, breeders have not had to seriously consider the relationship between mapping populations and breeding populations. However, in GS, the relationship between training and breeding populations must be carefully considered, with the single exception of a marker set where adjacent markers have significant LD intensities across unrelated individuals in a pool of breeding materials genotyped for the training populations.

Suppose that two pairs of lines used for biparental crosses are selected from a pool of breeding materials (Fig. ). The genotypes of the flanking markers (II and IV) of a targeted gene/allele (yellow-coloured G) are ‘white’ in cross 1, while those in cross 2 are ‘black’.
This indicates that allele types with significant LD with the targeted genes are not kept across different crosses. When this happens in traditional MAS, we usually have to explore the markers nearest to the targeted genes to avoid false positive selection.
However, because GW markers are used in GS, it is almost impossible or meaningless to explore the nearest markers to each GW marker. Thus, establishing a GS model based on a training population does not work in a breeding population if the genetic structures of both populations are different, except for the case described in the preceding paragraph. Indeed, in most reported GS studies, the training populations were assumed to consist of ancestors or randomly selected individuals in a breeding population.
It has been reported that SNP effect estimates calculated from a Holstein–Friesian training population did not produce accurate GEBVs in a Jersey population. A simulation of the accuracy of GEBVs in admixed and cross-bred livestock populations likewise found that the accuracy was greatly reduced when genes from the target pure breed were not included in the admixed and cross-bred population.
Allele types of flanking markers for a targeted gene. Roman numerals represent the markers (I, II, IV and V) mapped on a linkage group. ‘G’ indicates a targeted gene. Distances between adjacent markers and gene G are 20 cM. White and black represent the allele types of the markers, while grey and yellow indicate the allele types of the targeted gene. Suppose that the yellow allele is the favourable allele of the targeted gene G. The LD between gene G, marker II and marker IV is completely random in a pool of breeding materials (unrelated population), while significant LD (r² = 0.8) is observed in RILs developed from biparental crosses (1 and 2) (see Fig. ). When the two individuals outlined in red are selected for a biparental cross (B: cross 1), the genotypes of the flanking markers (II and IV) linked to gene G/yellow are white. By contrast, when the two individuals outlined in blue are selected for a biparental cross (C: cross 2), the genotypes of the flanking markers (II and IV) linked to gene G/yellow are black. This example indicates that the allele types in significant LD with the targeted gene differ between the two crosses.
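The phase reversal between the two crosses can be made concrete with a small numerical sketch. The Python code below computes the LD coefficient D and r² between one flanking marker and gene G in two sets of inbred lines. The haplotype counts (45 parental and 5 recombinant lines per class) and the function names are hypothetical, chosen only for illustration; they give r² = 0.64 rather than the 0.8 quoted for the figure.

```python
import numpy as np

def ld_stats(marker, gene):
    """Return (D, r2) for two biallelic loci coded 0/1, one value per inbred line."""
    marker = np.asarray(marker, dtype=float)
    gene = np.asarray(gene, dtype=float)
    p_m, p_g = marker.mean(), gene.mean()
    d = (marker * gene).mean() - p_m * p_g
    denom = p_m * (1 - p_m) * p_g * (1 - p_g)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2

def ril_haplotypes(n_parental, n_recombinant, yellow_with_white):
    """Build (marker, gene) allele arrays for a hypothetical set of RILs.

    Marker: 1 = white, 0 = black. Gene: 1 = yellow (favourable), 0 = grey.
    """
    if yellow_with_white:   # cross 1: yellow allele travels with white
        parental, recombinant = [(1, 1), (0, 0)], [(1, 0), (0, 1)]
    else:                   # cross 2: yellow allele travels with black
        parental, recombinant = [(0, 1), (1, 0)], [(1, 1), (0, 0)]
    rows = []
    for hap in parental:
        rows += [hap] * n_parental
    for hap in recombinant:
        rows += [hap] * n_recombinant
    arr = np.array(rows)
    return arr[:, 0], arr[:, 1]

m1, g1 = ril_haplotypes(45, 5, yellow_with_white=True)
m2, g2 = ril_haplotypes(45, 5, yellow_with_white=False)

d1, r2_1 = ld_stats(m1, g1)
d2, r2_2 = ld_stats(m2, g2)
d_pool, r2_pool = ld_stats(np.concatenate([m1, m2]), np.concatenate([g1, g2]))

print(f"cross 1: D = {d1:+.2f}, r2 = {r2_1:.2f}")
print(f"cross 2: D = {d2:+.2f}, r2 = {r2_2:.2f}")
print(f"pooled : D = {d_pool:+.2f}, r2 = {r2_pool:.2f}")
```

Under these assumed counts, r² is identical in the two crosses but D has opposite signs, so a marker effect estimated in cross 1 would point to the wrong allele in cross 2. Pooling the lines of both crosses removes the association altogether.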
Population size
Several simulation and empirical GS studies suggest that a larger training population improves the accuracy of GEBV predictions. For example, in biparental wheat populations of 174 or 209 individuals, the average ratio of GS accuracy to PS accuracy for grain quality traits was reported to be 0.66, 0.54 and 0.42 for training population sizes of 96, 48 and 24, respectively. The ratio of the number of individuals in the training population to that in the breeding population varied among studies.
For example, it ranged from 0.08 to 100 in empirical studies of plants (Table ). Although the appropriate ratio varies with the genetic diversity, population size, heritability of traits and number of QTLs, a higher training:breeding population ratio appears to be required to obtain GEBVs with high accuracy when genetic diversity is greater, breeding populations are smaller, heritability is lower and the number of QTLs is larger. The balance between the population size and the number of genotyped markers is also important. When the population size is small and the number of markers is large, genotype effects are often overestimated because the model exaggerates minor fluctuations in the data, i.e. the ‘large p, small n’ issue.
Empirical studies indicate that training populations in plant GS studies were often smaller than those in animal studies (Tables and ). Two factors are expected to contribute to this difference. The first is the narrow genetic diversity of plant populations, which is mainly caused by self-pollination and/or the small number of parental lines used to generate the tested populations (biparental crosses have often been used). Because populations with greater genetic diversity require larger population sizes to obtain GEBVs with high accuracy, smaller training populations are used in plant GS studies, especially for self-pollinating species and/or populations derived from biparental crosses. The second factor is the large quantity of legacy phenotype data on pedigrees, which have been used to estimate traditional BVs in animal breeding. The accumulated phenotype data should make it possible to perform GS studies at low cost. As in animal studies, pooling phenotypes of plant populations that have been evaluated in multiple regions would be a promising approach for plant GS studies, satisfying both high GEBV accuracy and low experimental cost.
Number of markers
Generally, more markers are required for a population in which significant LD extends over shorter marker intervals. In addition, empirical and simulation studies suggest that a larger number of markers improves the accuracy of GEBVs. For example, the simulated accuracy of GEBVs was found to improve when the marker density was increased from 0.25 to 8 SNP markers per centimorgan in 100 unrelated animals. Furthermore, in an inbred population derived from a biparental cross, the response to GEBV-based selection was shown to improve when the adjacent marker interval was decreased from 28.0 to 7.0 cM, whereas no differences were observed among marker intervals of 7.0, 3.5 and 2.3 cM when the total length of the linkage map was assumed to be 1794 cM. The heritability of targeted traits also affects the relationship between marker density and GEBV accuracy. An adjacent-marker r² of 0.15 was shown to be sufficient for a trait with a heritability of 50%, whereas GEBV accuracy was improved by increasing the r² to 0.2 for a trait with a heritability of 10%. However, too many markers often lead to a loss in GEBV accuracy, as described in the section on population size.
One of the obvious differences between GS and traditional MAS is the number of markers that must be genotyped in a breeding population. In most GS studies, the whole set of markers used in the training population was also applied to the breeding or validation population.
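The following is a minimal sketch of how marker effects estimated in a training population are applied to a breeding population genotyped with the same marker set, and of how the ‘large p, small n’ issue shows up. It uses ridge regression, one common way of shrinking marker effects and not necessarily the method of the studies discussed here. The data are simulated: the population sizes, the number of QTLs, the heritability of about 0.5, the independent (LD-free) markers, the two penalty values and all function names are assumptions made for illustration.

```python
import numpy as np

def ridge_marker_effects(X, y, lam):
    """Ridge estimates of marker effects; a single penalty lam is shared by all markers."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    n = Xc.shape[0]
    # Dual form: solve an n x n system, which is cheaper when markers outnumber lines.
    alpha = np.linalg.solve(Xc @ Xc.T + lam * np.eye(n), y - y_mean)
    return Xc.T @ alpha, x_mean, y_mean

def predict_gebv(X_new, effects, x_mean, y_mean):
    """GEBV = training mean + sum of (centred genotype x estimated marker effect)."""
    return y_mean + (np.asarray(X_new, dtype=float) - x_mean) @ effects

rng = np.random.default_rng(0)
n_train, n_breed, n_markers, n_qtl = 200, 1000, 1000, 20  # hypothetical sizes

# Simulated genotypes coded 0/1/2; both populations are drawn from the same source.
X_train = rng.integers(0, 3, size=(n_train, n_markers))
X_breed = rng.integers(0, 3, size=(n_breed, n_markers))

# A few markers act as QTLs; phenotype = true breeding value + noise (h2 about 0.5).
qtl = rng.choice(n_markers, size=n_qtl, replace=False)
qtl_effects = rng.normal(size=n_qtl)
bv_train = X_train[:, qtl] @ qtl_effects
bv_breed = X_breed[:, qtl] @ qtl_effects
y_train = bv_train + rng.normal(scale=bv_train.std(), size=n_train)

results = {}
for label, lam in [("weak", 1e-3), ("strong", float(n_markers))]:
    effects, x_mean, y_mean = ridge_marker_effects(X_train, y_train, lam)
    fit = np.corrcoef(predict_gebv(X_train, effects, x_mean, y_mean), y_train)[0, 1]
    acc = np.corrcoef(predict_gebv(X_breed, effects, x_mean, y_mean), bv_breed)[0, 1]
    results[label] = (fit, acc, np.linalg.norm(effects))
    print(f"{label:6s} penalty: training fit = {fit:.3f}, breeding accuracy = {acc:.3f}")
```

In this sketch the weakly penalized model reproduces the training phenotypes almost exactly, yet its accuracy in the breeding population is far lower; that gap is the overestimation of genotype effects described in the section on population size. The stronger penalty shrinks the estimated marker effects towards zero.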
For example, suppose that the training and breeding populations contain 200 and 1000 individuals, respectively, and that 1000 markers are genotyped. The genotype data points then number 200 × 1000 in the training population and 1000 × 1000 in the breeding population. This is quite different from traditional MAS, which requires only a few selected marker genotypes related to a targeted trait in the breeding phase, except when investigating the GW genetic backgrounds of a breeding population in MARS.
Several reports suggest that advances in genotyping technologies will resolve the cost issue posed by the large number of genotype data points required in a breeding population, and this idea might be correct. However, large-scale genotyping is still required when performing GS, which remains an obstacle in many breeding programmes, especially for non-major crops. To overcome this obstacle, several studies have used reduced numbers of genotyped markers. For example, one proposal uses a panel of evenly spaced low-density SNPs to track the effects of high-density SNP alleles within families, based on cosegregation information. Another study determined the imputation scores of untyped markers in a low-density genotyped panel by referencing a high-density panel in barley. Both studies share the idea of predicting the genotypes within marker intervals of a population from low-density allelic data. By contrast, a further study examined the performance of GEBV prediction when the density of the marker panel was reduced.
It was found that low-density, evenly spaced SNPs performed poorly when predicting GEBV, whereas SNPs selected on the basis of their additive-effect size yielded accuracies similar to those obtained at high density.
A model for a genomic selection breeding programme has been proposed that consists of a model training cycle and a line development cycle. It suggested that the most immediate impact of selecting elite lines by GEBV would be a marked increase in the speed of the cycles. Shorter selection cycles would lead to rapid changes in the genetic diversity of the breeding populations and would affect GEBV accuracy during long-term selection. In addition, novel recombinants generated during selection cycles would cause LD decay between markers and QTLs. This would be a more serious issue when lower-density markers are used for GEBV predictions. Simulation studies of the dynamics of long-term selection responses concluded that GS leads to a more rapid decline in the selection response than PS unless new markers are continually added to the prediction of breeding value. These studies also suggested that placing additional weight on low-frequency favourable alleles, especially at the beginning of GS, is important for maximizing the long-term response to GS.
Types of markers
Most GS studies use SNP, DArT and simple sequence repeat (SSR) markers for genotyping. Results based on other types of markers in high-throughput genotyping systems, such as restriction site-associated DNA (RAD) and genotyping by sequencing (GBS) marker systems, will be reported in the near future.
The DArT, RAD and GBS markers identify polymorphisms by hybridization to, or sequencing of, DNA digested with restriction enzymes, so they are dominant markers except where high-coverage genome data are obtained for each individual using RAD and GBS markers. It has been shown that the LD detection power of a dominant marker is lower than that of a co-dominant marker, and that it is improved by a three-locus LD analysis. These results suggest that dominant markers lead to lower GEBV prediction accuracy than co-dominant markers and that employing haplotypes would improve the accuracy.
DNA markers are also categorized as bi-allelic or multi-allelic. The former include SNP, DArT, GBS and RAD markers, while the latter include SSR, RAPD (random amplified polymorphic DNA) and RFLP (restriction fragment length polymorphism) markers. A comparison of GEBV prediction accuracy with SNP and SSR markers in 100 unrelated animals concluded that SNP markers required a two to three times greater density than SSR markers to achieve similar accuracy.
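Returning to the distinction between dominant and co-dominant markers, one way to see where the information is lost is to compare a presence/absence score with the true allele dosage. The sketch below is an illustration under simple assumptions (a single locus, and genotype frequencies of an F2 population and of fully inbred lines); the function names are hypothetical and it is not the three-locus LD analysis mentioned above.

```python
import numpy as np

def dominant_score(dosage):
    """Presence/absence score: 1 if at least one copy of the detected allele is present."""
    return (np.asarray(dosage) > 0).astype(int)

def r2_score_vs_dosage(genotype_freqs):
    """Squared correlation between the dominant score and the true dosage (0, 1, 2)."""
    f = np.asarray(genotype_freqs, dtype=float)
    dosage = np.array([0, 1, 2])
    score = dominant_score(dosage)  # heterozygotes score like homozygotes: 0, 1, 1
    mean_d, mean_s = f @ dosage, f @ score
    cov = f @ (dosage * score) - mean_d * mean_s
    var_d = f @ dosage**2 - mean_d**2
    var_s = f @ score**2 - mean_s**2
    return cov**2 / (var_d * var_s)

f2_r2 = r2_score_vs_dosage([0.25, 0.50, 0.25])      # F2: half the plants are heterozygous
inbred_r2 = r2_score_vs_dosage([0.50, 0.00, 0.50])  # inbred lines: no heterozygotes

print(f"F2 population: r2 = {f2_r2:.3f}")
print(f"inbred lines : r2 = {inbred_r2:.3f}")
```

Under these assumptions the dominant score captures only about two-thirds of the variation in allele dosage in the F2, because heterozygotes cannot be distinguished from one homozygous class, whereas nothing is lost when all lines are homozygous.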
With bi-allelic markers, additional consideration must be given to the genetic sources used in marker development. A comparison of Australian and Bovine HapMap samples found differences in the presumptive selection signatures when different breeds or SNPs were used. Based on these results, the authors suggested that the same SNPs must be used when comparing selection signatures among studies.
RAD and GBS marker systems, which can scan GW polymorphisms de novo, would bypass the need for prior marker development and instead allow direct genotyping of the training and breeding populations.
Traits
The main advantage of MAS is considered to be the lack of a requirement for phenotyping during selection cycles.
More strictly, there is no need for the phenotyping of traits that were previously investigated in a training population. In conventional breeding, multiple expressed traits are investigated during a whole growing period and after harvesting. Thus, all traits of interest to breeders should be investigated during the training phase to exclude phenotyping during the breeding cycle if the gain of ‘selection’ is regarded as equivalent between MAS and conventional breeding. In traditional MAS, only a few selected markers are used during the breeding phase, whereas GW genotypes are used in GS.
Therefore, MAS for multiple traits is performed more systematically in GS, because there is no need to change the marker set used during the breeding phase. To our knowledge, no reports have been published on the selection of traits for which trade-off relationships are observed in breeding materials, such as stress tolerance and quality. However, we consider that the selection of trade-off traits in GS will be a major issue in the near future, because the use of GW genotypes might help break up the trade-off relationships of targeted traits, although current GS does not consider the weight of each marker effect in the result.
Breeding scheme with GS
According to published reports, GS is not assumed to be a complete replacement for PS in plant breeding; instead, it is proposed as a method for accelerating part of a whole breeding programme. For example, one proposal is to use GS during the off-season to select among randomly mated DHLs that were pre-selected by PS for their test-cross performance in the regular season. By contrast, other authors proposed using GS for parental selection to generate the breeding population for the next selection cycle. For example, 288 inbred (F5) lines of winter wheat were assumed to be created by single-seed descent and genotyped. F5-derived lines were grown in the field to increase the seeds and were then selected for advanced testing based on their phenotypes and GEBV. Small numbers of F5-derived lines were selected based on GEBV to start recombination for the next cycle. In addition, phenotypic data from F5-derived lines were used for GS modelling in the next cycle. The proposed scheme suggested that GS fits well with recurrent selection approaches, which are not usually employed in the conventional breeding of selfing crop species. Interestingly, it was also proposed to use traditional MAS for important QTLs in the F2 and F3 generations, before GS in the F5 generation. This eliminates unnecessary marker scoring and greenhouse space for lines that do not carry essential QTL alleles. These propositions suggest the importance of introducing GS flexibly into breeding programmes and combining it with other approaches, i.e. traditional MAS and PS.
Computer package for GS modelling
An R package for GS is available on.
No user-friendly software comparable to QTL Cartographer or MapQTL, which are used in QTL analysis, has yet been developed. The development of a user-friendly software package is required to promote the general application of GS.
FUTURE PERSPECTIVES IN GS
Like conventional breeding and traditional MAS, GS cannot be used for the selection of traits with low heritability (in the narrow sense). Narrow-sense heritability is defined as the ratio of the additive genetic variance to the phenotypic variance. Thus, low heritability arises when the variance from sources other than additive genetic effects is high, such as environmental factors, G × E interactions, and dominance and epistatic genetic effects.
Previous studies of QTL identification suggest that the magnitudes of G × E are unequal, with some QTLs expressed in all tested environments and others expressed only in a particular environment. With GS, it has been indicated that different animals tend to be selected for the two environments when the genetic correlation between production in the two environments is.