Increased brain size is thought to have played an important role in the evolution of mammals and is a highly variable trait across lineages. Variations in brain size are closely linked to corresponding variations in the size of the neocortex, a distinct mammalian evolutionary innovation. The genomic features that explain and/or accompany variations in the relative size of the neocortex remain unknown. By comparing the genomes of 28 mammalian species, we show that neocortical expansion relative to the rest of the brain is associated with variations in gene family size (GFS) of gene families that are significantly enriched in biological functions associated with chemotaxis, cell–cell signalling and immune response. Importantly, we find that previously reported GFS variations associated with increased brain size are largely accounted for by the stronger link between neocortex expansion and variations in the size of gene families. Moreover, genes within these families are more prominently expressed in the human neocortex during early compared with adult development. These results suggest that changes in GFS underlie morphological adaptations during brain evolution in mammalian lineages.
Increased brain size in mammals when compared with other vertebrate taxa is thought to have played an important role in the expansion of this clade. Increased brain size during evolution has been previously related to increased behavioural complexity and the ability to cope with a changing environment [1,2]. However, the precise evolutionary drivers of brain size expansion in mammals and its relation to behavioural ability are still unclear and remain a topic of much interest and debate. This is complicated by the fact that different mammalian clades have differences in the degree of size-related changes in brain tissue . Generally, large brains differ from small brains in having larger neuronal soma sizes , increased numbers of non-neuronal cells, in particular glia [5,6], and lower overall neuron density . Large brains, however, are associated with a high metabolic cost [8–11] as well as higher demands of parental investment and delayed sexual maturation [12–16].
Brain size is a highly variable trait among mammalian and non-mammalian species with marked differences observed even between relatively close species [17–21]. Because brain size is closely associated with variations in body mass across species , comparative studies of brain size frequently use a corrected measure of brain size, known as encephalization index (Ei), which provides a measure of how much brain size is above (or below) what is expected for a given body size. While Ei is commonly regarded as an index that aligns more closely with behavioural capacity [1,23,24], many studies have also related behavioural complexity directly to the actual size of specific brain regions as well as to relative brain size as a whole [25–29]. Changes in relative brain size (or Ei) on the other hand, are not necessarily the result of a proportional expansion of all brain structures. In many mammalian lineages, most variations in encephalization index are closely linked to changes in the size of the neocortex [30–32], a distinctive structure of the mammalian brain and one of the most salient evolutionary innovations of the mammalian lineage [33–36].
The characteristic increase in the size of the neocortex relative to the rest of the brain has long been considered one of the primary targets of selection during mammalian brain evolution [37–39]. Increases in the absolute size of the neocortex are related to an increase in the number of functionally distinct neocortical areas [40–42], potentially allowing more complex information processing and the emergence of new behaviours . In comparative studies in primates, for instance, relative size of the neocortex has been correlated with social group size [12,44,45] (but see ), and it has been speculated that the number of neocortical neurons may be a limiting factor in determining the number of social relationships mammals can effectively establish and manage . More neocortical areas may be found in larger brains due to the lower marginal cost of devoting additional neural tissue to increasingly specialized functions , and an increasing number of neocortical areas may facilitate a more elaborate processing of sensory and motor information [48,49].
In the hominid lineage, the expansion of the neocortex is thought to have played a key role in the evolution of modern humans , including specialized areas involved in processing and production of language [51,52] as well as areas involved in identification of faces [53,54] and locations [55,56]. The neocortex in humans is widely regarded as the primary seat for the so-called higher cognitive functions, including self-awareness, consciousness, abstract reasoning and planning [57–63]. Development of the neocortex extends well into adolescence in humans and, although the structure of the layers in the neocortex is established during early prenatal development , the neocortex keeps growing in childhood and adolescence, reaching a peak in thickness on average at around 13 years of age, while myelination of some cortical regions can still continue after 20 years of age .
Despite the importance of the neocortex, the genomic features underlying its expansion during mammalian evolution remain poorly understood [30,66,67]. So far, there have been few efforts to identify features reflecting the genomic impact of brain evolution. Dorus and co-workers  reported accelerated sequence evolution of genes functioning in the nervous system during human origins, but this pattern was contested by later studies [69,70]. A genome-wide analysis of amino acid composition across 37 fully sequenced mammalian genomes showed that encephalization is significantly correlated with overall protein amino acid composition, although the causes of this pattern remain unclear .
Changes in gene family size (GFS) can reflect changes in the relative relevance of specific functions in an organism. Gene duplication events have been proposed to play a major role in the origin of novel gene functions and expression patterns [72,73]. Marked differences in GFS have been identified in Drosophila and vertebrates, with families experiencing the largest changes being enriched in distinct biological functions [74–76]. Among mammals, marked differences in the number of olfactory receptors are likely to reflect variations in the reliance of different lineages on their sense of smell [77–80]. A recent study found that encephalization in mammalian lineages is associated with significant variations in GFS, with a significant enrichment of genes associated with immune system response, chemotaxis and cell–cell signalling functions among the most positively associated gene families . Here, we investigate if variations in the relative size of the neocortex or neocortex to brain size ratio (Nr) are associated with changes in GFS in mammalian lineages, and whether the extent to which any changes in GFS associated with Nr could explain previously reported associations between GFS variations and encephalization. We further explored whether any associated correlations between Nr and GFS are functionally reflected by the specific patterns of expression of Nr-associated families in the developing neocortex in humans.
2. Material and methods
2.1. Gene family annotations
Annotated gene families encompassing 28 fully sequenced mammalian genomes were obtained from Ensembl release 76  (http://www.ensembl.org. Ensembl release 76). In the context of this annotation, Ensembl families are defined by clustering all Ensembl proteins along with metazoan sequences from UniProtKB. Any given gene family constitutes a group of related genes that includes both paralogues within the same species and orthologues and paralogues from other species. Any given gene can only be assigned to a single gene family. GFS in a given family for a given species was calculated as the total number of member genes contained in that gene family, for that particular species. In this study, we included all gene families with members present in at least six of the 28 mammalian species (n = 11 943). We excluded from this study any gene family with no variance in GFS across species.
2.2. Phenotype data
Body mass-corrected values of brain mass, known as encephalization index (Ei), were computed as The slope (b) was estimated as 0.64  based on a log–log least-squares linear regression of brain mass against body mass data from 493 mammalian species (table 1). Neocortex volumes were compiled from available literature (table 1), and include the grey and white matter of the cerebral cortex. Grey matter from palaeocortical structures (entorrhinal cortex, schizocortex, hippocampus and amygdala) were excluded. Nr was defined as after Dunbar . Maximum lifespan (MLSP) for each species was obtained from the animal ageing and longevity database, AnAge . Brain region volumes and corresponding sources as well as encephalization indexes and MLSP for all included species are shown in table 1.
2.3. Correlation coefficients of gene family size and different phenotypes
Pearson's correlations between GFS values and the three phenotypes, Ei, Nr or MLSP, for all 11 943 gene families included in the study, were calculated using R-based statistical functions. To determine the statistical significance of any potential shift in the distribution of Pearson's correlation coefficients when compared to random expectation, 10 000 Monte Carlo simulations of the expected distribution based on random permutations of GFS values across species were conducted and contrasted with the observed distribution of correlation coefficients using a Z-score test.
2.4. Confounding variables and phylogenetically controlled correlations
In order to remove the effect of Ei and MLSP on Nr, we calculated residuals for the multivariate regression of Nr ∼ Ei + MLSP (with Nr as the response variable and Ei and MLSP as independent covariates). For consistency, we used the exact same approach to obtain similar corrected estimates for all GFS values after correcting for any potential effects of Ei and MLSP. This was done by extracting the residuals for the multivariate regression GFS ∼ Ei + MLSP for each individual gene family. The resulting sets of residuals where then used to obtain phylogenetic independent contrasts (PIC) to further account for any effect of phylogenetic relationships on these variables . The resulting independent contrasts were finally used to assess the final corrected association between Nr and GFS by simply using standard Pearson's correlations forced through the origin. The same analysis was carried but using Nr and MLSP instead as independent covariates to generate residuals for all Ei and GFS values, from the multivariate regressions Ei ∼ Nr + MLSP and GFS ∼ Nr + MLSP respectively, followed by extraction of the corresponding PIC to assess the unbiased association between Ei and GFS (figure 1). PIC analysis was computed using the ape package in R . Ultrametric phylogeny of the 28 analysed mammalian species was obtained from TimeTree database  (http://www.timetree.org/. TimeTree2).
2.5. Gene Ontology term enrichment
Gene Ontology (GO) annotations per species for biological process domains were obtained from Ensembl's Biomart release 76 . A GO term was associated with a family whenever that term was linked to any of its members in any species. To minimize the effect of very small functional categories, only terms linked to at least 200 gene families were examined (n = 116). GO terms with less than 200 gene families were assigned to a ‘small biological process GO terms’ category while gene families not annotated to any GO term in any species were grouped into a ‘not annotated’ category. Enrichment analysis of these GO terms was carried out as described in Castillo-Morales et al. . Briefly, over-representation of genes associated with specific GO terms was assessed by counting the number of gene families assigned to each GO term within the analysed set of gene families. Statistical significance was numerically assessed by obtaining the expected number of families per GO in 1000 equally sized random samples derived from the overall population of gene families. Because genes vary in the number of GO terms associated with them, we adjusted for differences in the density of GO annotations between the test and background samples, by dividing the family counts per GO from each sample by the sample's average number of GO annotations per family.
2.6. Gene expression before and after full cortical maturation
RNAseq RPKM normalized expression data summarized to genes were obtained from the NIMH Transcriptional Atlas of Human Brain Development database  (http://brainspan.org. BrainSpan Atlas of the Developing Human Brain) for a total of 143 post-mortem human brain samples corresponding to 11 cortical regions across 13 different ages. The cortical regions include primary auditory cortex (core) (A1C), dorsolateral prefrontal cortex (DFC), posteroinferior (ventral) parietal cortex (IPC), inferolateral temporal cortex (area TEv, area 20) (ITC), primary motor cortex (area M1, area 4) (M1C), anterior (rostral) cingulate (medial prefrontal) cortex (MFC), orbitofrontal cortex (OFC), primary somatosensory cortex (area S1, areas 3,1,2) (S1C), posterior (caudal) superior temporal cortex (area TAc) (STC), primary visual cortex (striate cortex, area V1/17) (V1C) and ventrolateral prefrontal cortex (VFC). The samples covered developmental stages 16, 24, 37 post-conception weeks, four months after birth and 1, 3, 8, 13, 19, 21, 30, 36 and 37 years old. Gene expression data were further normalized against the total expression per sample, and divided into two developmental groups, corresponding to the periods before and after full maturation of cortical thickness, which occurs at about 13 years of age in humans . For each gene, expression levels were averaged across stages and structures of the same developmental window and comparisons between developmental windows were carried out by means of paired Wilcoxon tests.
In order to assess the association between gene family size, GFS, and neocortex expansion, Nr, values were compiled from the literature for 28 mammalian species with fully sequenced genomes (table 1). GFS was calculated for a total of 11 943 non-overlapping families. Pearson's correlation coefficients between GFS and Nr were then calculated for each gene family. We found a significant over-representation of gene families with positive associations between GFS and Nr (figure 1) (χ2 = 2973.263083, p < 1 × 10−20). A Monte Carlo simulation showed that the observed shift in the distribution towards positive values is statistically significant when compared with random expectation (Z-score for observed mean R = 2.225819868, p = 0.013).
In order to assess whether the observed bias towards strong correlations between GFS and Nr preferentially involves gene families associated with specific biological functions (as opposed to random sets of gene families), we assessed the statistical over-representation of functional annotations (annotated GO terms per gene family, see Material and methods) for the 440 gene families found to be significantly associated with Nr (rNr, GFS > 0 and FDR < 0.05). A total of 18 GO functional categories were found to be significantly enriched (FDR < 0.05) among Nr-associated gene families including immune response, negative regulation of endopeptidase activity, chemotaxis, cell–cell signalling, neuropeptide signalling pathway and G-coupled receptor signalling pathway (figure 2). Notably, genes with no functional annotations showed the highest over-representation.
As Nr is known to be highly correlated to relative brain size, the observed association between Nr and changes in GFS could be explained in principle by a previously reported association between GFS and relative brain size . Indeed, after calculating correlation coefficients between GFS and Ei (a commonly used index of brain size relative to body mass) for each gene family in the same set of 28 species used in this study (figure 1), we also found a significant shift in the distribution favouring positive associations. This finding is consistent with a previously reported study using a larger set of 39 species . The shift in the distribution of correlation coefficient values, however, was found to be stronger for Nr compared with Ei with the significance of the deviation for the latter being lower, (ZEi = 1.70943, p = 0.044, figure 1). Functional annotation enrichment analysis revealed a total of 17 GO term categories enriched among the set of gene families found to be significantly associated with Ei (rEi, GFS > 0 and FDR < 0.05), with a strong overlap with the 18 GO functional categories found overrepresented among Nr-associated gene families (Jaccard index = 0.67) (figure 2).
To assess whether variations in GFS associated with Nr are secondary to the relationships between Ei and GFS, we obtained Ei-corrected residuals for Nr and GFS. In addition, due to a known relationship between encephalization and MLSP in mammals we also corrected for the potential effect of this trait [83,105]. Finally, in order to remove any phylogenetic signal from the correlations between our traits of interest and GFS arising from interrelatedness among species, we used the above Nr and GFS residuals to conduct a PIC analysis (see Material and methods). This phylogenetically corrected analysis of GFS and Nr residuals revealed a total of 272 families significantly associated with Nr after correction for multiple testing (phylogenetically controlled r's > 0, FDR < 0.05; electronic supplementary material, table S1). By contrast, phylogenetically controlled correlations between equivalent GFS and Ei residuals (correcting for the effect of Nr and MLSP, see Material and methods) resulted in no gene families with a significant association after correcting for multiple testing. Nr-associated gene families after this correction against confounding variables were found to be enriched in GO terms including immune response, negative regulation of endopeptidase activity, chemotaxis, cell–cell signalling and neuropeptide signalling pathway (figure 2).
If the association between GFS and Nr responds to the functional demands imposed by the development of a large neocortex, we should expect genes associated with families displaying a high correlation with Nr to also display a pronounced level of activity prior to full cortical maturation (when full cortical thickness is reached), compared with later stages. To this end, we used available gene expression data derived from human neocortex obtained from the BrainSpan Atlas of the Developing Human Brain  (see Material and methods). We found that gene members of this set of families showed higher expression levels during human development prior to the neocortex reaching maximum thickness (which in humans occurs around the age of 13 years) compared with later stages, reflecting a transcriptional signature of the potential involvement of some of these genes in the development of the neocortex (p = 0.00013).
The expansion of the neocortex observed in several mammalian lineages is considered to be linked to a proliferation of new cortical areas driving increased cognitive capabilities [40,106]. The genomic drivers shaping the evolution of the brain and its morphology remain, however, poorly understood. By comparing the genomes of 28 mammalian species, here we have assessed the potential association between changes in GFS and the expansion of the neocortex. We show that neocortical expansion is indeed strongly and specifically associated with variations in GFS in mammals. Furthermore, variations in relative neocortical size account for a high proportion of the previously reported links between GFS and changes in encephalization across mammalian species . This suggests that changes in GFS in line with relative brain size in mammals are actually secondary to the underlying correlation between neocortex size and encephalization. Analysis of available human neocortex gene expression data revealed that genes in families strongly and specifically associated with neocortex size variations also show significantly higher levels of expression at stages of development before (but not after) maximal cortical development is reached in humans, thereby supporting a functional role for these gene families in the ontological development of a large neocortex. Among the 272 gene families whose size was found robustly correlated with relative neocortex size, even after correcting for encephalization, MLSP and phylogenetic relationships, 16 distinct biological functions (GO terms) were found to be significantly overepresented. Among these, cell–cell signalling and chemotaxis are known to play critical roles in the development and maintenance of the nervous system. Example of Nr-associated gene families annotated to these functions are the tyrosine kinase precursor family (ENSFM00730001521921), encoding receptor protein-tyrosine kinases and widely known to promote cell survival, proliferation, adhesion and migration in the central nervous system [107–109]. The leukotriene B4 receptor 2 family (ENSFM00680001303697) includes leukotriene B4, a proinflammatory signalling molecule which has been shown to mediate regulation of neural stem cell proliferation and differentiation .
Several immune-related biological functions (inflammatory response, defence response to bacteria, immune response, defence response and positive regulation of I-κB kinase/NF-κB signalling) were also enriched among Nr-associated gene families. Along these lines in recent years, numerous immune-related signalling and regulatory components have also been shown to play key physiological roles in the developing and adult nervous system (for a review see ). This involvement of individual immune-related signalling components in neural functions has been shown to be part of a wider genetic network of immune-related molecules acting as an intrinsic component of the neural-specific regulatory machinery that ultimately shapes the normal development of the nervous system . Thus, for instance, members of the tumour necrosis factor (TNF) receptor superfamily (ENSFM00500000273041, a gene family found to be highly associated with neocortex expansion here), are themselves part of the extensively studied canonical pathway of activation of the transcription factor NF-κB during early development of the nervous system .
Interestingly, gene families with no reported functional annotations for any of its members in any species showed the highest enrichment among the gene families with the highest positive associations with relative neocortex size. Among these families, we found the neuroblastoma breakpoint gene family (ENSFM00250000000879), whose members contain DUF1220 domains. DUF1220 domains have been previously linked to brain and cortical expansion in primate species, particularly in the human lineage [113,114]. Polymorphic deletions and duplications of DUF1220 domains have been associated with brain size variations in normal individuals from different human populations as well in pathological cases including microcephaly and macrocephaly [115,116]. Moreover, it has been proposed that proteins containing this domain have an important role during cortical neurogenesis, as they promote proliferation in neural stem cells , and during normal development they are expressed in the sub-ventricular zone precisely during the period of cortical neurogenesis .
Of particular importance to build a larger neocortex is the control of successive rounds of proliferation during early development, where the interplay between symmetric and asymmetric cell division is thought to be critical in shaping the particular morphology of the neocortex . Consistent with this, one gene family with significant GFS changes in line with increased relative neocortex size is the ENSFM00250000003440 gene family of epithelial cell adhesion molecules, which in turn include known human developmental regulators such as EPCAM and TACSTD2. EPCAM has been shown to be involved in cell proliferation, differentiation and migration in diverse cell types [118,119] and could thus play an important role in neocortex development. A more numerous gene family found was the speedy gene family (ENSFM00740001589497), which encodes proteins able to bind CDKs but having no similarity with cyclins, and some of its members are known to play a role in the regulation of cell cycle [120,121]. While a great variation in gene numbers across species has been documented in this family , here we report the first evidence of a strong association between these variations in this family and relative neocortex size in mammals.
In summary, we have identified a set of gene families whose sizes are positively associated with an expanded neocortex, providing new insights into neocortex evolution. Moreover, as aberrant development and degeneration of cortical neurons has been linked with a variety of mental health pathologies and dementias [123,124], identifying genomic signatures associated with the evolution of larger brain size and neocortex expansion will critically contribute to our understanding of the molecular pathways involved in the development and maintenance of cortical areas in highly encephalized mammals including humans. As these pathways may not be present or developed to the same extent in less encephalized mammalian species, our finding could help to fill existing gaps in current knowledge gained from widely used rodent models.
The electronic supplementary material, table S1 shows gene families significantly associated with Nr after correcting for the effect of Ei and MLSP and phylogenetic relatedness.
All authors contributed to the conception and design of the study. A.C.-M. and J.M.-S. carried out analyses presented. A.C.-M., A.A.d.S., A.O.U. and H.G. wrote the manuscript with contributions from all authors.
We have no competing interests.
This study was supported by a PhD CONACYT scholarship for A.C.-M and J.M.-S., a Royal Society Dorothy Hodgkin Research Fellowship (DH071902), a Royal Society research grant (RG0870644) and a Royal Society research grant for fellows (RG080272) to A.U.O. and a University of Lincoln PP grant to H.G.
Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3489039.
- Received April 30, 2016.
- Accepted September 8, 2016.
- © 2016 The Authors.
Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.