Genomics and genetics of Sulfolobus islandicus LAL14/1, a model hyperthermophilic archaeon

The 2 465 177 bp genome of Sulfolobus islandicus LAL14/1, host of the model rudivirus SIRV2, was sequenced. Exhaustive comparative genomic analysis of S. islandicus LAL14/1 and the nine other completely sequenced S. islandicus strains isolated from Iceland, Russia and USA revealed a highly syntenic common core genome of approximately 2 Mb and a long hyperplastic region containing most of the strain-specific genes. In LAL14/1, the latter region is enriched in insertion sequences, CRISPR (clustered regularly interspaced short palindromic repeats), glycosyl transferase genes, toxin–antitoxin genes and MITE (miniature inverted-repeat transposable elements). The tRNA genes of LAL14/1 are preferential targets for the integration of mobile elements but clusters of atypical genes (CAG) are also integrated elsewhere in the genome. LAL14/1 carries five CRISPR loci with 10 per cent of spacers matching perfectly or imperfectly the genomes of archaeal viruses and plasmids found in the Icelandic hot springs. Strikingly, the CRISPR_2 region of LAL14/1 carries an unusually long 1.9 kb spacer interspersed between two repeat regions and displays a high similarity to pING1-like conjugative plasmids. Finally, we have developed a genetic system for S. islandicus LAL14/1 and created ΔpyrEF and ΔCRISPR_1 mutants using double cross-over and pop-in/pop-out approaches, respectively. Thus, LAL14/1 is a promising model to study virus–host interactions and the CRISPR/Cas defence mechanism in Archaea.


heterotrophically [2]. Sulfolobus strains have been isolated from various acidic thermal habitats (in the USA, Italy, Iceland, Russia and elsewhere). They are easily maintained under laboratory conditions, making them convenient models to study the molecular organization of the archaeal cell [3].
The strain S. islandicus LAL14/1 was isolated in 1995 from a solfataric field in Iceland by the group of Zillig [8]. Its geographical origin, growth requirements and physiology indicate that LAL14/1 is a close relative of two S. islandicus strains also isolated from Iceland, HVE10/4 and REY15A. However, LAL14/1 has a particular pattern of sensitivity to various archaeal viruses. LAL14/1 is resistant to the rudivirus SIRV1 but can be efficiently infected by its close relative SIRV2 [10], which has a complex cycle of development in the host cells. At the end of the infection cycle, which lasts about 14 h, specific pyramid-like structures are formed on the cell surface, facilitating release of virus particles [11][12][13]. These unique characteristics make S. islandicus LAL14/1 an interesting model to study virus-host interactions in Archaea.
In this study, we report the results of the in silico analysis of the genome sequence of S. islandicus LAL14/1 and detailed comparisons with other available S. islandicus strains, in particular the closely related strains REY15A and HVE10/4. We have also established genetic tools for this strain by creating both ΔpyrEF and ΔCRISPR_1 mutants. This work makes substantial progress towards applying powerful global approaches (for example, transcriptome, RNAseq and proteome analyses) to elucidate the interplay between host and viral genes and proteins during the viral infection cycle.

Strains growth
Sulfolobus islandicus strains were grown aerobically at 80°C and under constant agitation in rich medium containing 0.2 g l⁻¹ of Tryptone Peptone, 2 g l⁻¹ of sucrose and 1 g l⁻¹ of yeast extract. The minimal medium used to select Ura⁺ variant isolates was as described previously [2]. Ura⁻ mutants were selected on rich solid medium in the presence of 50 mg l⁻¹ of 5′-fluoroorotic acid.

Genetic experiment
3.2.1. PCR amplification pyrEF mutant. The following primers were used to amplify the locus, including the pyrEF operon and the upstream (1 kb) and downstream (1 kb) regions of the S. islandicus E233S chromosome: oligoUP, CAGTAGCTAAAACAATTGAAAGAGTAGGTG; oligoDOWN, CTAATGATGCTTGATAGAAGTATTTAGCGT. The PCR amplification was performed in 50 µl of reaction mixture containing 10 mM of each primer, 1 µl template, 10 µl 5× HF Phusion Buffer (Finzyme), 10 nM dNTPs and 0.5 µl pfu DNA polymerase (Finzyme) with the following conditions: 30 s at 98°C, 60 s at 55°C and 90 s at 72°C for 35 cycles.
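As a quick sanity check on the primer pair listed above, the following minimal sketch reports the length and GC content of each oligonucleotide. The sequences are those given in the text with the internal line-break space removed; the helper name `gc_fraction` is illustrative and not part of the original protocol.

```python
# Primer sequences as given in the text (internal line-break spaces removed).
PRIMERS = {
    "oligoUP": "CAGTAGCTAAAACAATTGAAAGAGTAGGTG",
    "oligoDOWN": "CTAATGATGCTTGATAGAAGTATTTAGCGT",
}


def gc_fraction(seq: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


for name, seq in PRIMERS.items():
    print(f"{name}: length={len(seq)} nt, GC={gc_fraction(seq):.1%}")
```

Both primers are 30 nt long and AT-rich, which is in line with the low GC content typical of Sulfolobus genomes.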

Genome sequencing
Total DNA was extracted from the cells using phenol-chloroform and ethanol precipitation. The sequencing was done by Fidelity System Inc. using Illumina technology and assembled using the software VELVET v. 1.2 [23]. The genome was automatically annotated and refined manually. Open reading frames (ORFs) were predicted using phred, phrap and consed software [24][25][26] and tRNA with tRNAscan-SE [27]. Putative insertion sequence (IS) elements were identified by BLASTn search against the IS Finder Database (http://www.is-biotoul.fr/). Annotations were manually curated using UGENE software [28].
The DNA sequences of all genomes were aligned using the progressiveMauve algorithm with default parameters and analysed with stripSubsetLCBs. The S. islandicus core genome sequence was used for phylogenetic dating based on the standard rate of accumulation of random mutations in hyperthermophilic Archaea (4.66 × 10⁻⁹ substitutions per site per year; [9]). Clonal genealogy was inferred using the ClonalFrame algorithm three times.
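The dating step rests on simple arithmetic: a per-site distance divided by the substitution rate. The sketch below illustrates that conversion only; it is not the ClonalFrame procedure. The function name, the `branches` parameter (whether the distance is split over two diverging lineages) and the example distance are assumptions introduced for illustration.

```python
import math

# Substitution rate quoted in the text (substitutions per site per year).
RATE_PER_SITE_PER_YEAR = 4.66e-9


def divergence_time_years(distance_per_site: float,
                          rate: float = RATE_PER_SITE_PER_YEAR,
                          branches: int = 2) -> float:
    """Convert a distance (substitutions per site) into years.

    With branches=2 the distance is assumed to have accumulated along both
    lineages since their common ancestor; use branches=1 for a single branch.
    """
    if rate <= 0 or branches not in (1, 2):
        raise ValueError("rate must be positive and branches must be 1 or 2")
    return distance_per_site / (branches * rate)


# Hypothetical pairwise distance, chosen only to illustrate the arithmetic.
example_distance = 9.32e-4
print(f"{divergence_time_years(example_distance):.0f} years")
```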

Dot-Plot
The UGENE Dot-Plot algorithm with the option 'search for inverted repeats' enabled was used to generate dot-plots from pairs of sequences. Sequences from the 10 S. islandicus strains were aligned with the progressiveMauve algorithm with scoring parameters divided by four to ensure the recognition of large conserved regions.
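The principle behind a dot-plot with inverted-repeat detection can be sketched as an exact k-mer comparison on both strands. This is not the UGENE implementation; the function names and the word size `k` are assumptions chosen for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def dotplot_hits(seq_a: str, seq_b: str, k: int = 8, inverted: bool = True):
    """Yield (pos_a, pos_b, strand) for every exact k-mer shared by two sequences.

    strand is '+' for a direct match and '-' when the k-mer in seq_a equals
    the reverse complement of the k-mer starting at pos_b in seq_b.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    # Index every k-mer of seq_b by its start position.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    for i in range(len(seq_a) - k + 1):
        word = seq_a[i:i + k]
        for j in index.get(word, []):
            yield i, j, "+"
        if inverted:
            for j in index.get(reverse_complement(word), []):
                yield i, j, "-"
```

Plotting the '+' hits gives the main diagonal for syntenic regions, while '-' hits trace anti-diagonals that mark inversions.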

Exceptional motifs
R'MES software [29] was used to evaluate the significance of motif frequency in S. islandicus genomes. This statistical method compares the observed count of each motif to the count predicted by a reference probabilistic Markovian model. Exceptional motifs of lengths two to eight were analysed choosing models according to the authors' guidelines (http://migale.jouy.inra.fr/?q=method) and using the highest possible order each time (usually l - 2, where l is the length of the studied motif).
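To make the comparison of observed and predicted counts concrete, the sketch below uses the standard plug-in estimator for a Markov model of maximal order (word length minus two). It only illustrates the observed/expected idea and does not reproduce the significance scores computed by R'MES; all function names are illustrative.

```python
def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    n = len(word)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] == word)


def expected_count_max_order(seq: str, word: str) -> float:
    """Expected count of word under a Markov model of order len(word) - 2.

    Uses the plug-in estimator N(prefix) * N(suffix) / N(core), where prefix
    and suffix are the word minus its last or first letter and core is the
    word minus both.
    """
    if len(word) < 3:
        raise ValueError("word must have at least three letters")
    n_core = count_overlapping(seq, word[1:-1])
    if n_core == 0:
        return 0.0
    return (count_overlapping(seq, word[:-1])
            * count_overlapping(seq, word[1:]) / n_core)


def over_representation(seq: str, word: str) -> float:
    """Observed/expected ratio; values above 1 suggest over-representation."""
    expected = expected_count_max_order(seq, word)
    return count_overlapping(seq, word) / expected if expected else float("nan")
```

A ratio well above 1 flags a candidate over-represented word and a ratio well below 1 an avoided word; whether the deviation is significant requires the statistical treatment provided by the dedicated software.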

PAM motifs
To find PAM motifs, protospacers corresponding to the selected spacers listed in the electronic supplementary material, table S15 were aligned and visualized with WEBLOGO (http://weblogo.berkeley.edu/logo.cgi). For each protospacer, the regions analysed were 10 nucleotide-long sequences immediately upstream and downstream from the region identical or similar to the spacer.
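The flank extraction described above can be sketched as follows. The 10-nucleotide width comes from the text; the coordinate convention (0-based, end-exclusive) and the function names are assumptions. The per-column counts are the kind of tally a sequence logo summarizes.

```python
from collections import Counter


def flanks(target: str, start: int, end: int, width: int = 10):
    """Return (upstream, downstream) flanks of the protospacer target[start:end].

    Coordinates are 0-based and end-exclusive; a flank is shorter than width
    when the protospacer lies near an end of the target sequence.
    """
    upstream = target[max(0, start - width):start]
    downstream = target[end:end + width]
    return upstream, downstream


def column_counts(aligned: list[str]) -> list[Counter]:
    """Count bases per column for a set of equal-length flank sequences."""
    if len({len(s) for s in aligned}) > 1:
        raise ValueError("all flanks must have the same length")
    return [Counter(column) for column in zip(*aligned)]
```

A column dominated by one or two bases immediately next to the protospacer is a candidate PAM position.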

Sulfolobus islandicus pan-genome characterization
The pan-genome of the 10 available S. islandicus strains was obtained with an in-house program [31]. Briefly, the proteins encoded by the 10 genomes were compared using two-way   years later. The tree is divided into three main clades corresponding to the geographical origins of the strains: clade I from Iceland; clade M from Kamchatka (Russia) and clade L/Y from Lassen and Yellowstone (USA) [9,33]. The general features of the 10 S. islandicus genomes are reported in table 2. LAL14/1 has the smallest genome, apparently due to the small number of horizontally transferred CAG regions (see below). It carries a remarkably complex CRISPR system, with more spacers than the other S. islandicus genomes, some matching the virus SIRV1 perfectly.

The Sulfolobus islandicus pan-genome and specific genomic pattern of the strain LAL14/1
The first version of the S. islandicus pan-genome was published in 2009, based on the analysis of seven strains; it includes 20 610 proteins [9]. Three additional S. islandicus genome sequences have since become available (REY15A, HVE10/4 [7] and LAL14/1 (present work)). The new updated version of the S. islandicus pan-genome has 27 578 proteins (in silico prediction) that can be divided into 3492 families of orthologous proteins (see §3). The statistics of the family distribution are presented in the electronic supplementary material, figure S1. There are 1892 ubiquitous families, present in at least one copy in all of the S. islandicus genomes. This group, indicated in the annotation by arCOG + number, constitutes the S. islandicus core-genome.
There are 1030 families present in two or more (but not all) strains. The best-represented families of S. islandicus include various transposases, ABC transporters and CoA pathway genes (see the electronic supplementary material, table S1). The majority of these families are present in the LAL14/1 genome, and some are overrepresented (more frequent than predicted from the average pan-genome statistics), for example, the transposases belonging to the IS1 family. Others, for example transposases of families ISH3 and IS110, are clearly underrepresented.
The taxonomical distribution of the individual genes and singletons of S. islandicus LAL14/1 is summarized in table 3. Of the 2601 annotated protein-coding genes, only 4.7 per cent are exclusive to this strain; 10 per cent are only found in S. islandicus species and 37.6 per cent are specific to Sulfolobales. The 51 S. islandicus LAL14/1-specific singletons, and the seven S. islandicus-specific and seven Crenarchaeota-
Table 2. Comparison of genomic patterns of 10 S. islandicus strains (compilation of our own data and those from [7,9]). CAG are discussed elsewhere in the text. SIRV1 and SIRV2 are rudiviruses described in [10,34,35]. R, resistant; S, sensitive; nd, no data. CRISPR and Cmr families are indicated following established classification [36,37].
The specific genomic pattern of LAL14/1 is presented in table 4. Like all the other studied S. islandicus genomes, LAL14/1 codes for a large number of transposases representing families composed of multiple paralogues. In this strain, the most represented family is the protein OrfB encoded by IS200/IS605. The transposon ISC1048 of the family IS607 is also overrepresented. However, other transposase families, such as families IS110 and ISH3, are clearly underrepresented in LAL14/1. The families of orthologous proteins were compared between the S. islandicus strains LAL14/1, REY15A and HVE10/4, all isolated from the same geographical location. This revealed substantial similarity between the genomes of these strains: of the 2770 families analysed, 2130 (77%) are shared by these three S. islandicus genomes. A surprisingly large number of families are unique to each of these strains (figure 2), constituting a specific genomic signature for each of the strains.

Exceptional motifs in Sulfolobus islandicus LAL14/1 genome
Many non-coding motifs have specific biological functions in genomes and statistical analyses of oligomer frequencies in genome sequences can identify possibly significant motifs (e.g. reviewed in [38] for bacteria). Similar analyses have also been very useful for studying the evolution of genome functions and regulation [39]. Systematic analyses of short oligonucleotides of fixed composition (usually called words) were conducted with the LAL14/1 and other S. islandicus genomes to identify non-coding functional motif candidates. The word frequency and preference patterns are very similar in the three closely related S. islandicus strains, suggesting that their functional motifs and associated mechanisms are generally similar. Some more pronounced differences were observed for longer words, and this was mainly associated with the different genomic content of the large variable regions.
As commonly observed in archaeal and bacterial genomes, palindromic motifs are generally avoided (e.g. electronic supplementary material, table S4 for LAL14/1 and electronic supplementary material, table S5), possibly as a consequence of the presence of restriction-modification systems encoded in the genome [40] or of other biological phenomena such as the control of chromosome replication [38]. Among the analysed words (see the electronic supplementary material, tables S4 and S5), some are clearly overrepresented because of their presence in the repeats of the CRISPR sequences. For example, the overrepresented word ACTATAGA, included in the CRISPR repeats, is repeated 196 times (see the electronic supplementary material, figure S3).
For most of the other candidates, no obvious biological function could directly be inferred. There is evidence for eukaryotes that the non-random pattern of short words (two to four letters) may be due to evolutionary changes in informational processes such as DNA replication and repair, and the pattern of long words (eight letters) may reflect evolutionary changes in gene regulatory machinery. The presence of such long words may reflect a non-random frequency of the DNA-binding sites specific for transcription factors [39]. This observation might indicate the presence of eukaryotic-like transcriptional regulation in Archaea.

Structure and dynamics of the Sulfolobus islandicus genomes
The structural comparison of two closely related genomes, HVE10/4 and REY15A, was published in 2011 [7]. It revealed a high level of synteny of gene content for these genomes and the presence of two variable regions, one of about 0.5-0.7 Mb and a second corresponding to a 200 kb inversion. The structure of the LAL14/1 genome was compared with those of the HVE10/4 and REY15A genomes by dot-plot analyses (see the electronic supplementary material, figure S4). This revealed a well-conserved 2 Mb core region common to all three genomes, with substantial synteny conservation. Many rearrangements were detected in the variable regions of each genome. The localization, length and genetic context of all major differences detected by the dot-plot approach are listed in the electronic supplementary material, table S6. Previous analysis revealed a large inversion of 0.5 Mb as one of the major differences between HVE10/4 and REY15A genomes [7]. Genome sequencing of LAL14/1, in combination with the phylogenetic analysis (figure 1), allowed us to infer that the inversion occurred in the ancestor of HVE10/4, while LAL14/1 retained the ancestral genome organization. The program progressiveMauve [41] was used for alignments of the 10 S. islandicus genomes and visualization of genome rearrangements (figure 3). All the genomes have a common general organization, with a well-preserved part covering about 75 per cent of the genome and a long variable region. Small strain-specific variable regions are scattered throughout the conserved regions of all of the 10 genomes analysed; many correspond to the insertion of heterologous genes transferred horizontally (see below).
Each genome contains several relatively small specific regions and all of them have a unique long variable region situated in the same segment of the genome (see the  S7). These variable regions range in size from 587 to 802 kb and have very heterogeneous genetic contexts. The largest variable region in S. islandicus LAL14/1 is 608 kb (between positions 282 and 890 kb) and represents 24.7 per cent of the genome. It does not contain any known essential genes, such as those for tRNA, rRNA or ribosomal proteins, or the replication origin sites (ori). Variable regions are usually preferential sites for integration and accumulation of non-essential genes in the genome [7] and the LAL14/1 genome is not an exception. The variable region of LAL14/1 carries most of its integrative elements, including both functional and inactivated IS, MITEs (miniature inverted-repeat transposable elements), the two largest CAG regions, all of the identified CRISPR/cas and cmr modules, half of the toxin/antitoxin genes, and many of the putative glycosylase genes (figure 4).

Analysis of tRNA genes, integration events and horizontal gene transfer
The pattern of tRNA genes in S. islandicus LAL14/1 is the same as those in S. islandicus REY15A and HVE10/4 [7] and very similar to those in other sequenced S. islandicus strains. In Sulfolobales, the tRNA genes are preferential sites of integration of conjugative plasmids and fuselloviruses [4,42,43]. The mechanism of integration usually involves site-specific recombination between the tRNA gene target and the integrase gene (int) carried by an extrachromosomal element [44]. The presence of remnants of the corresponding int gene, overlapping the sequence of the tRNA gene target, often serves as a strong indication for an ancestral integrative event. The sequences of the remnants of the integrative elements are often incomplete or extensively degenerated, making their in silico identification challenging.
To identify the potential integrated extrachromosomal elements in the LAL14/1 genome, we set out to locate gene clusters enriched in homologues of proteins encoded by archaeal plasmids and viruses. For this purpose, all LAL14/1 proteins were compared (BLASTp) against the local protein database containing sequences of publicly available archaeal viruses (fifty-five) and crenarchaeal plasmids (twenty-five). Genomic loci containing at least five plasmid/viral homologues per 20 kb region were retained and manually inspected. This approach led to identification of four putative integrated elements. Notably, none of them appears to be functional, as judged from their incomplete gene complements when compared with 'autonomous' elements and lack of identifiable attachment sites. Three elements (SiL-E1 .SiL_1481]) are likely to be remnants of conjugative plasmids, while the fourth one (SiL_2367..SiL_2371) is related to SSV-like fuselloviruses. SiL-E2, SiL-E3 and the SSV-like element are located in the proximity of different tRNA genes, which probably served as their respective integration targets, while SiL-E1 was found within the CRISPR_2 locus (see below).
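The density criterion described above (at least five plasmid/viral homologues per 20 kb) can be sketched as a sliding-window scan over gene coordinates. This is one possible reading of the criterion, not the authors' script: the use of a single coordinate per gene, the merging of overlapping windows and the function name are assumptions.

```python
def dense_windows(hit_positions, window=20_000, min_hits=5):
    """Return merged (start, end) intervals holding at least min_hits genes
    within any span of `window` bp.

    hit_positions: genome coordinates (bp) of genes with a plasmid/viral homologue.
    """
    positions = sorted(hit_positions)
    regions = []
    left = 0
    for right, pos in enumerate(positions):
        # Advance the left edge until the span is at most `window` bp.
        while pos - positions[left] > window:
            left += 1
        if right - left + 1 >= min_hits:
            start, end = positions[left], pos
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend overlapping region
            else:
                regions.append((start, end))
    return regions
```

Loci returned by such a scan would still need the manual inspection described in the text, since a dense cluster of homologues does not by itself establish an integration event.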
In order to uncover the potentially more ancient integration events at the tRNA genes, which could have eluded identification using the criteria detailed above, we have inspected  In general, the pattern of insertions linked to the tRNA genes in the LAL14/1 strain is similar, but not identical, to those in REY15A and HVE10/4 (  [GGA]), the three strains carry nearly identical remnants of the same integrated elements. This suggests that the respective integration events occurred in the common ancestor of REY15A, HVE10/4 and LAL14/1.
Proviruses are common companions of archaeal genomes [45-47]. Thus, it was somewhat surprising not to find potentially functional proviruses in the LAL14/1 genome. The only virus-derived element of LAL14/1 integrated in the tRNA Thr [GGT] gene is a highly degenerated remnant of an SSV-like fusellovirus. Notably, the element does not appear to be closely related to any particular fusellovirus, since different genes display affinities to distinct fuselloviruses.
rsob.royalsocietypublishing.org Open Biol 3: 130010
Table 5. Overview of integration events targeting the tRNA genes in three S. islandicus strains, LAL14/1 (present work) and REY15A and HVE10/4 [7]. For the corresponding lanes of the table, the sequences found in three strains are given in footnotes a-c. In the CAG column the main digit gives the number of genes and the upper small digit indicates the CAG score rated from 1 (most atypical) to 10 (not atypical). CAGs written in bold are detailed in  Distantly related sequences. c Not mentioned by [7].
To gain an insight into the timeframe of this viral integration event, we analysed the equivalent loci in all available S. islandicus genomes. The traces of SSV integration were found in all S. islandicus strains, except for M.16.27, which contained a gene for the pNOB8-type integrase at the equivalent position [48]. Interestingly, in all cases the elements were severely degenerated; a selection of genomic alignments can be found in the electronic supplementary material, figure S5. The most parsimonious scenario for the observed distribution of SSV-like remnants in S. islandicus genomes involves a single event of SSV-like virus genome integration into the tRNA Thr [GGT] gene, followed by gradual deterioration of the provirus along the evolutionary history of S. islandicus species. The integration has probably occurred following the divergence of S. islandicus and S. solfataricus from their common ancestor, since S. solfataricus lacks a detectable SSV-like element at the equivalent genomic locus.
Identification of insertions by BLASTp analysis may be hampered by the insufficient conservation of the inserted genes or by limited coverage of the diversity of archaeal mobile genetic elements. Furthermore, some of the insertions could occur in loci other than the tRNA genes. To overcome these caveats, we have applied a BLASTp-independent approach based on the search of CAGs [17,49]. Following this approach, the putative integrated elements could be identified following their atypical codon usage compared with that of the conserved part of the host chromosome. For LAL14/1, HVE10/4 and REY15A the results obtained by this approach are summarized in table 6, and a brief general comparison of CAG distribution in 10 S. islandicus genomes is presented in the electronic supplementary material, table S9. Notably, nearly all insertions detected in LAL14/1, HVE10/4 and REY15A by the BLASTp analyses were confirmed by the CAG approach; in addition, some of the integrative events were only predicted by the CAG search. LAL14/1 has three CAG regions (CAG1-3) of 14.6, 32.5 and 2.5 kb that carry 43 genes with atypical codon usage. CAG3 was found to correspond to SiL-E2 element integrated into tRNA Phe [GAA] identified by BLASTp analysis, while CAG1 and CAG2 could not be predicted by other approaches.
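The idea of flagging genes by atypical codon usage can be illustrated with a simple profile distance between a gene and a reference set of conserved genes. This is a generic sketch and not the CAG scoring method of [17,49], whose details are not given here; the distance measure and the function names are assumptions.

```python
from collections import Counter


def codon_frequencies(cds: str) -> dict:
    """Relative codon frequencies of an in-frame coding sequence."""
    cds = cds.upper()
    # Ignore a trailing partial codon, if any.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(gene_cds: str, reference_cds: str) -> float:
    """Half the summed absolute difference between two codon-frequency profiles.

    Ranges from 0 (identical usage) to 1 (no codon in common).
    """
    gene = codon_frequencies(gene_cds)
    ref = codon_frequencies(reference_cds)
    return 0.5 * sum(abs(gene.get(c, 0.0) - ref.get(c, 0.0))
                     for c in set(gene) | set(ref))
```

In practice the reference would be the concatenated coding sequences of the conserved part of the chromosome, and runs of adjacent genes with high distances would be candidates for horizontally acquired clusters.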
Some of the functions identified as being associated with the CAG loci are: the restriction-modification system I characteristic of HVE10/4; some elements of CRISPR-based immunity in HVE10/4 and LAL14/1; and various enzyme families (methyl- and glycosyltransferases, hydrogenases). The genes transferred horizontally and integrated into the chromosomes of these S. islandicus strains include many transposons and toxin/antitoxin gene pairs of the vapBC family.
To summarize, LAL14/1 carries the remnants of 13 insertion events into the tRNA genes, and three additional elements (SiL-E1, CAG1 and CAG2) are integrated into other loci; the same or a similar number of insertions is found in HVE10/4 and REY15A. These results further illustrate that tRNA genes are frequently targeted by various mobile genetic elements. The observation that all (or at least the majority) of the integrated elements appear to be non-functional suggests that LAL14/1 possesses an efficient mechanism of purging its genome of unwelcome insertions.

Replication origins, oriC
The positions of the replication origins in the chromosome of LAL14/1 were predicted by two independent approaches, Z-curve [50,51] and ACCA-plot [52], that produced very similar results (see the electronic supplementary material, figure S6). Consistent with all other Sulfolobales genomes analysed in silico [7,53] or in vivo [54-56], three oriC origins of replication were detected. Their positions and genomic contexts are well conserved with respect to other Sulfolobus genomes (see the electronic supplementary material, table S10 and [7]). The oriC1 site (mapped at position 1.59 Mb) is linked to the cdc6-1 gene (SiL_0002), oriC2 (position 800 bp) is linked to the cdc6-3 gene (SiL_1733), and oriC3 (position 1.15 Mb) to the cdc6-2/whiP genes (SiL_1228/SiL_1206).
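For readers unfamiliar with the Z-curve representation, the sketch below computes its three cumulative components from a DNA sequence. It shows only the transformation itself, not the origin-calling procedure of [50,51]; the function name and the handling of ambiguous bases are assumptions.

```python
def z_curve(seq: str):
    """Return the cumulative Z-curve components (x, y, z) as three lists.

    x: purines (A/G) minus pyrimidines (C/T)
    y: amino bases (A/C) minus keto bases (G/T)
    z: weak pairs (A/T) minus strong pairs (G/C)
    """
    step = {
        "A": (1, 1, 1),
        "C": (-1, 1, -1),
        "G": (1, -1, -1),
        "T": (-1, -1, 1),
    }
    xs, ys, zs = [], [], []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = step.get(base, (0, 0, 0))  # ambiguous bases add nothing
        x, y, z = x + dx, y + dy, z + dz
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs
```

Candidate replication origins are typically sought at pronounced extrema of such curves along the chromosome and then checked against the genomic context, for example the proximity of cdc6 genes.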
The structures of the oriC1 and oriC2 sites in S. solfataricus P2 are well characterized [56,57], and these two replication origins are organized similarly in S. islandicus LAL14/1 (figure 5). Their central AT-rich UCM sequence (uncharacterized motif) is surrounded by the characteristic ORB sequences (origin recognition box). The oriC1 site also contains additional specific palindromic sequences, called C2 and C3, that are recognized by the replication initiation proteins Cdc6-1, Cdc6-2 and Cdc6-3 [56].
In Sulfolobales, the oriC3 site is usually linked to the whiP gene. A comparative analysis of oriC3 in the 10 S. islandicus strains and in S. solfataricus P2 provided new insights into the organization of this region: we identified three conserved ORB-like sites (nORB) upstream and downstream from the typical UCM site (figure 5). The oriC2 and oriC3 sites are organized similarly and, unlike oriC1, do not contain the C sequences.

Toxin-antitoxin systems
A family II (VapBC) toxin-antitoxin (TA) system is present in many Archaea and is very abundant in Sulfolobales [7,58,59]. All S. islandicus strains carry many TA gene pairs of the VapBC family as well as genes of another family considered to play a TA role [60]: HEPN-NT (Higher Eukaryotes and Prokaryotes Nucleotide-binding-Nucleotidyl Transferase; electronic supplementary material, table S11).
Eight of the 15 vapBC gene pairs in S. islandicus LAL14/1 map in the variable region of the genome. The other seven vapBC gene pairs map in the conserved part of the genome and all share a similar genetic context in the three strains analysed. For four of these loci (SiL_2040/2041, SiL_2042/2043, SiL_2080/2081, SiL_2253/2254), the genomic context is particularly well preserved. All vapBC loci, except SiL_2575/2576, are flanked by degenerated copies of IS elements, probably involved in the transposition of vapBC.
As observed in HVE10/4 and REY15A [7], the vapB (antitoxin) and vapC (toxin) genes of different subtypes were found in S. islandicus LAL14/1 in various combinations giving different variants of the vapBC operon (data not shown). This combinatorial diversity of vapBC gene pairs may indicate the existence of several types of the toxin/antitoxin mechanisms. All vapB and vapC gene combinations found in S. islandicus LAL14/1 are also found in HVE10/4 and REY15A, except for SiL_0413/0414 and SiL_0631/0632 combinations present in LAL14/1 and HVE10/4 but not in REY15A.
Members of the toxin/antitoxin family HEPN-NT were detected in all S. islandicus strains, with multiple copies of the corresponding genes (see the electronic supplementary material, table S12). Unlike the vapBC system, HEPN-NT gene pairs are stable and each HEPN gene type is strictly associated with its specific NT gene type. HEPN-NT operons are classified into two subfamilies, I and II. Subfamily I is ubiquitous and all of its representatives in the 10 S. islandicus genomes analysed both occupy the same genetic regions and are always localized in conserved parts of the genomes. Subfamily II is much more diverse. Its representatives in S. islandicus map in both conserved and variable regions of the chromosome. Note that many copies of HEPN-NT subfamily II pairs include only truncated forms of the HEPN gene and are not functional.
The production of sulfolobicins, a type of toxin that inhibits the growth of sensitive Sulfolobus strains, is characteristic of two other well-studied Sulfolobales, S. acidocaldarius and S. tokodaii [61][62][63]. No sulfolobicin-encoding genes, such as sulA, sulB and sulC, were found in any of the 10 S. islandicus genomes, indicating the absence of this toxin system from these species. Nevertheless, a truncated copy of the sulA gene, which obviously cannot code for a functional toxin, is present in S. islandicus REY15A [61].

UV-inducible type IV pili
Many Sulfolobales (e.g. S. solfataricus, S. tokodaii and S. acidocaldarius) code for a UV-inducible type IV pilus system that promotes cellular aggregation and efficient exchange of chromosomal markers [64,65]. The formation of pili is controlled by the UV-inducible ups operon which comprises five genes: upsX, upsE, upsF, upsA and upsB [4][5][6]. This operon is present in all 10 S. islandicus strains (and all other sequenced species of Sulfolobales, electronic supplementary material, table S13) and is in all cases in the conserved part of the genome. The Ups proteins encoded by all S. islandicus are very similar and form a specific phylogenetic group within the Ups family in Sulfolobales (see the electronic supplementary material, figure S7).
The in silico data strongly suggest that the ups locus of S. islandicus is functional in vivo, and it very probably plays the same biological role as in S. solfataricus, S. tokodaii and S. acidocaldarius.

Insertion sequence elements and miniature inverted-repeat transposable elements
Sulfolobus islandicus LAL14/1, like HVE10/4 and REY15A, contains several families of IS elements with members present in multiple copies [7] (see the electronic supplementary material, table S14).
The orfB-containing IS (families IS605 and IS200/605) considered to be ancestral for the archaeal domain [66,67] are overrepresented in all three S. islandicus strains analysed: 95% of the copies of the orfB gene not linked to orfA in HVE10/4, REY15A and LAL14/1 were predicted to be functional.
Only seven of the 53 structurally valid IS (IS with intact inverted terminal repeats (ITRs)) in the LAL14/1 genome are predicted to code for functional transposases; transposase genes in the remaining 46 IS are truncated. The IS patterns of the two other strains, HVE10/4 and REY15A, are very different. Most of the IS in these genomes (45/76 in HVE10/4 and 52/96 in REY15A) code for a full-length transposase and are therefore predicted to be functional. However, many of the mutated IS may be mobilized by transposases of the same family encoded in trans [68]. Thus, 62 of the 76 IS (81.6%) detected in HVE10/4 and 87 of the 96 IS (90.6%) in REY15A could, in theory, be mobile (see the electronic supplementary material, table S14); the proportion is lower for LAL14/1, for which 31 of the 53 IS (58.4%) are potentially active.
This may indicate greater genetic stability of S. islandicus LAL14/1 than of either HVE10/4 or REY15A. Were this the case, strain LAL14/1 would be the most attractive model for genetic manipulations.
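The proportions discussed above follow directly from the counts quoted in the text, as the short calculation below shows. The dictionary layout and the function name are illustrative; the numbers are those reported for the three strains.

```python
# Counts quoted in the text for each strain:
# (IS with intact ITRs, IS with a full-length transposase,
#  IS potentially mobile when trans-mobilization is included).
IS_COUNTS = {
    "LAL14/1": (53, 7, 31),
    "HVE10/4": (76, 45, 62),
    "REY15A": (96, 52, 87),
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


for strain, (total, functional, mobile) in IS_COUNTS.items():
    print(f"{strain}: {percent(functional, total):.1f}% with full-length transposase, "
          f"{percent(mobile, total):.1f}% potentially mobile")
```

The contrast is clearest in the first column: only a small minority of the structurally valid IS in LAL14/1 encode a full-length transposase, against more than half in the two other Icelandic strains.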
Another group of mobile elements in S. islandicus is MITEs, believed to correspond to truncated derivatives of autonomous DNA transposons [69][70][71][72][73]. MITEs exhibit the structural features of DNA transposons, containing terminal inverted repeats flanked by small direct repeats. The internal sequences of MITEs are short and devoid of ORFs. As non-autonomous elements, MITEs depend entirely on trans-acting transposases for their transposition [68,74,75]. Only two classes of MITEs, SMN1 (320 bp) and SM3A (164 bp), were detected in the 10 S. islandicus genomes (see the electronic supplementary material, table S15). The SM3A family is more numerous in LAL14/1 than in any of the other S. islandicus strains. The three S. islandicus strains from Iceland share two identical SM3A, but LAL14/1 also carries nine extra copies of SM3A that share only 95% similarity with the other two SM3A copies. SMN1 transposition is dependent on the presence of a functional ISC1733 transposase and SM3A transposition on ISC1058 [68,76,77]. The MITEs of the SMN1 type in LAL14/1, HVE10/4 and REY15A could be mobilized by the ISC1733 type transposase [76] predicted to be functional in these strains. The observation of the mobilization of SMN1 in S. islandicus REN1H1 is consistent with this prediction [76]. None of the three S. islandicus strains analysed codes for a functional ISC1058 transposase, suggesting that SM3A, although present, cannot transpose in HVE10/4, REY15A and LAL14/1.

CRISPRs: structure, targets and phenotype
All sequences and genes related to the CRISPR system present in S. islandicus LAL14/1 (CRISPR arrays, cas and cmr gene cassettes) map within the large variable genomic region. This region carries five CRISPR loci, three cas gene cassettes associated with subtype I-A and two cmr gene cassettes associated with subtype III-B [78,79]. On the basis of leader and repeat sequence composition, the five CRISPRs of S. islandicus LAL14/1 can be divided into two families, I and III [36], while the two cmr modules belong to families B and F ([37]; figure 6 and table 7).
The family I CRISPR locus comprises two oppositely oriented blocks of repeat-spacer arrays separated by the first module of Cas genes. The second module is situated at the end of one of the repeat-spacer arrays (figure 6a). The additional Cmr module is not linked to this part and maps several hundred kilobases downstream in the genome. The family III CRISPR (figure 6a) comprises three clusters of spacers/repeats associated with two gene modules, one including the cas gene and another cmr. The cmr gene order in this module is the same as that in the Cmr module associated with the family I CRISPR.
The analysis of 285 spacers forming the CRISPR arrays of LAL14/1 revealed a surprisingly high number of spacers that perfectly (in three cases) or imperfectly (30 cases; electronic supplementary material, table S16) match the genomes of rudiviruses, fuselloviruses and conjugative plasmids previously described in the Icelandic hot spring environments [34,35]. Two of the perfectly matching spacers, carried by CRISPR_1 and CRISPR_5, target the genome of the rudivirus SIRV1 [80]. Interestingly, the spacer in CRISPR_1 matches a SIRV1 gene encoding a protein, P98, responsible for formation of pyramidal structures involved in virion egress [11][12][13]. No spacers with 100 per cent identity to a closely related virus, SIRV2, were detected. The third perfectly matching spacer is identical to a sequence in the conjugative plasmid pARN3 [81]. Unexpectedly, we have identified an approximately 2 kb insertion, SiL-E1, in the CRISPR_2 locus. SiL-E1 resembles the typical CRISPR spacers in that it is interspersed between two identical repeats and is followed by additional spacer-repeat units (figure 6). This pseudo-spacer encompasses five ORFs and displays high sequence similarity to and collinearity with pING1-like conjugative plasmids of S. islandicus [81][82][83]. More specifically, SiL-E1 shares overall 82 per cent identity with plasmids pING1 [83] and pHVE14 [81]. SiL-E1 is not present in other S. islandicus strains. Notably, sequence similarity between SiL-E1 and pING1-like plasmids extends throughout the length of SiL-E1, leaving no unaccounted positions between the inserted sequence and the repeat regions (figure 6c). This suggests that SiL-E1 is unlikely to be a result of illegitimate recombination.
Analysis of the protospacers corresponding to the spacers listed in the electronic supplementary material, table S15 allows identification of the PAM (protospacer adjacent motif) sequence. These sequences, located close to the protospacers, are crucial for two essential steps of CRISPR-based immunity: adaptation [84] and interference [37,85,86]. They also play an important role in the mechanism of target discrimination that prevents the recognition of chromosomal spacers as valid targets [87].
For the LAL14/1 CRISPRs of family I, we found the same PAM motif, CC, at positions (−3, −2) at the 5′ end as was already described by Gudbergsdottir et al. [88] (figure 7a). No specific PAM motif was detected at the 3′ end of these protospacers (figure 7b). Little is known about the protospacers corresponding to the CRISPRs of family III. Our data indicate the existence of a conserved motif, [T/A]GT, occupying positions (−4, −3, −2) at the 5′ end of the protospacer (figure 7c). The 3′ end of these protospacers is very rich in A/T nucleotides (figure 7d).
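The logic of matching spacers to protospacers and then inspecting the 5′ flank can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline used in this study: the sequences are invented, only substitutions (no indels) are tolerated, only one strand is scanned, and the function names are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_protospacers(spacer: str, genome: str, max_mismatches: int = 0):
    """Yield (start, mismatches) for each genome window that matches the
    spacer with at most max_mismatches substitutions (no indels)."""
    n = len(spacer)
    for start in range(len(genome) - n + 1):
        mm = hamming(spacer, genome[start:start + n])
        if mm <= max_mismatches:
            yield start, mm


def upstream_flank(genome: str, start: int, first: int, last: int) -> str:
    """Return the bases at positions first..last (negative, inclusive)
    relative to the protospacer start; position -1 is the base immediately
    5' of the protospacer. Returns '' if the flank runs off the sequence."""
    lo, hi = start + first, start + last + 1
    return genome[lo:hi] if lo >= 0 else ""


# Invented toy data: one perfect and one single-mismatch protospacer,
# each preceded by CC at positions (-3, -2).
spacer = "GATTACAGGC"
genome = "TTCCAGATTACAGGCAAATTCCTGATTACTGGCTTT"

hits = list(find_protospacers(spacer, genome, max_mismatches=1))
pam_counts = Counter(upstream_flank(genome, s, -3, -2) for s, _ in hits)

print(hits)
print(pam_counts)
```

A real analysis would also scan the reverse complement of each target genome and would summarize the flanks as a position-specific frequency table rather than a single dinucleotide count.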

Development of a genetic model
Sulfolobus islandicus LAL14/1 is a promising model for studying virus–host interaction in Archaea. LAL14/1 cells can be infected by SIRV2, a model rod-shaped virus [89] that codes for a unique mechanism of virion release: pyramidal structures form on the host cell surface, breaking the S-layer and allowing the virions to escape from the cells [11][12][13][90]. Investigations of SIRV2 cycle regulation and SIRV2–host interaction will require genetic tools for strain LAL14/1, as no available genetic models of Sulfolobus can be infected by SIRV2.
One of the most commonly used genetic markers in Archaea is the pyrEF operon. Pyrimidine prototrophs (Pyr+) can be easily selected on minimal medium without uracil after transformation of a pyrEF-deficient host strain with a plasmid or viral vector carrying the wild-type pyrEF operon [19,91-94]. A non-reverting spontaneous pyrEF-deletion mutant of S. islandicus REY15A is widely used as a host for genetic manipulations [20]. No such mutant of S. islandicus LAL14/1 was available, so we constructed a ΔpyrEF mutant via an allelic replacement approach.
To generate a pyrEF disruption mutant, a knockout cassette containing the ΔpyrEF allele from S. islandicus REY15A, strain E233S [19] and 1 kb regions situated downstream and upstream from pyrEF was obtained by PCR amplification (see §3). The S. islandicus LAL14/1 cells were transformed with this linear DNA fragment of 2233 bp, and the ΔpyrEF mutants resulting from the replacement of the wild-type copy of the pyrEF operon on the host chromosome by a double crossover were selected on 5′-FOA (5′-fluoroorotic acid). Twenty transformants were selected and the pyrEF operon was analysed by PCR and sequencing; 15 of the analysed colonies carried the expected ΔpyrEF deletion. This mutant strain, called S. islandicus LAL14/1-CD, showed the same virus resistance/sensitivity phenotype as the parental strain (data not shown). We confirmed that this mutant is indeed derived from strain LAL14/1 by sequencing the gene coding for the A subunit of the cytochrome b558/566 (Sil_2350; the sequence of this gene is not identical in REY15A, HVE10/4 and LAL14/1). Sulfolobus islandicus LAL14/1-CD could be efficiently transformed with pHZ2 (pRN2 replicon) [19], which replicates autonomously in S. islandicus and carries a wild-type copy of the pyrEF operon. The transformation efficiency was 10^2 to 10^3 colonies per µg of DNA.
A powerful genetic pop-in/pop-out approach was previously developed for another genetic model, S. islandicus REY15A [19]. It allows rapid and efficient creation of knockout mutants. To show that this approach is efficient in LAL14/1, we chose to delete one of the CRISPR loci (CRISPR_1), because CRISPR-coded functions are usually not essential for the cells in the absence of viruses and their deletion mutants are expected to be viable. Also, one of the spacers of CRISPR_1 perfectly matches the SIRV1 virus, to which LAL14/1 is resistant. If the resistance is linked to CRISPR activity, inactivating the locus could decrease the level of resistance, giving this mutant a detectable phenotype.
The recombinant plasmid used to inactivate the CRISPR_1 and the positions of the regions IN (867 bp), OUT (919 bp) and TARGET (776 bp) in the vector pSEF described by Deng et al. [19] are indicated in figure 8.
Two successive rounds of recombination (figure 8) deleted the chromosomal fragment situated between the IN and OUT regions, producing a pyrEF+ derivative from which the CRISPR_1/cas region has been deleted (ΔCRISPR_1/Δcsx1Δcas4; csx1 is annotated as Sil_0393). The deletion was confirmed by PCR analysis (data not shown). Interestingly, ΔCRISPR_1/Δcsx1Δcas4 was as resistant to infection by SIRV1 as S. islandicus LAL14/1-CD. The presence of a second spacer and functional CRISPR_3 may explain this result.
The observed efficient transformation of LAL14/1, as well as the ease of creation and selection of its deletion mutants, indicates that LAL14/1 represents an excellent genetic model.

Discussion
Sulfolobus islandicus LAL14/1 is a promising model for studies on virus–host interaction and CRISPR/cas-based acquired immunity in hyperthermophilic Archaea. We report an extensive comparative in silico analysis of its genome and establish this strain as a genetic model.
The genome of LAL14/1 is the 10th of the species S. islandicus to be sequenced [9], and the third of an S. islandicus strain isolated in Iceland [8]. Strain LAL14/1 has the smallest known S. islandicus genome: it has only 2601 genes carried by a 2.47 Mb chromosome. With other sequenced S. islandicus strains LAL14/1 shares the same major groups of paralogous genes, most of which are transposases of various families (table 4). Its genome also encodes a large number of diverse ABC transporters, including the oligopeptide transporter. This is consistent with the high frequency of isolation of S. islandicus strains from enrichment cultures using rich organic media [7]. Indeed, S. islandicus LAL14/1 grows heterotrophically on standard laboratory liquid media with yeast extract as main carbon source.
The S. islandicus pan-genome contains 570 singletons representing a strain-specific set of proteins; 65 are found only in S. islandicus LAL14/1. As has been widely documented for other prokaryotic virus/host models (see review [95]), it is possible that some of the particular features of LAL14/1, for example, virus–host range, could be linked to the presence of specific genes or gene repertoires absent from other S. islandicus strains. A large proportion of the strain-specific genes map in a large variable region. In each of the S. islandicus genomes analysed, this region covers more than 25 per cent of the chromosome length and contains all CRISPR sequences and most of the transposons that promote horizontal gene transfer (figure 4).
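Once proteins have been grouped into families, identifying strain-specific singletons is a set operation. The sketch below assumes that family assignment has already been done upstream (for example, by homology clustering); the family identifiers and strain contents are invented for illustration and do not reproduce the data of this study.

```python
from collections import defaultdict


def strain_specific(families: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each strain to the protein families found in that strain only."""
    owners = defaultdict(set)
    for strain, fams in families.items():
        for fam in fams:
            owners[fam].add(strain)

    singletons = {strain: set() for strain in families}
    for fam, strains in owners.items():
        if len(strains) == 1:
            (only,) = strains  # unpack the single owning strain
            singletons[only].add(fam)
    return singletons


# Hypothetical presence data: family IDs f1..f4 are placeholders.
toy = {
    "LAL14/1": {"f1", "f2", "f3"},
    "HVE10/4": {"f1", "f4"},
    "REY15A": {"f1", "f2"},
}

singles = strain_specific(toy)
core = set.intersection(*toy.values())  # families shared by every strain

print(singles)
print(core)
```

Applied to real family assignments, the per-strain set sizes would correspond to counts such as the 65 LAL14/1-specific proteins, and their sum over all strains to the pan-genome singleton total.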
Our in silico analysis of the relics of mobile elements in the genomes of the three closely related S. islandicus strains confirms that mobile genetic elements preferentially integrate into tRNA genes. Nevertheless, searches for CAG revealed several previously undescribed long DNA segments of heterologous origin that were located in loci other than tRNA genes [49]. These clusters, most probably the consequences of genetic transfer via conjugative plasmids or other mobile genetic elements, contain 1.6 per cent of the genes in LAL14/1. Biological functions can be predicted for only a small fraction of these genes. For example, the CAGs carry several copies of vapBC genes of toxin/antitoxin systems, some CRISPR-related genes and genes coding for methyl- and glycosyltransferases.
A comparative analysis of the genome structure and composition of three closely related strains (HVE10/4, REY15A and LAL14/1) confirms that they have a very similar genomic pattern with a strong conservation of synteny. Nevertheless, the presence in each of these genomes of multiple local rearrangements raised the issue of the stability of the LAL14/1 genome. All three strains carry many copies of IS elements of various families. The transposition of an IS or transposon is an important source of genome instability and rearrangements in any cell [96]. Such instability is well documented in the case of REY15A, for which a relatively high incidence of pyrEF-deletion mutants (one in 50 analysed PyrEF-deficient colonies) is observed [7,19]. Sulfolobus islandicus LAL14/1 seems to be genetically more stable, as no deletions in the pyrEF locus were detected by PCR analysis of 100 colonies of spontaneous LAL14/1 pyrEF mutants resistant to FOA (C. Jaubert, C. Danioux, G. Sezonov 2013, unpublished data). This could be due to a lower transposition activity in LAL14/1 and consequently a lower frequency of genome rearrangements.
Thus, in vivo and in silico indications concerning the stability of the genome of strain LAL14/1 suggest that it would be a useful model for genetic studies. Strain LAL14/1 is the host of the model rudivirus SIRV2 [10,89,97] and has been used to study virus–host interactions in Archaea [11][12][13]. The availability of the sequence of the LAL14/1 genome makes global genomic analysis of the interaction between viral and host genomes during the infection cycle possible. A genetic approach would facilitate investigations of the role of particular host genes involved in this interaction, as well as the host immune response dependent on the activity of CRISPRs. We successfully inactivated two genetic loci in LAL14/1 of different sizes, pyrEF (2.2 kb) and CRISPR_1/cas (6.6 kb), using allelic replacement and in-frame markerless genetic exchange approaches. We thereby demonstrated the potential of LAL14/1 for genetic experimentation.
Figure 8. Genetic map of the CRISPR_1 region deleted by the pop-in/pop-out approach. (a) Genetic map of the CRISPR_1 region. The deleted region is situated between the regions OUT and TARGET. (b) A scheme representing two stages of recombination events generating the mutant ΔCRISPR_1Δcsx1**Δcas4ΔpyrEF. Two paralogues of csx1 are present in this region. The gene csx1* corresponds to the gene Sil_0392 and csx1** to Sil_0393.
More than 90 per cent of the archaeal genomes analysed code for an adaptive immunity system called the CRISPR/cas system [98][99][100]. In LAL14/1, CRISPRs are represented by five loci associated with three cas (subtype I-A) and two cmr (subtype III-B) gene cassettes. The presence in the CRISPR arrays of LAL14/1 of spacers matching extrachromosomal elements (SIRV1 virus, pARN3) makes this strain an interesting model for studying the biological and functional role of CRISPRs in Archaea. The presence of two spacers perfectly matching the genome of SIRV1 virus allows speculation about a connexion between the CRISPR composition and the SIRV1-resistance phenotype of S. islandicus LAL14/1. Deletion of CRISPR_1, which carries one of the two SIRV1-specific spacers, did not change the phenotype of the resulting mutant; it remained as resistant to SIRV1 as the initial strain. Construction of a double mutant ΔCRISPR_1 ΔCRISPR_5 will help to better characterize the possible involvement of CRISPR-generated immunity in the resistance of LAL14/1 to SIRV1.
Recent publications report spacer acquisition under laboratory conditions in several bacterial [84][101][102][103][104] and archaeal (S. solfataricus) [105] models. However, analysis of the CRISPR content of 12 independent SIRV2-resistant mutants of LAL14/1 did not detect any new insertions in the CRISPR sequences (data not shown). Consequently, the acquired resistance does not appear to be related to CRISPRs and presumably involves a different mechanism of resistance [106].
LAL14/1 also carries a unique pseudo-spacer, SiL-E1, which represents an insertion of an approximately 2 kb region from a pING1-like plasmid into the CRISPR_2 array. To our knowledge, such large spacers have not been previously described in archaeal or bacterial CRISPR loci. The acquisition mechanism of this pseudo-spacer as well as its role in LAL14/1 immunity against conjugative plasmids is unclear. However, the fact that SiL-E1 is flanked by perfect repeats of CRISPR_2 (figure 6c) argues against the possibility of a random integration event. Plausible acquisition scenarios include faulty protospacer processing by the Cas machinery or homologous recombination between the episomal plasmid and the pre-existing CRISPR_2 spacer(s) matching the plasmid. Future studies should provide important additional information regarding spacer acquisition mechanisms and reveal whether such atypical spacers are competent in conferring immunity against mobile genetic elements in Archaea.
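To illustrate how an insertion such as SiL-E1 stands out from ordinary spacers, the following sketch splits a repeat-spacer array on exact copies of its repeat and flags unusually long intervening sequences. All sequences are invented, the repeat is assumed to be perfectly conserved, and the length threshold `max_len` is an arbitrary illustrative parameter rather than a value taken from this study.

```python
def spacers_between_repeats(array_seq: str, repeat: str) -> list[str]:
    """Split a repeat-spacer array on exact copies of the repeat and return
    the sequences lying between consecutive repeats."""
    parts = array_seq.split(repeat)
    # parts[0] precedes the first repeat and parts[-1] follows the last one;
    # only the inner pieces are flanked by a repeat on both sides.
    return parts[1:-1]


def flag_long_spacers(spacers: list[str], max_len: int = 50) -> list[tuple[int, int]]:
    """Return (index, length) for every spacer longer than max_len."""
    return [(i, len(s)) for i, s in enumerate(spacers) if len(s) > max_len]


# Toy array: two ordinary 40 nt spacers and one 2000 nt pseudo-spacer.
repeat = "GGGCCC"                      # placeholder, not a real CRISPR repeat
ordinary_1 = "ATAT" * 10
pseudo_spacer = "TA" * 1000
ordinary_2 = "ACGT" * 10
array_seq = (
    "AAAA" + repeat + ordinary_1 + repeat + pseudo_spacer
    + repeat + ordinary_2 + repeat + "TTTT"
)

spacers = spacers_between_repeats(array_seq, repeat)
flagged = flag_long_spacers(spacers)

print([len(s) for s in spacers])
print(flagged)
```

In practice, CRISPR repeats can carry point mutations, so a tolerant repeat search would be needed before applying a length filter of this kind.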
Physical isolation of the geothermal hot springs in which S. islandicus thrives makes this species a very valuable model to study microbial speciation and evolution [107]. Such studies were recently conducted on S. islandicus strains isolated from hot springs located in Russia ('M') and the USA ('Y/N') [9,33], and have already provided important insights into the population dynamics of hyperthermophilic Archaea. Our in-depth comparative genomics analysis clearly indicates divergence of the 'Icelandic trinity' (LAL14/1, HVE10/4 and REY15A) from the groups 'M' and 'L/Y', supporting previously suggested biogeographical patterns of differentiation of S. islandicus species. Genetic tools developed in this study and those available for REY15A will help to experimentally tackle questions regarding the evolution and divergence of these Icelandic strains and compare the elucidated patterns with those available for S. islandicus strains isolated from other continents.