Ocean Viruses: On Their Abundance, Diversity, and Target Hosts
Abstract and Keywords
This chapter introduces the theoretical and modeling approaches necessary to estimate (i) viral abundance, (ii) viral diversity, and (iii) virus–host interactions. Viruses are extremely abundant in the oceans, with estimates of virus-like particle densities ranging from approximately $10^4$ to $10^8$/ml. Virus abundance is estimated to be at its highest in coastal environments, during blooms, and in sediments. Viral diversity remains elusive. Those features of viral diversity that are estimable include Shannon and Simpson diversity, and should be utilized instead of attempting to estimate the total number of virus “species” in the community based on measurements of a small subsample. Viral diversity includes genotypic, genetic, and functional diversity. Individual viruses infect more than one host type, and individual hosts are infected by more than one virus. The cross-infection networks in natural systems include evidence of specialization, as measured by modularity, and hierarchical order, as measured by nestedness.
6.1 Ways of Seeing
Imagine walking through a forest habitat, albeit with a purpose: to estimate how many birds there are, how many distinct species of birds are present, and, finally, the population-level growth and death rates of each type of bird. Easy, no? Certainly, such a census requires time, expertise, and repeated surveys. There will be challenges, for example, if a rare bird is neither seen nor heard during one or repeated surveys. Or, even if a bird is common, it may return to a survey site only seasonally. Finally, there may be common birds that are, nonetheless, hard to detect. But given sufficient resources, a survey could provide informative data on the diversity, distribution, and dynamics of bird species. In fact, academic and lay birders routinely conduct such surveys for years. These surveys are so data rich that it is now possible to assess global-scale patterns in bird diversity, ecology, and aspects of evolutionary change (Jetz et al. 2007). Now, imagine a forest habitat in which all the robins, perhaps half of the thrushes, and a sprinkling of warblers suddenly became invisible. Moreover, imagine that 90% of the trees and shrubs that the birds lived in and among also became invisible. It is true that such decreases, of varying scales, are taking place owing to habitat loss and other factors. However, the scenario you are being asked to imagine is one in which the birds, trees, and shrubs remain, but they cannot be detected. The situation sounds far-fetched, but in such a place what would you think you knew about the distribution, diversity, and population dynamics of the biota that turned out to be incomplete, misdirected, or perhaps just flat-out wrong?
This vision of a complex, diverse world that is nonetheless largely invisible to us is an apt reminder of the challenges we face in trying to grasp, to say nothing of understand, the complexity of the world of microbes and the viruses that infect them. It is this world, or at least this worldview, that predominated until the late 1980s. Bacteriophage and other viruses of microbes were used as the basis for spectacular advances in molecular biology and were themselves the subject of laboratory experiments in population and evolutionary biology. Yet, viruses remained a rather microscopic sidelight from an ecological or environmental science perspective. The state of the field was summarized by Bergh et al. (1989):
The concentration of bacteriophages in natural unpolluted waters is in general believed to be low, and they have therefore been considered ecologically unimportant.
This belief was informed, in part, by the tools for measuring viruses available to that point. The plaque assay was, at the time, the standard method of counting the number of environmental viruses. As explained in Chapter 2, the plaque assay is a culture-based method. It remains the gold standard for counting virus particles, given a host-virus pair in culture. Given an initial environmental sample from which cells have been removed, the total number of plaques can be used to estimate the number of viruses able to infect, propagate, and lyse a target host or hosts. Because viruses are known to differ in the hosts they infect, if the wrong host is used or conditions are not ideal for virus propagation, then such methods may lead to significant errors in virus count estimates. Moreover, and perhaps more fundamentally, it is now widely recognized that the majority of hosts cannot yet be cultured. James Staley and Allan Konopka proposed the term “The Great Plate Count Anomaly” to consolidate various lines of evidence that the estimated number of bacteria from colony-forming unit assays was two orders of magnitude smaller than the inferred number of bacteria from production, division, and microscopy counts (Staley and Konopka 1985). It is hard to count viruses if one cannot even count their hosts.
To better estimate the population size of naturally occurring viruses, Bergh and colleagues Knut Børsheim, Gunnar Bratbak, and Mikal Heldal proposed a culture-independent counting method. This method involves centrifuging a sample; staining the water-free pellet with uranyl acetate, which binds to many proteins and lipids as well as to nucleic acids; and then viewing the sample at high magnification using transmission electron microscopy. The results can be seen in Figure 1 of Bergh et al. (1989). In retrospect, the micrograph hardly seems impressive. It is apparent that the sample contains many particles that are on the order of 1 µm in size and many smaller, circular particles on the order of 50 nm in size. The large particles are likely bacteria. What was truly innovative in this work was the hypothesis that the small particles were viruses, formally termed virus-like particles. The team observed virus-like particles with diameters ranging from less than 50 nm to upward of 100 nm. They estimated total virus densities in excess of $2.5 \times 10^8$/ml. In the authors’ own words: these “viral counts are $10^3$–$10^7$ times higher than previous reports on virus numbers in natural aquatic environments, which are based on counts of plaque-forming units of various bacteria.” Subsequent studies supported these counts based on quantification with epifluorescence microscopy (Noble and Fuhrman 1998). In the words of Duncan Watts, “Everything is obvious, once you know the answer” (Watts 2012). The expression is rather apt in this case: the finding was discoverable for 50 years once the electron microscope was developed, if others had only thought to look!
This discovery facilitated rapid advances in the study of viral ecology. Knowing that viruses are abundant leads to many related questions: for example, how do viruses maintain such large total populations? Whom do they infect? How diverse are they? How important are they in modifying not only the fate of individual hosts but also the flux of nutrients and other small molecules in the environment? These questions are, in some sense, universal. They could be asked in the context of the surface, deep, or coastal oceans; in soils; or in microbiomes. Here, I will focus on aquatic ecosystems, and in particular, surface ocean waters, to address what theory and quantitative analysis can contribute to answering these questions. This chapter introduces the theoretical and modeling approaches necessary to estimate (i) viral abundance, (ii) viral diversity, and (iii) virus-host interactions. Then, Chapter 7 introduces a series of dynamic models for exploring the mechanisms underlying these emergent features of complex ecosystems.
6.2 Counting Viruses in the Environment
Modern methods for estimating the number of viruses in an environmental sample include electron microscopy, epifluorescence microscopy, and flow cytometry. The Manual of Aquatic Virology (MAVE) (Wilhelm et al. 2010) provides detailed procedures for implementing each of these methods, for example, in Ackermann and Heldal (2010) (for electron microscopy), Suttle and Fuhrman (2010) (for epifluorescence microscopy), and Brussaard et al. (2010) (for flow cytometry). All the methods share a common set of principles: the sample is physically and chemically prepared, then measured in the device, and each detected particle is then labeled as a virus or not based on its physical features (e.g., estimated diameter) and/or its interactions with the apparatus (e.g., degree of scattering). Despite the many protocols for estimating virus abundance, the classification step is usually researcher dependent; for example, “Viruses can generally be distinguished from cells by their staining characteristics. Viruses appear as bright pinpricks of light, whereas cells generally have discernable size.” (Suttle and Fuhrman 2010). This leads to a challenge: how does one count small things with big, clumsy hands?1
The intuition and training of researchers has been essential to distinguishing virus-like particles from non-virus-like particles. Indeed, detailed comparisons of protocols with gold-standard approaches, including plaque assays and transmission electron microscopy, show that such classifications can lead to consistent estimates (Wilhelm et al. 2010), but standardization of the classification step is both warranted and feasible. Despite prior successes, the sample-to-abundance pipeline is hardly perfect. There are many new environments (e.g., the human microbiome (Minot et al. 2013; Barr et al. 2013)) in which the roles of the viruses of microbes are now being studied and for which new measuring protocols need to be developed. Finally, it is no longer self-evident that viruses should appear as pinpricks and that bacteria should appear with a discernible size. Giant viruses having discernible size infect amoeba and protists. Examples include the Chlorella virus, mimivirus, Mamavirus, and Leviathan, all of which exceed 300 nm in diameter, with the largest estimated virus reaching 400 nm in diameter (Xiao et al. 2005). Similarly, marine bacteria are often much smaller than routinely used lab strains. For example, individual cyanobacteria from the clades Prochlorococcus and Synechococcus are approximately 400 nm in diameter (Chisholm 1992). Likewise, the ubiquitous Pelagibacter clade includes individuals approximately 500 nm in diameter (Rappé and Giovannoni 2003). Furthermore, a ubiquitous actinobacter (representing 5% of the total surface microbe population in seasonal measurements) is estimated to be 300 nm in diameter (Ghai et al. 2013). Although viruses tend to be smaller than bacteria, this is not universally the case.
Improvements in sample preparation and imaging protocols also represent opportunities for theoreticians, and in particular imaging scientists, to improve sample measurement protocols. Appendix D.1 includes one such algorithmic protocol for standardizing the virus-particle estimation component of the sample-abundance pipeline as estimated via epifluorescence microscopy. It is not the only such protocol, but it is easily reproducible. The objective of the algorithm is to formalize the intuition of the researcher who must distinguish the signal due to virus-like particles from background noise. The noise includes background scattering and/or fluorescence associated with the sample preparation, and signal coming from other organisms in the sample, such as bacteria, diatoms, or zooplankton.
Figure 6.1 illustrates the results of this protocol. The original grayscale image was segmented, in this case with a threshold intensity value determined by Otsu’s algorithm (Otsu 1979). The segmented image is further labeled according to whether the clusters appear “virus-like,” in this case whether their effective diameter is smaller than 250 nm, and the eccentricity of the best-fitting ellipse of the cluster is less than 0.9. In this way, the computation of the number of virus-like particles (VLPs) can be completely automated given a sample image. Posterior estimates will depend on parameters, for example, whether a particle of size 50 nm or 150 nm is considered a virus-like particle. Selecting appropriate cutoffs is essential to improving estimates of the total abundances of both viruses and small bacteria. The last panel of Figure 6.1 illustrates the dramatic change in the number of virus-like particles with cutoff parameters. As is apparent, the distribution of sizes in environmental samples is not always bimodal. The variation in size cutoffs extends beyond what is typically considered for the identification of a VLP. Nonetheless, the point here is that a demarcation between pinpricks and cells of discernible size can be hard to establish. This area is underexplored, both from a methodological and environmental science perspective.
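The two steps described above (an intensity threshold followed by a size-and-shape filter) can be sketched in a few lines. This is only an illustrative sketch, not the Appendix D.1 protocol: the function names, the flat list of 8-bit pixel intensities, and the `(diameter_nm, eccentricity)` cluster summaries are assumptions made for the example, and the cluster-finding step between thresholding and classification is omitted. Only the cutoffs (250 nm and 0.9) come from the text.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity t that maximizes between-class variance (Otsu 1979).

    `pixels` is a flat iterable of integer intensities in [0, levels).
    Pixels with intensity > t are treated as foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    w_back, sum_back = 0, 0
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_back += hist[t]
        if w_back == 0:
            continue
        w_fore = total - w_back
        if w_fore == 0:
            break
        sum_back += t * hist[t]
        mean_back = sum_back / w_back
        mean_fore = (sum_all - sum_back) / w_fore
        between_var = w_back * w_fore * (mean_back - mean_fore) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def is_virus_like(diameter_nm, eccentricity,
                  max_diameter_nm=250.0, max_eccentricity=0.9):
    """Apply the size and shape cutoffs quoted in the text to one cluster."""
    return diameter_nm < max_diameter_nm and eccentricity < max_eccentricity


def count_vlps(clusters, **cutoffs):
    """Count clusters, given as (diameter_nm, eccentricity), that pass the cutoffs."""
    return sum(is_virus_like(d, e, **cutoffs) for d, e in clusters)


# Hypothetical cluster summaries, for illustration only.
clusters = [(60, 0.3), (120, 0.5), (240, 0.95), (400, 0.2), (180, 0.85)]
n_default = count_vlps(clusters)                       # 250 nm cutoff
n_strict = count_vlps(clusters, max_diameter_nm=150)   # tighter size cutoff
```

Changing `max_diameter_nm` changes the count even for this tiny hypothetical sample, which is the sensitivity to cutoff parameters that the last panel of Figure 6.1 is said to illustrate.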
6.2.1 Variation in the abundance of marine viruses
The study of marine viruses and their effects on microbial communities and ocean ecosystems has increased significantly since Bergh and colleagues’ finding of elevated marine virus abundances in aquatic ecosystems (Bergh et al. 1989). As but one indication, the total number of papers whose topics include either “ocean viruses” or “marine viruses” (according to Web of Science) grew from 3 in 1989 to more than 300 in 2013. The number of citations to such papers grew from approximately 20 in 1990 to more than 12,000 in 2013. Many of these studies include the direct enumeration of total virus abundance. It is now evident that viruses are, in fact, one of the most abundant, if not the most abundant, biological entity in the surface ocean (Suttle 2005). A recent compilation of published and unpublished data from both surface ocean waters and marine sediments revealed significant variation in virus densities, from $10^4$/ml to more than $10^8$/ml (Figures 6.2A and B). What underlies the four orders of magnitude variation in marine virus abundance?
A comprehensive answer to such a question remains elusive. Indeed, any comprehensive answer must come with significant caveats. In particular, there are currently a few thousand direct measurements of virus abundances. Each measurement reflects significant planning, expertise, and (often) expense. The vast majority of these measurements were taken aboard oceanographic research vessels, on which virus ecology is usually a small component of a large suite of activities. It is an understatement to say that current measurements under-sample the $3.6 \times 10^8$ km$^2$ surface area of the global oceans, to say nothing of the ocean depths. Nonetheless, available measurements provide preliminary evidence for emergent patterns.
The fitted regressions between virus and prokaryotic abundance are $\log_{10} V = 1.03 \log_{10} N + 0.660$ for the water column ($n = 631$, $R^2 = 0.698$, $P < 0.001$) and $\log_{10} V = 0.761 \log_{10} N + 2.35$ for the sediment ($n = 305$, $R^2 = 0.641$, $P < 0.001$), where $V$ and $N$ are virus and prokaryotic density, respectively. For those unfamiliar with virus-host models, such a correlation would appear to be expected, since the greater availability of bacteria and archaea would seem to imply more targets for viruses. Yet, viruses are thought to control, at least in part, the number of prokaryotes in a given community. In fact, the simplest model of virus-host dynamics leads to the conclusion that increases in resource supply lead to increases in virus, but not host, density. An alternative outcome may be that virus abundance is positively correlated with resource supply but uncorrelated with prokaryotic abundance. In fact, virus abundance is correlated with the amount of chlorophyll, a proxy for the productivity of the environment (Figure 6.2B). However, bacterial abundances are also correlated with chlorophyll and other indexes of ecosystem productivity (Williams and Follows 2011). These patterns are a key target for ecosystem-level studies and mechanistic models that include not only virus-host interactions but also the interactions between prokaryotes and other predators, like zooplankton. In explaining patterns, it may be necessary to consider the effects of specialization and generalism within populations, for example, as in analyses of zooplankton-phytoplankton biomass patterns along productivity gradients (Leibold 1989).
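To see what the two fitted relationships imply, they can be evaluated directly. The sketch below uses only the slopes and intercepts quoted above; the function name and the example prokaryotic densities are assumptions for illustration, and the density units are whatever units the original fits used, which this passage does not state.

```python
import math

# (slope, intercept) of log10 V = slope * log10 N + intercept, as quoted in the text.
WATER_COLUMN = (1.03, 0.660)
SEDIMENT = (0.761, 2.35)


def predicted_virus_density(n_prokaryotes, slope, intercept):
    """Evaluate the log-log fit at a given prokaryotic density."""
    return 10 ** (slope * math.log10(n_prokaryotes) + intercept)


def virus_to_prokaryote_ratio(n_prokaryotes, slope, intercept):
    """Predicted V/N at a given prokaryotic density."""
    return predicted_virus_density(n_prokaryotes, slope, intercept) / n_prokaryotes


# Hypothetical prokaryotic densities spanning two orders of magnitude.
for n in (1e5, 1e6, 1e7):
    ratio_water = virus_to_prokaryote_ratio(n, *WATER_COLUMN)
    ratio_sediment = virus_to_prokaryote_ratio(n, *SEDIMENT)
    print(f"N = {n:.0e}: V/N water = {ratio_water:.1f}, sediment = {ratio_sediment:.1f}")
```

A slope near 1 (water column) means the predicted virus-to-prokaryote ratio changes only slowly with N, whereas a slope below 1 (sediment) means the predicted ratio falls as prokaryotic density rises.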
Working at global scales can be daunting. Instead, one may ask: what determines the site-specific variability in virus abundances? A few case studies will help illustrate the different types of regimes to be expected in the surface oceans. A major area of oceanographic research focuses on oligotrophic gyres. Oligotrophic is a term used to describe the majority of “blue ocean” environments, that is, with relatively low inorganic nutrients, for example, <0.5 µg/L of chlorophyll a in the surface. The term gyre refers to a rotating, circulating pattern in ocean currents, such as the North Pacific Ocean subtropical gyre. This type of oceanic region has been of interest for decades and for various reasons. For example, it is there that C. D. Keeling initiated the now-renowned Mauna Loa time series showing rises in atmospheric CO$_2$ over the past 50 years (Keeling et al. 1995; Williams and Follows 2011). A few researchers, including Grieg Steward, Alex Culley, and Jennifer Brum, have measured virus abundances in this gyre in the top 100 m of surface waters. They found total virus abundances on the order of $5 \times 10^9$/L (Culley and Welschmeyer 2002) and $10^{10}$/L (Brum 2005). Simultaneous measurements of bacterial densities were approximately one order of magnitude smaller.
Not all of the ocean is oligotrophic. Coastal waters receive significant organic and inorganic nutrient influx owing to river runoff, and upwelling events from the deep ocean are often associated with “bloom” events, in which the population abundances of plankton rise rapidly. Both coastal waters and bloom events are known to have elevated levels of viruses. For example, estimates of virus abundance exceeded $10^{11}$/L in surface water measurements during a spring bloom in the Southern Pacific Ocean near New Zealand (Matteson et al. 2012; Strzepek et al. 2005). The total chlorophyll a concentration was 2.48 µg/L, nearly five times larger than at the Hawaii Ocean Time-series (HOT) site. And, unlike at HOT, bacteria abundances peaked at $1.5 \times 10^9$/L; that is, Wilhelm and collaborators observed virus to bacterium ratios exceeding 100:1 (Matteson et al. 2012). As is evident, distinct oceanic realms have significant differences in host and virus densities and in the emergent relationships between them.
These prior estimates also enable rough estimates of the total marine virus abundance in the oceans. Such estimates are admittedly coarse. For example, assuming virus abundances of $3 \times 10^9$/L in 70% of the surface oceans and $1 \times 10^{10}$/L in the remainder suggests an average surface ocean concentration of $5 \times 10^9$/L. If such averages are characteristic of the top 100 m of the ocean, then there should be $2 \times 10^{29}$ marine viruses in the surface oceans. This estimate neglects viruses deeper in the water column and in marine sediments (Danovaro et al. 2011).
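The arithmetic behind this back-of-the-envelope estimate can be reproduced as follows. The densities, area fractions, and 100 m depth come from the paragraph above, and the $3.6 \times 10^8$ km$^2$ surface area is the value given earlier in this section; the variable and function names are illustrative only.

```python
OCEAN_SURFACE_AREA_KM2 = 3.6e8   # global ocean surface area, from the text
LITERS_PER_KM3 = 1e12            # 1 km^3 = 1e9 m^3 = 1e12 L


def mean_surface_density(low=3e9, high=1e10, fraction_low=0.7):
    """Area-weighted mean virus density (per liter) across two regimes."""
    return fraction_low * low + (1.0 - fraction_low) * high


def total_surface_viruses(depth_km=0.1):
    """Total virus count in a surface layer of the given depth (default 100 m)."""
    volume_liters = OCEAN_SURFACE_AREA_KM2 * depth_km * LITERS_PER_KM3
    return mean_surface_density() * volume_liters


print(f"mean density : {mean_surface_density():.2e} per L")
print(f"total viruses: {total_surface_viruses():.2e}")
```

The weighted mean comes to about $5.1 \times 10^9$/L and the total to about $1.8 \times 10^{29}$, which round to the figures quoted in the text.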
The abundance of viral genomes inside lysogens or infected cells is also poorly quantified. The fraction of lysogenic cells can be estimated in a culture-independent fashion. The general technique is to use an “inducing” agent, for example, a chemical like mitomycin C or direct UV exposure, which has been shown in many cultured systems to induce active virus propagation within lysogens (Paul and Weinbauer 2010). For example, in a study of marine surface waters in the Arctic Ocean, Jerome Payet and Curtis Suttle found that 5%–40% of microbial cells were inducible (Payet and Suttle 2013). Moreover, the fraction of inducible cells was found to be inversely correlated with free virus abundance, suggesting that total virus counts as measured using microscopy of free viruses will nonetheless underestimate the total number of viral genomes in the environment.
Finally, any discussion of virus abundances would be incomplete without mentioning the important fact that current staining methods that are prerequisites for both epifluorescence microscopy and flow cytometry are not effective with RNA-based viruses (Steward et al. 2012) nor with ssDNA viruses (Holmfeldt et al. 2012). Grieg Steward and colleagues provocatively asked, “Are we missing half the viruses in the ocean?” (Steward et al. 2012). Perhaps. Then again, there are others, including Patrick Forterre, who caution that measurements of virus-like particles do not necessarily guarantee that the particles are viruses (Forterre et al. 2013). For example, they could be gene-transfer agents (GTAs). GTAs are vesicle-bound particles containing nucleic acids, usually those of microbial hosts, that can be released and then taken up by other microbes. They are spherical and often on the order of 30–50 nm in diameter. In 2014, Penny Chisholm’s group reported evidence of widespread vesicle production by cyanobacteria; these vesicles had an average diameter of ∼75 nm (Biller et al. 2014). It is evident that the quantification of marine viruses, despite advances, remains a topic of active interest from both a methodological and scientific perspective.
6.2.2 Elemental reservoirs in marine viruses
Chapter 2 presented a physical scaling model of the elemental content of virus particles (see Jover et al. (2014) for more details). As was explained, virus particles are nutrient rich; that is, they have high proportions of nitrogen and phosphorus compared with their carbon content relative to the stoichiometry of their marine microbial hosts. For example, viruses whose capsid diameters are approximately 60 nm are predicted to have a C:N:P ratio of 24:8:1. Contrast this with the Redfield ratio (106:16:1) for marine phytoplankton (Redfield et al. 1963). Alfred Redfield posited the existence of a ubiquitous elemental ratio in phytoplankton and marine detritus, and this ratio is a convenient departure point for studies even if exceptions are widespread (Martiny et al. 2013). The stoichiometry of individual microbial hosts varies by strain (e.g., heterotrophs are thought to be more nutrient rich than cyanobacteria (Suttle 2007)). Moreover, stoichiometry depends on physiology; for example, cyanobacteria grown in phosphorus-replete conditions have greater cellular proportions of phosphorus than do cells grown in phosphorus-depleted conditions (Bertilsson et al. 2003). The carbon requirements of cellular life reflect its increased need for structural support relative to virus particles. By way of contrast, Gram-positive bacteria have a cell wall that ranges from 20 to 50 nm in thickness and is characterized by a carbon-rich peptidoglycan (Schleifer and Kandler 1972). In summary, marine viruses, despite their small size and relatively low biomass, are nutrient rich. In a collaboration with Steven Wilhelm and Alison Buchan (Jover et al. 2014), we asked, how much carbon, nitrogen, and phosphorus is partitioned in marine virus particles at environmental concentrations? As a corollary, can viruses represent a substantial subcomponent of the dissolved carbon and nutrient content in marine waters?
The marine environment includes both organic and inorganic forms of carbon and nutrients not bound in organisms, that is, small molecules, molecular aggregates, marine snow, and other large pellets (Azam and Long 2001; Weinbauer et al. 2011). Focusing on the organic components, a rough classification of the types of organic matter can be achieved via size separation into small and large particles. Operationally, dissolved organic matter (DOM) is defined as any organic matter that passes through a filter with pore sizes in the range 0.22–0.7 µm. In contrast, particulate organic matter (POM) is defined as any organic matter that remains on the filter. Given a filter with pore size of 0.5 µm, DOM would include organic matter whose cross section was ≤ 0.5 µm. The operational definition of DOM includes nearly all viruses, and probably some bacteria and archaea. Once DOM is isolated from the rest of the sample, further chemical methods can be used to measure relevant atomic constituents, including organic carbon, nitrogen, and phosphorus. Specifically in the case of phosphorus, the chemical methods of accounting (Lomas et al. 2010) are likely to include the phosphorus bound in nucleic acids inside a protein shell of virus particles. This point was raised in Jover et al. (2014). In summary, virus particles are likely to contribute to current assessments of the total dissolved organic phosphorus (DOP) in marine environments, but the specific contribution of viruses to the DOP pool has not been systematically assessed. The total concentration of C, N, or P atoms bound within marine virus particles is a product of per-virus elemental content and virus density.
Therefore, the DOP depends on virus size and on the number of viruses. Virus sizes range from diameters on the order of 30 nm or smaller, like the Leviviridae R17 and MS2 (De Paepe and Taddei 2006), to the large T-even phages with diameters approaching 100 nm (De Paepe and Taddei 2006), and even to “giant” viruses that can exceed 400 nm in diameter (Raoult et al. 2004; Fischer et al. 2010).
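The "per-virus content times density" product is simple to compute once a per-virion phosphorus mass is chosen. The sketch below is illustrative only: the function name is hypothetical, and the example value of 0.01 fg of phosphorus per virion is the hypothetical figure used later in this section, not a measured value for any particular capsid size. The molar mass of phosphorus (about 30.974 g/mol) is a standard constant.

```python
P_MOLAR_MASS_G_PER_MOL = 30.974  # atomic mass of phosphorus


def viral_phosphorus_nM(virions_per_liter, fg_p_per_virion):
    """Concentration of virus-bound phosphorus, in nanomolar.

    concentration = (per-virion P mass) x (virus density), converted
    from femtograms per liter to nanomoles per liter.
    """
    grams_per_liter = virions_per_liter * fg_p_per_virion * 1e-15
    moles_per_liter = grams_per_liter / P_MOLAR_MASS_G_PER_MOL
    return moles_per_liter * 1e9


# Illustrative inputs: 1e10 virions per liter, 0.01 fg P per virion (assumed).
example_nM = viral_phosphorus_nM(1e10, 0.01)
print(f"virus-bound P: {example_nM:.2f} nM")
```

Because the relationship is a plain product, doubling either the virus density or the per-virion content doubles the virus-bound phosphorus concentration.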
What is the typical size of virus capsids in marine surface waters? Jennifer Brum used quantitative analysis of variation in virus morphology and virus size via transmission electron microscopy (qTEM) to answer this question (Brum et al. 2013). In Brum’s qTEM study, surface water samples were surveyed from the Mediterranean Sea, Red Sea, Arabian Sea, Indian Ocean, Atlantic Ocean, and Pacific Ocean. In all these surface water samples, the majority of viruses were nontailed, representing 65% to 80% of the total morphotypes. The remainder comprised myoviruses, podoviruses, and siphoviruses, in decreasing order of representation. Viruses, irrespective of morphotype, had capsid diameters that varied between approximately 20 nm and 200 nm. For tailed viruses, tail lengths of myoviruses were on the order of 150 nm; of siphoviruses, on the order of 210 nm; and of podoviruses, on the order of 15 nm. Despite this variation in type-specific sizes, the bulk of the variation found in marine virus capsid sizes centered in the range of 50 to 70 nm (Figure 6.3).
The next component necessary to estimate elemental reservoirs in virus communities is virus densities. Figure 6.4 presents a range of possible concentrations of dissolved organic carbon (DOC), dissolved organic nitrogen (DON) and DOP bound in virus populations whose densities are $10^8 \leq V \leq 10^{11}$/L with average capsid diameters between 30 and 100 nm. The choice of virus densities reflects current estimates of natural variability in the marine water column (Figure 6.2). DOP bound in viruses is predicted to range from 0.1 nM to 20 nM if attention is restricted to virus sizes suggested by Brum et al. (2013). Recall that virus tails do not contain phosphorus; therefore, the estimates of DOP are robust to variation in virus morphology. Applying a similar series of calculations yields ranges of DON from 0.1 nM to 100 nM and DOC from 0.2 nM to 400 nM. These estimates are underestimates, as they do not include the contributions of virus tails, which contain carbon and nitrogen. A quantitative analysis of elemental contribution suggests that the contribution of tails is on the order of 10% that contained in virus heads (Jover et al. 2014).
How do these ranges of elemental reservoirs compare with the total dissolved elemental pools? Consider DOP first. Estimates of marine DOP in surface waters typically range from 50 to 250 nM (e.g., at oligotrophic sites in the Atlantic (Lomas et al. 2010), Pacific (Fujieki 2014), and Southern Oceans (Ruardij et al. 2005)). Assuming that capsid diameters are 60 nm, then virus particles can represent greater than 5% of the DOP in the marine surface environment whenever the following condition is satisfied:
$$V > 3.5 \times 10^{5} \times \mathrm{DOP},$$
where $V$ is in units of particles/mL and DOP is in units of nM. Viruses are predicted to be a significant fraction of DOP if their total population exceeds $1.75 \times 10^7$/mL when DOP = 50 nM, or $3.5 \times 10^7$/mL when DOP = 100 nM, or $7 \times 10^7$/mL when DOP = 200 nM. A similar analysis reveals that viruses are unlikely to be a significant fraction of either DON or DOC in the marine surface environment.
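The three thresholds quoted above all lie on a single line through the origin, with a virus density of $3.5 \times 10^5$ particles/mL per nM of DOP. The sketch below encodes that linear rule; the coefficient is inferred from the quoted numbers (for 60 nm capsids and a 5% criterion) rather than taken from a stated formula, and the function names are hypothetical.

```python
# Inferred from the thresholds quoted in the text: 1.75e7/mL at 50 nM, etc.
PARTICLES_PER_ML_PER_NM_DOP = 3.5e5


def dop_threshold_per_ml(dop_nM):
    """Virus density (particles/mL) above which viruses exceed 5% of DOP."""
    return PARTICLES_PER_ML_PER_NM_DOP * dop_nM


def viruses_are_significant_dop(viruses_per_ml, dop_nM):
    """True when the virus density exceeds the 5%-of-DOP threshold."""
    return viruses_per_ml > dop_threshold_per_ml(dop_nM)


for dop in (50, 100, 200):
    print(f"DOP = {dop} nM -> threshold = {dop_threshold_per_ml(dop):.2e} per mL")
```

For example, a bloom-level density of $10^8$/mL would pass the criterion across the whole quoted DOP range, whereas $10^7$/mL would not pass it even at 50 nM.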
This finding, at least for DOP, suggests a concomitant need to analyze the size of elemental reservoirs partitioned in viruses, and the role of viruses in modifying the flux of elements through and out of the marine surface. Virus particles may decay owing to UV-induced deactivation, aggregate on or as part of marine snow, or adsorb to hosts. The relative rates of each of these processes in the marine surface warrant investigation. Another possibility is that viruses may be specifically targeted by grazers or nonspecifically taken up by filter-feeding organisms. This hypothesis was first raised by Gonzalez and Suttle (1993). If there is 0.01 fg of phosphorus per virion, then digestion of 100 viruses could provide the necessary phosphorus requirements for division of a nanoflagellate (Jover et al. 2014). The frequency and relevance of virus-targeting grazers remain an open question.
6.3 Estimating Viral Diversity
6.3.1 Diversity metrics
Diversity is an old, old concept. In the beginning, at least according to Genesis, Adam “gave names to all the cattle and to the birds of the sky and to all the wild beasts” (Genesis 2:20). This naming distinguished one type of living organism from another. Scientists have, for years, done the same. The names of old became the taxonomy of the near past, including kingdoms, orders, classes, families, and so on. In the study of microbes, and in particular of viruses, such demarcations are problematic. What distinguishes one bacterial species from the next? What distinguishes one virus species from the next? And do any such designations have biological meaning?
These questions remain highly debated and controversial, and the curious reader will readily find debates on such topics (May 1988; Ward 2002; Fraser et al. 2009; Mora et al. 2011). For now, it is worthwhile to step back and recognize that diversity is a concept meant to summarize the variation among distinguishable types of things. These things may be colors (of hats) or names (of people) or species (of organisms). When all the things in a group are of the same type, then diversity is low; when all of the things in a group are of different types, then diversity is high. How high and how low depend on the mathematical definitions. Viral diversity may be calculated with respect to genotype diversity, gene diversity, or even functional diversity. All such diversities require a metric, that is, a scale by which one arrangement of types can be compared with another.
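Three of the most commonly estimated diversity metrics are Simpson diversity, Shannon diversity, and species richness (Schloss and Handelsman 2005). For a community with $N$ individuals of $S$ types, such that the number of individuals of type $i$ is $n_i$, the richness, Shannon diversity, and Simpson diversity of the sample are defined, respectively, as:
$$D_0 = S, \qquad D_1 = -\sum_{i=1}^{S} \frac{n_i}{N} \log \frac{n_i}{N}, \qquad D_2 = 1 - \sum_{i=1}^{S} \left(\frac{n_i}{N}\right)^2.$$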
These equations may appear cryptic, but they have (relatively) simple interpretations. Simpson diversity is equivalent to the probability that two randomly chosen individuals from the community are of different types. Shannon diversity is equivalent to the entropy of the community composition, which may be intuitive for physicists and computer scientists, and for others, may appear as intuitive as defining an Urdu word with a Finnish word.2 In brief, Shannon diversity is a proxy for the extent to which it is possible to compress a signal containing the equivalent information in the community composition $p = (n_1/N, n_2/N, \ldots, n_S/N)$. When each type appears in equal abundance, then $n_i = N/S$, and the Shannon diversity, $D_1 = \log S$, is maximized, whereas when one type dominates, then $D_1 \to 0$. Finally, richness is, trivially, the number of different types in the sample. The use of subscripts 0, 1, and 2 for these diversity metrics is meant to convey a connection to a generalized family of diversity estimators called Hill diversities (Hill 1973; Haegeman et al. 2013; Chao et al. 2014).
As is apparent, the relative abundances ni of each type have different weights in these diversity metrics. For example, two communities may have different numbers of types and still have the same Simpson diversity. In contrast, richness depends exclusively on the number of types, S, and so one community with one common type and many rare types will appear more “diverse” (according to the richness metric) than another community with a smaller number of types, all relatively uncommon. All too often it is assumed that one metric is as good as another; that is, they are equivalent, at least in terms of their relative ordering. This is not true. Figure 6.5 shows how three different community compositions, each containing the same number of organisms, can each be the most diverse, depending on whether diversity is measured in terms of richness, Shannon diversity, or Simpson diversity. Despite the use of the term “diversity” as a catch-all phrase, the appropriate diversity measure depends, ultimately, on the question. In the study of environmental viruses and microbes, many questions are comparative. For example, is the diversity at one location larger than at another, or does diversity vary with an environmental feature? The new challenge in asking such questions is that samples of viruses and microbes are nearly always a small portion of the community of interest. Therefore, methods are needed to infer the diversity of a community from observations of a sample. As it turns out, inferring the diversity of a large community from a small sample is not as straightforward as it may seem. This is the topic of the next section.
6.3.2 Robust estimation of community diversity from a sample
The diversity of a sample is readily estimable given a protocol for distinguishing between types of individuals within the sample. The same cannot be said of inferring the diversity of a community from a small sample. This problem can be understood by thinking about black swans. Most swans, as even the inattentive observer on a nature hike will affirm, are white.3 A sequence of N measurements of the colors of swans in a sampled flock would, with near 100% certainty, be of the form W, W, W, …, W, where W denotes an observation of a white swan. The measured richness, $\hat{D}_0$, of the sample is 1, and the measured Simpson and Shannon diversities are $\hat{D}_2 = \hat{D}_1 = 0$. In other words, there is only one type of color and no variation in color within the sample. What then is the expected diversity of the entire population of swans? To estimate this diversity requires extrapolating beyond the current set of observations to the entire set of $N_{tot}$ swans. Consider the scenario in which a single one of these $N_{tot}$ swans is black. The true richness of the community is $D_0 = 2$, and the true Simpson diversity is (asymptotically) $D_2 = 2/N_{tot}$, whereas the true Shannon diversity is asymptotically $D_1 = \log(N_{tot})/N_{tot}$. The probability of observing this black swan in a (random) sample of N swans is $1 - (1 - 1/N_{tot})^{N}$, which can be approximated as $N/N_{tot}$ so long as $N \ll N_{tot}$. As the black swan becomes ever rarer, it is harder to detect in a sample but also has a decreasing consequence on the difference between the measured diversity and the actual diversity, so long as attention is restricted to Simpson or Shannon diversity. In contrast, even as the black swan becomes rarer, the disparity between the measured species richness and actual species richness remains constant. This reflects the fact that richness weights all species equally, irrespective of their relative abundance, including rare species, whereas rare species are the most difficult to observe.
Formally, if Sobs is the number of observed species in the sample, then the estimated number of species in the community, Ŝ, is

$$\hat{S} = S_{\mathrm{obs}} + S_{\mathrm{rare}},$$
where Srare is the expected number of unobserved species in the community. The subscript implies that those unobserved species are relatively rare, at least with respect to the sampling intensity.
It is apparent that the sample diversity and the true diversity can differ. It is less apparent whether the differences are likely to be small or, potentially, biased. Whatever the diversity metric, efforts to estimate the diversity of a community from a sample must involve some form of extrapolation to estimate the value of Srare.
Consider a set of observations such that $F_k$ is the number of species that are each observed exactly k times in a sample of size N. There are $S_{\mathrm{obs}} = \sum_{k \ge 1} F_k$ observed species, such that $\sum_{k \ge 1} k F_k = N$. Classic probability theory would suggest that the most likely community frequency of each species would be equal to its observed frequency; for example, a species observed three times in a sample of 100 should have a community abundance of 3%. This estimate is biased, because rare species are less likely to be observed than common species. Common species appear, on average, to be slightly more common in a finite sample. Let $f_k$ be the true community frequency of a species that is observed k times in a sample of size N. In this way, it is convenient to introduce the notion of $f_0$, that is, the community frequency of an unobserved species. These species are considered rare, at least with respect to the sample. If there are $S_{\mathrm{rare}}$ such rare species, then their true relative abundance in the community is $S_{\mathrm{rare}} f_0$. Estimating the number of species in a community is then dependent on estimating $f_0$ and by extension the number of unobserved species, $S_{\mathrm{rare}}$.
Observations of the rarest species in the sample serve as an upper limit for the relative abundance of individuals from unobserved species. I. J. Good and Alan Turing derived an accurate estimator of the relative abundance of unobserved species (Good 1953; Gale and Sampson 1995). The Good-Turing theory presumes that unobserved species should have a maximal relative abundance equal to the observed total relative abundance of the rarest observed species in the sample, that is, the singletons. This leads to the estimate that $S_{\mathrm{rare}} f_0 = F_1/N$. The intuition behind this equivalence is that the prior probability that a newly examined individual is a new species, unobserved in the sample, is equal to the fraction of singletons in the sample. The number of unobserved species is therefore $S_{\mathrm{rare}} = F_1/(N f_0)$. But what is $f_0$? To move forward, consider a similar upper limit for the relative abundance of species observed only once. This relative abundance should be equal to the observed relative abundance of the second-rarest observed species in the sample, that is, the doubletons (those species observed twice in the sample). Via this logic, $F_1 f_1 = 2F_2/N$. Finally, assume that $f_0 \le f_1$, that is, the true relative abundance of the rare species must be no greater than that of the singleton species. The expected number of rare species in the upper limit in which $f_0 = f_1$ is:

$$S_{\mathrm{rare}} = \frac{F_1}{N f_1} = \frac{F_1^{2}}{2F_2}.$$

A lower limit on the number of species in the community is:

$$\hat{S} \ge S_{\mathrm{obs}} + \frac{F_1^{2}}{2F_2}.$$
This is known as the Chao-1 estimator (Chao 1984; Colwell et al. 2004). The key point is that the Chao-1 estimator is a lower limit on the expected number of species in the community, because its correction term is a lower limit on the expected number of unobserved species. It is not the expected number of species in the community. The use of the Chao-1 estimator requires a sufficient number of observations such that F1 > 0 and F2 > 0.
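The Chao-1 lower bound can be computed from the counts of singletons and doubletons in a sample. The following is a minimal sketch, assuming the estimator takes the form Sobs + F1^2/(2 F2), where Sobs is the number of observed species; the function name `chao1` and the sample counts are hypothetical.

```python
from collections import Counter

def chao1(abundances):
    """Chao-1 lower bound on richness from per-species counts in a sample."""
    observed = [n for n in abundances if n > 0]
    freq = Counter(observed)        # freq[k] = F_k, the number of species seen k times
    f1, f2 = freq[1], freq[2]       # singletons and doubletons
    if f1 == 0 or f2 == 0:
        raise ValueError("this form of Chao-1 needs F1 > 0 and F2 > 0")
    return len(observed) + f1 * f1 / (2 * f2)

# Hypothetical sample: 8 observed species, 4 singletons, 2 doubletons.
sample = [10, 5, 2, 2, 1, 1, 1, 1]
print(chao1(sample))
```

For this sample the bound is 8 + 16/4 = 12 species, which should be read as "at least 12" rather than as an estimate of the true richness.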
In practice, it is all too often forgotten that the estimator provides a lower limit. This might seem like a technicality, but it is not. Lower bounds and actual estimates need not be coincident. For example, consider two tourists wandering bewildered on their first trip to Manhattan, on a rainy and cloudy fall morning. Walking downtown they observe a skyscraper, taller than the cloud cover. The tourists estimate that skyscraper A is at least 400 feet tall assuming 10 feet per observed floor. The day improves and the tourists soon eye another skyscraper, this one at least 600 feet tall assuming, again, 10 feet per observed floor. That afternoon, the clouds and rain finally break, and the tourists, this time, better oriented and walking back uptown, observe the true heights and identities of the buildings: skyscraper A, the Empire State Building, is 1250 feet tall, and skyscraper B, the Chrysler building, is only 1050 feet tall. In other words, lower limits and actual values need not always coincide. Useful comparative statements require both lower and upper limits so as to provide guidance on the expected ranges of values to be compared. The same logic as in the example can also be used to provide an upper limit on species richness given a sample abundance distribution, F1, F2,…, Fkmax.
The preceding logic assumes that unobserved species are, at most, as common as singleton species in the community; it is also possible that they are less common. In the limit of extreme rarity, all unobserved species will have a true community relative abundance of $1/N_{tot}$, where $N_{tot}$ is the size of the community:

$$S_{\mathrm{rare}} = \frac{F_1/N}{1/N_{tot}} = \frac{F_1 N_{tot}}{N},$$

such that an upper limit on the number of species in a community is

$$\hat{S} \le S_{\mathrm{obs}} + \frac{F_1 N_{tot}}{N}.$$
Unlike the lower bound, the upper limit scales with the ratio between system and sample size. As mentioned before, this ratio often exceeds $10^{10}$ for undersampled viral communities. The gap between the lower limit and upper limit is very large. Therefore, lower bounds on viral species richness may, in fact, provide minimal useful information on the true, but unknown, viral species richness. It is difficult to compare the species richness of two communities without an accurate estimate of either. In summary, as in the analogy of estimating building heights by eye on cloudy days, quantitative methods used to infer species richness cannot see the part of the signal necessary to infer the property.
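The size of the gap between the two limits can be made concrete with a short calculation. The following is a sketch only, assuming a lower limit of Sobs + F1^2/(2 F2) and an upper limit of Sobs + F1 Ntot/N (all unobserved species at relative abundance 1/Ntot); the function name `richness_bounds` and the input values are hypothetical.

```python
def richness_bounds(s_obs, f1, f2, n_sample, n_total):
    """Return (lower, upper) limits on species richness from sample summaries."""
    lower = s_obs + f1 ** 2 / (2 * f2)        # unobserved species as common as singletons
    upper = s_obs + f1 * n_total / n_sample   # unobserved species each at abundance 1/n_total
    return lower, upper

# Hypothetical sample of 23 individuals from a community 10^10 times larger.
lower, upper = richness_bounds(s_obs=8, f1=4, f2=2, n_sample=23, n_total=23 * 10**10)
print(lower, upper, upper / lower)
```

With a system-to-sample ratio of 10^10, the two limits differ by roughly nine orders of magnitude, so the lower limit alone says little about the true richness.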
The consequences of this can be examined in a number of ways. For example, it is possible that two communities with vastly different true species richness values will have the same estimated lower limit on species richness according to Chao-1. As an illustrative example based on methods first presented in Haegeman et al. (2013), consider a microbial community sample from the surface oceans including 7068 distinct organisms that belong to 811 operational taxonomic units (OTUs) (Rusch et al. 2007). An OTU is a heuristic definition of a “species” in terms of sequence space, such that 16S rRNA sequences which have >97% similarity are clustered into a single OTU. An idealized community can be constructed from the sample data by assuming that the community includes additional rare species whose frequencies are each smaller than that of the rarest species in the sample; that is, they have true frequencies less than 1/7068. The cumulative relative abundance in the tail of the distribution should be $F_1/N$. The number of species in the tail can vary, so long as the relative abundances of all rare species sum to $F_1/N$. In practice, different rare tails can be used. Here a power-law tail with exponent −2 as a function of the rank of the species was utilized. In turn, the relative abundance in the community of each observed species in the sample must be corrected so that the total probability sums to unity. Applying a Good-Turing correction (Good 1953; Gale and Sampson 1995) leads to three in silico communities with $10^{4}$, $10^{5}$, and $10^{6}$ distinct species, respectively.
A rarefaction curve, also known as a collector’s curve, s(n), can be generated from each community. The curve represents the expected number of unique species observed as a function of n collected individuals out of N in the sample. For a community with Strue species, each of which has a relative frequency of qi,

$$s(n) = \sum_{i=1}^{S_{\mathrm{true}}} \left[ 1 - (1 - q_i)^{n} \right].$$
The rarefaction curve summarizes a community abundance distribution and contains the necessary information to extrapolate each of the diversity indices, D0, D1, and D2. What is striking about the three in silico communities is that their rarefaction curves are indistinguishable, despite having dramatically different species richness (Figure 6.6). The true species richness measured in terms of OTUs could differ by multiple orders of magnitude, but this would not be apparent in the data. The Chao-1 estimator for this community is 4604. As was suggested earlier, the Chao-1 estimator is, in practice, an excellent lower bound on species richness, but it is also not necessarily informative with respect to the actual species richness. For example, 4604 is much less than both 100,000 and 1 million species. In practice, the use of diversity metrics to estimate viral diversity of undersampled communities should focus on those estimators that measure diversity in a sufficiently weighted average. These estimators include Shannon diversity and Simpson diversity, though Simpson diversity is even more robust.
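The insensitivity of rarefaction curves to the rare tail can be illustrated numerically. The following is a minimal sketch, assuming the expected number of distinct species among n sampled individuals is s(n) = Σ [1 − (1 − qi)^n]. It is not the construction used for Figure 6.6: for simplicity, the rare tail here consists of equally rare species rather than a power-law tail, and the names `rarefaction` and `with_rare_tail`, as well as all numerical values, are hypothetical.

```python
def rarefaction(freqs, n):
    """Expected number of distinct species among n individuals drawn at random."""
    return sum(1.0 - (1.0 - q) ** n for q in freqs)

def with_rare_tail(common, tail_mass, n_rare):
    """Rescale common frequencies to 1 - tail_mass and append n_rare equally rare species."""
    scale = (1.0 - tail_mass) / sum(common)
    return [q * scale for q in common] + [tail_mass / n_rare] * n_rare

# Two hypothetical communities: same common species, same tail mass,
# but a 100-fold difference in the number of rare species.
common = [0.4, 0.3, 0.2, 0.1]
small = with_rare_tail(common, tail_mass=0.05, n_rare=10**3)
large = with_rare_tail(common, tail_mass=0.05, n_rare=10**5)

for n in (10, 100, 1000):
    print(n, round(rarefaction(small, n), 2), round(rarefaction(large, n), 2))
```

The two curves stay close at small sample sizes even though the underlying richness differs by two orders of magnitude, which is the qualitative point of the comparison among in silico communities.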
6.3.3 Diversity of viral genes and genotypes
The estimation of viral gene and genotype diversity is a new area of concern, made possible by technological advances in culture-independent sequencing of viral communities. This field of study is termed viral metagenomics (Edwards and Rohwer 2005). A hallmark 2002 study led by Mya Breitbart and Forest Rohwer analyzed ∼1000 short viral DNA sequences from 200 L of seawater. Each extracted sequence was on the order of 1 kbp in length (Breitbart et al. 2002). Most of these viral sequences were unlike anything anyone had examined before. As was reported, “over 65% of the sequences were not significantly similar to previously reported sequences, suggesting that much of the diversity is previously uncharacterized” (Breitbart et al. 2002). In comparison, a single liter of seawater contains on the order of $10^{10}$ viruses, and if each has a genome approximately 50 kbp in length, then the total number of kbp sequences in a 200 L community should be on the order of $10^{14}$. This is a factor of $10^{11}$ times greater than the sample.
The potential scope of a viral diversity study has increased nearly 1000-fold in terms of raw read length in the past decade, yet the problem of the unknown remains. A recent Pacific Ocean virome study presented a pipeline for organizing the sequence space into protein clusters (Hurwitz and Sullivan 2013). Prior knowledge with respect to the origins of sequences in the virome was described as follows: “the majority (87% photic zone, 91% aphotic zone) of the reads could not be classified based on sequence similarity to known taxa” (Hurwitz and Sullivan 2013). It is evident that any estimate of diversity will have to address questions that can be inferred given the limited amount of prior information available.
For other types of organisms, genome diversity is used as the basis for species definitions as well as phylogenomic classification. The OTU concept for bacterial species is made possible by the fact that all bacterial cells share universal markers. The 16S rRNA gene encodes a functionally active RNA component of the bacterial ribosome. There is no analogous universal gene for viruses. The study of viral taxonomy, viral phylogenomics, and genome-based definitions of viral species is ongoing. One proposal is to utilize genetic markers that apply to a subset of the viral biosphere and that might be used to differentiate subsets of taxa (Rohwer and Edwards 2002). An alternative is to use contig analysis to circumvent the lack of universal marker genes for viruses by using the degree of overlap among sequences from a viral metagenome (Breitbart et al. 2002).
A contig is a concatenation of one or more overlapping sequences. Contig analysis assumes that if two short sequences in a metagenome overlap sufficiently, then this indicates that they were derived from the same virus type. These overlaps constitute a contig spectrum, whose elements, cq, are defined as the number of contigs involving q sequences. For example, c1 is the number of sequences that do not overlap any other sequence, whereas c2 is the number of contigs involving a pair of sequences, and so on. Using this method, in silico communities were generated, each with different numbers of viral species. Contig spectra could then be derived from each community. The range of species richness values from synthetic communities whose contig spectra were similar to the measured contig spectra generated a posterior distribution from which to estimate viral diversity based on sample data. Alternatively, the Chao-1 estimator was also used to estimate lower bounds by interpreting the contig spectra as a form of species-frequency distribution. In brief, it was assumed that each contig represented a viral OTU and that the number of reads, q, in the contig denoted its abundance; hence, the values of cq could be used within the Chao richness estimation formalism. This direct approach, known as PHACCS (Phage Communities from Contig Spectrum) (Angly et al. 2005), estimated the number of viral types in the 200 L seawater samples described earlier as between 374 and 7114 (Breitbart et al. 2002). Similar methods were used to conclude that viral diversity in marine sediment ranges between “10,000 and 1 million viral genotypes” (Edwards and Rohwer 2005). In both cases, estimates vary by at least one, if not two, orders of magnitude.
The difficulty stems not only from the challenge of estimating diversity of a viral type when no universal marker is available but also from the fundamental fact that estimating species richness, whatever the context, is incredibly hard, particularly when doing so from a small sample of a large community.
An alternative view of viral diversity can be gleaned from examining the diversity of viral protein sequences rather than viral genotypes. Putative protein sequences are inferred by identifying open reading frames (ORFs) in viral metagenomes. The functions of these ORFs remain largely uncharacterized. An increasingly popular method among empiricists is to group short sequences into protein clusters (Hurwitz and Sullivan 2013). The rationale is that doing so provides a lens into the diversity of distinct viral functional genes within a viral community. In some cases, such clustering may enable the generation of a hypothesis of a function for an ORF for which the function was not previously known. The drawback of such clustering is that given the large number of unknowns, the distinct ORFs in a cluster need not necessarily correspond to the same or similar function. Direct tests of virus protein functions in a community context remain elusive. Nonetheless, the same principle of estimating the diversity of species applies to protein clusters. Each cluster is considered a different type, and the number of sequences in a cluster denotes its abundance. The cluster-size distribution can be used to estimate the diversity of viral function of the sample and, by extension, that of the community.
An analysis of viral functional diversity is presented in Figure 6.7. The sample was taken from surface waters off Monterey Bay, where the community viral metagenome was sampled using procedures designed to maximize the yield of dsDNA marine viruses (John et al. 2011). Similar analyses of rarefaction curves for virus protein clusters have been used to identify environmental correlates with changes in viral functional diversity (Hurwitz et al. 2014). The relative frequency of distinct types of functions may, eventually, provide information on the ways that virus-host interactions vary in the oceans, particularly in the face of competition with other viruses and with zooplankton, and given limited nutrients.
6.4 Virus-Microbe Infection Networks
6.4.1 From sensitivity relations to infection networks
The study of who infects whom has become a staple of virus-host research since Luria’s early publication on “sensitivity relations” between bacteria and phage (Luria 1945). The target host of cross infection includes interactions among viruses and (i) bacteria (Weitz et al. 2013), (ii) archaea (Ceballos et al. 2012), and (iii) microeukaryotes (Allen et al. 2007). In the pregenomics era, exposure to viruses was a way to identify whether a newly isolated bacterial clone was, in fact, distinct from others in a culture collection. Phage susceptibility was often able to detect differences in bacteria where serological tests involving binding to antibodies failed. Even in the genomics era, cross-infection assays are often more sensitive to genetic variation than is marker-gene-based sequencing. Whether a virus can infect a set of given hosts appears to be weakly correlated with phylogenetic distance, for example, as in infectivity patterns of viruses that infect marine cyanobacteria from the Synechococcus or Prochlorococcus genera (Sullivan et al. 2003). The rationale for such discordance is that phylogenetic divergence is quantified on the basis of conserved rRNA genes. Differences in phage susceptibility are often driven by variation in surface receptors, whether proteins or carbohydrates, or other pathways that evolve faster than do marker genes. There is a disconnect between measures of bacterial taxonomy and of phage susceptibility.
To directly link the sequence of a host and a sequence of a virus with a predictive model of infectivity remains an unsolved problem. From an ecological perspective, a cross-infection assay provides direct information on the extent to which populations have diverged in terms of a functionally relevant trait. Cross-infection assays have become part of standard protocols (at least until recently, as one of the general suite of microbiological methods to be used in the interest of thoroughness), despite the fact that it is not readily apparent how the resulting data should be interpreted. The word interpreted is apt, as the data had been recalcitrant to quantitative analysis. The generation of virus-host infection networks preceded the widespread adoption of methods to visualize, analyze, and interpret complex networks (Albert and Barabasi 2002; Newman 2003; Strogatz 2001). It may be relatively simple to assess whether two hosts or two viruses have different infection patterns, but finding patterns in larger networks is nontrivial. To understand why, consider a cross-infection assay with 10 hosts and 10 viruses. Such an experiment generates 100 data points, one for each of the elements of the 10 × 10 matrix, M. In matrix form, there are 10 rows, one for each host, and 10 columns, one for each virus. There are 10! = 10 × 9 × 8 × … × 1 distinct orderings that position one host per row, and the same number of orderings for viruses. Thus, there are more than $1.3 \times 10^{13}$ ways to visualize the network of interactions in a relatively small study. When there are 20 hosts and 20 viruses, the number of orderings rises to nearly $6 \times 10^{36}$. Visual inspection of a matrix is unlikely to reveal patterns without prior information on the ordering of hosts and viruses, such as a phylogenetic tree.
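The counts of possible matrix layouts follow from multiplying the number of row orderings by the number of column orderings. A brief check of the arithmetic, in which the function name `matrix_orderings` is hypothetical:

```python
import math

def matrix_orderings(n_hosts, n_viruses):
    """Number of ways to order hosts along rows and viruses along columns."""
    return math.factorial(n_hosts) * math.factorial(n_viruses)

print(f"{matrix_orderings(10, 10):.2e}")   # 10 hosts x 10 viruses
print(f"{matrix_orderings(20, 20):.2e}")   # 20 hosts x 20 viruses
```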
What patterns might there be in a virus-host infection network derived from cross infections among natural populations? The study of ecological relationships such as food webs and plant-pollinator interactions provides some guidance. In a food web (Cohen 1978; Cohen et al. 1990), an edge is present if a specific predator can consume a specific prey item. In a plant-pollinator network, an edge is present if there is an observed association between a specific pollinator and a specific target plant (Memmott 1999; Bascompte et al. 2003; Bascompte and Jordano 2013). In both cases, a node represents a population. A key hallmark of analyses of both systems is that nodes differ in their degree of interaction; for example, there are both specialist and generalist predators just as there are specialist and generalist pollinators. To quantify such differences, given an interaction matrix M of size m × n, one may define the connectance as C = E/(mn), where E is the number of nonzero entries in the matrix. In network representation, E is the number of edges. How many edges are likely? Consider that all interactions are equally likely, with probability p, and that there are the same number of nodes in both sets in the bipartite network, m = n. A similar set of assumptions was invoked by the mathematicians Paul Erdős and Alfréd Rényi (Erdős and Rényi 1960), who argued that the probability that a node in a unipartite random network with n nodes would have k interactions should be

$$P(k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}.$$
The variation in degree of interaction in a random network follows a binomial distribution, in which each type should have, on average, the same number of interactions. When n is sufficiently large and p ≪ 1, then the degree distribution within so-called Erdős-Rényi networks approaches that of a Poisson distribution with average degree $\langle k \rangle = p(n-1) \approx pn$, such that $P(k) \approx \langle k \rangle^{k} e^{-\langle k \rangle}/k!$. In contrast, both food webs (Cohen 1978; Williams and Martinez 2000; Allesina et al. 2008) and plant-pollinator systems (Memmott 1999; Bascompte et al. 2003; Bascompte and Jordano 2007) have an unusually large number of nodes that have either many or few interactions. Indeed, the field of network science recognizes that the Erdős-Rényi network model is a useful point of departure for studying the actual structure of complex networks; but it also recognizes that this null model is a very poor model of actual structure, whether of social, biological, or physical interactions.
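The convergence of the binomial degree distribution to a Poisson distribution can be checked numerically. The following is a sketch under stated assumptions: each node in a unipartite random network of n nodes has n − 1 potential partners, each connected independently with probability p, and the Poisson mean is taken as p(n − 1). The function names and the values of n and p are hypothetical.

```python
import math

def binomial_degree_pmf(k, n, p):
    """Probability that a node has degree k in a random network of n nodes."""
    return math.comb(n - 1, k) * p ** k * (1 - p) ** (n - 1 - k)

def poisson_degree_pmf(k, mean_degree):
    """Poisson approximation to the degree distribution."""
    return mean_degree ** k * math.exp(-mean_degree) / math.factorial(k)

n, p = 1000, 0.005                 # hypothetical large, sparse network
mean_degree = p * (n - 1)
for k in range(0, 11, 2):
    print(k, round(binomial_degree_pmf(k, n, p), 4), round(poisson_degree_pmf(k, mean_degree), 4))
```

For a large, sparse network the two columns agree closely, whereas empirical food webs and plant-pollinator networks depart from both.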
One way to consider deviations of actual networks from idealized networks is to compare the number of interactions, ki, of node i with the expected number. The expected number of interactions for each node in a bipartite network is either Cm or Cn, depending on which population type is examined. The number of interactions can be thought of as samples from the marginal distribution of the interaction matrix M. In the present context of virus-host interactions, the comparison between observed and idealized degrees of interaction may be posed as follows: within a group of bacteria, are some variants relatively more or less susceptible to infection? Similarly, within a group of viruses, are some variants relatively more or less able to infect target hosts? Further, are there unexpected patterns of relationships among viruses and hosts? The answers to all these questions are yes and are the subjects of the next two sections.
6.4.2 Target patterns within infection networks
Consider a bipartite infection network M of size m × n representing the interaction among m host types and n phage types. Each element in the infection network represents the outcome of an infection assay, such as a spot or plaque assay. Infection assays are scored in terms of whether or not there is evidence of infection. In this Boolean scheme, Mij = 0 denotes no infection, and Mij = 1 denotes successful infection between host type i and phage type j. In some circumstances, the quantitative level of infection, for example, in terms of plaque-forming units, may be measured, in which case the value of Mij may be a continuous response. The current discussion of infection network patterns focuses on analysis of Boolean infection networks, that is, those in which the entries in M are either 0 or 1. The extension of the present analysis to quantitative infection assays is ongoing. Given this notation, the number of links is E = ∑ij Mij. Further, ki = ∑j Mij and dj = ∑i Mij define the degree of hosts and viruses, respectively. The degree of a node is defined as the number of interactions it has with nodes of the other type. For sufficiently large networks, it is also possible to consider the distribution of the degrees, ki and dj, a distribution termed the degree distribution of rows and columns, respectively.
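These summary quantities are simple to compute from a Boolean infection matrix. The following minimal sketch uses a small hypothetical matrix with hosts as rows and viruses as columns; the function name `network_summary` and the dictionary keys are illustrative choices, not part of any published library.

```python
def network_summary(M):
    """Summarize a Boolean infection matrix (rows: hosts, columns: viruses)."""
    m, n = len(M), len(M[0])
    host_degree = [sum(row) for row in M]                                  # k_i
    virus_degree = [sum(M[i][j] for i in range(m)) for j in range(n)]      # d_j
    n_links = sum(host_degree)                                             # E
    connectance = n_links / (m * n)                                        # C = E / (mn)
    return {
        "E": n_links,
        "C": connectance,
        "host_degree": host_degree,
        "virus_degree": virus_degree,
        "expected_host_degree": connectance * n,    # viruses per host if links were uniform
        "expected_virus_degree": connectance * m,   # hosts per virus if links were uniform
    }

# Hypothetical assay with 3 hosts and 4 viruses.
M = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
summary = network_summary(M)
print(summary)
```

Comparing each observed degree with the corresponding expected degree gives a first indication of which hosts are unusually susceptible and which viruses are unusually broad in host range.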
What are some of the possible structures that can arise in a virus-host infection network if M is not random? A recent series of papers introduced a network-theoretic approach to identifying patterns that could emerge in the cross infection of environmental microbial hosts and viruses (Flores et al. 2011; Poisot et al. 2011, 2012; Beckett and Williams 2013; Weitz et al. 2013). Two key patterns are modularity and nestedness (Weitz et al. 2013), for which idealized networks are shown in Figure 6.8. The modularity of a matrix refers to the extent to which the interactions in a bipartite network occur predominantly within mutually exclusive groups of viruses and bacteria. That is, a module denotes a set of hosts and viruses for which the viruses tend to infect hosts within the module and not outside it, and for which the hosts tend to be infected by viruses within the module and not outside it. For a matrix to be modular, the cross-infection structure should have a tendency toward an excess of in-group infections compared with a random matrix. The nestedness of a matrix refers to the extent to which the sets of interactions of hosts are embedded, one into another, and for which the sets of interactions of viruses are embedded, one into another. Such an embedding can occur, for example, if the host range of the most specialist virus is a subset of that of the second most specialist virus, whose host range is a subset of that of the third most specialist virus, and so on, until all host ranges are subsets of that of the most generalist virus. A similar embedding should occur for the virus range of hosts, that is, the set of viruses that can infect a given host. For a matrix to be nested, the cross-infection structure should have a tendency toward an excess of partial subsets within infection ranges compared with a random matrix.
Formally, the quantitative definition of the modularity of a bipartite network, as proposed in Barber (2007), is

$$Q = \frac{1}{E} \, \mathrm{Tr}\left[ R^{T} \tilde{B} \, T \right].$$

Here, E is the number of interactions in the network, $\tilde{B}_{ij} = M_{ij} - k_i d_j / E$, $R = [r_1 | r_2 | \ldots | r_m]^{T}$ is an m × c index matrix, and $T = [t_1 | t_2 | \ldots | t_n]^{T}$ is an n × c index matrix, where c is the number of modules (Barber 2007). Here, R and T are Boolean matrices whose row sums must always be equal to 1. The column with the nonzero entry denotes the module associated with each host and virus (as indexed by the row). Thus, the modularity Q increases whenever there is an interaction between host i and virus j that have been grouped together in the same module. The lowest value of modularity is −1 and corresponds to the occurrence of interactions always between nodes from different modules in a very large network. In practice, such low values are rarely realized and usually result from a poor assignment of nodes into modules. The highest value of modularity is 1 and corresponds to the occurrence of interactions always between nodes of the same module in a very large network. Algorithms to quantify the modularity of a matrix must try to find the module assignments embedded in R and T that maximize Q. The number of possible assignments increases exponentially with the size of the network. Moreover, the number of modules is usually not known in advance. Multiple values of c must be tested, increasing the complexity of the task. In response, heuristic methods have been introduced to try to identify nearly optimal solutions. The widely used Bipartite, Recursively Induced Modules (BRIM) method is now available in a number of software packages (Barber 2007). The interested reader may consult the software and documentation associated with libraries such as bipartite (an R-based library) (Dormann et al. 2008) or BiMAT (a MATLAB-based library) (Flores et al. 2014) for more information on implementation.
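To make the definition concrete, modularity can be evaluated for a given assignment of hosts and viruses to modules. The following is a minimal sketch, assuming Barber's bipartite modularity can be written as Q = (1/E) Σij (Mij − ki dj/E) δ(gi, hj), where gi and hj are the module labels of host i and virus j. It only scores a supplied assignment; it does not search for the assignment that maximizes Q, as BRIM does. The function name and the example matrix are hypothetical.

```python
def bipartite_modularity(M, host_module, virus_module):
    """Barber-style modularity Q of a Boolean matrix for given module labels."""
    m, n = len(M), len(M[0])
    k = [sum(row) for row in M]                                 # host degrees k_i
    d = [sum(M[i][j] for i in range(m)) for j in range(n)]      # virus degrees d_j
    n_links = sum(k)                                            # E
    q = 0.0
    for i in range(m):
        for j in range(n):
            if host_module[i] == virus_module[j]:               # only same-module pairs count
                q += M[i][j] - k[i] * d[j] / n_links            # observed minus expected
    return q / n_links

# Hypothetical network with two perfectly separated 2 x 2 modules.
M = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
q_two_modules = bipartite_modularity(M, [0, 0, 1, 1], [0, 0, 1, 1])
q_one_module = bipartite_modularity(M, [0, 0, 0, 0], [0, 0, 0, 0])
print(q_two_modules, q_one_module)
```

Placing every node in a single module gives Q = 0, whereas the two-module assignment gives Q = 0.5 for this small matrix; values closer to 1 require many modules in a large network.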
The quantitative definition of nestedness includes alternative metrics. Two of the most commonly used are the NTC (Nestedness Temperature Calculator) (Atmar and Patterson 1993) and NODF (nestedness metric based on overlap and decreasing fill) (Almeida-Neto et al. 2008). Despite these rather opaque names, both aim to quantify the degree to which cross-infection structure can be ordered in terms of nested sets. In NTC, the orders of the rows (bacteria) and columns (viruses) are shuffled so that many of the interactions appear as subsets of each other. The convention here is that the most susceptible host is in the topmost row, and the most generalist virus is in the leftmost column (Figure 6.8). In practice, the heuristic algorithm of sorting rows and columns by their degrees, ki and dj, respectively, is nearly as effective as complex alternatives (Flores et al. 2011). Then, given an ordering of rows and columns, the value of nestedness depends on the extent to which the interactions differ from that of a perfectly nested matrix given the same size and number of interactions. The details of this method are presented in Weitz et al. (2013).
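In NODF, the nestedness is defined as

$$N_{\mathrm{NODF}} = \frac{\sum_{i<j} N_{ij}^{\mathrm{row}} + \sum_{i<j} N_{ij}^{\mathrm{col}}}{m(m-1)/2 + n(n-1)/2},$$

where the contributions of a pair of rows and of a pair of columns are

$$N_{ij}^{\mathrm{row}} = \frac{\mathbf{r}_i \cdot \mathbf{r}_j}{\min(k_i, k_j)}\left[1 - \delta(k_i, k_j)\right], \qquad N_{ij}^{\mathrm{col}} = \frac{\mathbf{c}_i \cdot \mathbf{c}_j}{\min(d_i, d_j)}\left[1 - \delta(d_i, d_j)\right].$$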
In this definition, ri and ci are vectors that represent row i and column i, respectively, of the bipartite adjacency matrix, and δ = 1 only when its two arguments are equal. By way of example, consider two rows of a 20 × 20 interaction matrix whose degrees differ (e.g., ki = 10 and kj = 5). Then, the largest possible contribution from such a pair occurs when all five interactions of row j are also present in row i. In that scenario, ri · rj = 5, and so the row-pair term Nij = 1. Alternatively, consider the case where none of the five interactions in row j is present in row i, in which case ri · rj = 0, and so Nij = 0. Finally, consider the case where ki = 5 = kj. The value of Nij must be zero in this case, because 1 − δ(5, 5) = 0; there cannot be decreasing fill between rows with the same number of interactions. In this way, NODF defines nestedness as the fraction of the m(m − 1)/2 row pairs and of the n(n − 1)/2 column pairs whose interaction ranges are subsets of one another. This method does not depend on ordering or labeling, though ordering of the matrix will help in visualizing the pattern. Multiple implementations are available for both the NTC and NODF metrics (Beckett et al. 2014; Flores et al. 2014).
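The pairwise logic just described can be sketched directly. The code below is a minimal illustration, not a substitute for the published implementations: the function names are hypothetical, and the choice to assign zero contribution to any pair that includes an empty row or column is an assumption made here to avoid division by zero.

```python
import numpy as np

def pair_sum(A):
    """Sum the nested contributions over all pairs of rows of A."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    overlap = A @ A.T                   # overlap[i, j] = r_i . r_j
    total = 0.0
    for i in range(A.shape[0]):
        for j in range(i + 1, A.shape[0]):
            smaller = min(deg[i], deg[j])
            # Equal degrees contribute nothing (no decreasing fill);
            # empty rows are skipped by assumption.
            if deg[i] != deg[j] and smaller > 0:
                total += overlap[i, j] / smaller
    return total

def nodf(M):
    """NODF of an m x n Boolean matrix (requires m >= 2 and n >= 2)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    n_pairs = m * (m - 1) / 2 + n * (n - 1) / 2
    return (pair_sum(M) + pair_sum(M.T)) / n_pairs

nested = np.triu(np.ones((4, 4)))       # every range is a subset of the next
diagonal = np.eye(4)                    # one-to-one infections, equal degrees
nodf_nested = nodf(nested)
nodf_diagonal = nodf(diagonal)
```

The triangular matrix attains the maximum value of 1, whereas the diagonal matrix has a value of 0 because all rows (and all columns) share the same degree.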
The significance of any measured value of modularity or nestedness depends on a comparison with a null model. In other words, how different is the value of Q or NNTC or NNODF from what might be expected within an ensemble of random networks? The term ensemble refers to a collection of items that have similar, but not identical, properties. Indeed, the ensemble of random networks is meant to preserve some of the properties of the observed network. The two features that are invariably preserved in an ensemble are the size of the network and the average number of interactions, E. That is, every network in the ensemble will have size m × n, and the average number of interactions in a random network in the ensemble should approach E. The Erdős-Rényi prescription for generating networks represents one class of null models. Another constraint can include the requirement that the degree of interactions in the observed network is preserved in the ensemble, either strictly or on average. Degree-constrained random networks represent a more restrictive class of null models. Whatever the choice, the significance of a pattern and the size of its effect can then be calculated by comparing the observed feature of the network with that of networks in the ensemble (Weitz et al. 2013).
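The comparison with a null ensemble can be sketched as follows. This is a minimal illustration under stated assumptions: the ensemble is of the Erdős-Rényi type, with each cell filled independently with probability E/(mn); the one-sided empirical p-value and the z-score are common conventions adopted here rather than prescriptions from the text; and the function names, the number of samples, and the seed are arbitrary choices. A compact NODF function is included only so that the example is self-contained, and any other network statistic could be passed in its place.

```python
import numpy as np

def nodf(M):
    """Compact NODF; pairs with equal or zero degree contribute nothing."""
    M = np.asarray(M, dtype=float)
    total = 0.0
    for A in (M, M.T):                       # rows first, then columns
        deg = A.sum(axis=1)
        smaller = np.minimum.outer(deg, deg)
        valid = (deg[:, None] != deg[None, :]) & (smaller > 0)
        ratio = (A @ A.T) / np.where(smaller > 0, smaller, 1.0)
        total += np.triu(np.where(valid, ratio, 0.0), k=1).sum()
    m, n = M.shape
    return total / (m * (m - 1) / 2 + n * (n - 1) / 2)

def erdos_renyi_test(M, statistic, n_samples=200, seed=0):
    """Compare an observed statistic with an Erdos-Renyi null ensemble."""
    rng = np.random.default_rng(seed)
    M = np.asarray(M)
    fill = M.sum() / M.size                  # E / (m * n)
    observed = statistic(M)
    null = np.array([
        statistic((rng.random(M.shape) < fill).astype(int))
        for _ in range(n_samples)
    ])
    z_score = (observed - null.mean()) / null.std(ddof=1)   # effect size
    p_value = (1 + np.sum(null >= observed)) / (n_samples + 1)
    return observed, z_score, p_value

M_nested = np.triu(np.ones((10, 10), dtype=int))
observed, z_score, p_value = erdos_renyi_test(M_nested, nodf)
```

Every random matrix has the same size as the observed one, and the expected number of interactions equals E, but the degrees of individual nodes are not preserved. A degree-constrained ensemble would require a different sampling scheme.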
It is important to keep in mind the complex relationship between degree distributions and network patterns, including nestedness and modularity. For example, consider the constraint that the degree distribution of a network must be strictly preserved. Further, assume that the network is perfectly nested, and the degree distribution is uniform. There is only one perfectly nested network with a given uniform degree distribution (if the node labels are fixed and specified). To understand why, consider the original network M. Generating a different network that preserves the degrees of the original requires identifying a pair of bacteria i and i′ and a pair of viruses j and j′ such that j infects i but not i′, and j′ infects i′ but not i. If that were the case, then one could swap the infections, thereby creating a new network that preserves the degree of all nodes (including that of i, i′, j, and j′), as in the following schematic:

$$\begin{array}{c|cc} & j & j' \\ \hline i & 1 & 0 \\ i' & 0 & 1 \end{array} \;\longrightarrow\; \begin{array}{c|cc} & j & j' \\ \hline i & 0 & 1 \\ i' & 1 & 0 \end{array}$$
No such pairs of bacteria and viruses exist in a perfectly nested network. If virus type j is relatively more generalist than type j′, then type j must, in a perfectly nested network, also infect i′, because it infects all the hosts that any viruses below it in the ranking infect. Or, if type j is relatively more specialist than type j′, then type j′ must, in a perfectly nested network, also infect i, because its host range includes all hosts infected by virus type j. Similar logic holds for the bacteria. Should one be surprised to observe nestedness given a particular set of degree distributions? No. But should one be surprised to observe that particular set of degrees? Yes, and even more so if one takes as a null model the Erdős-Rényi network, which tends to give a unimodal rather than a uniform degree distribution. The statistical consequences of any of these choices are the subjects of active debate (Fortunato and Barthélemy 2007; Ulrich et al. 2009; Fortuna et al. 2010). The next section considers random ensembles based on preserving the size and average connectivity of the network.
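The argument can be checked numerically by counting the swappable 2 × 2 "checkerboard" patterns in a matrix. The sketch below is a brute-force illustration suitable only for small matrices; the function name is a hypothetical choice.

```python
import numpy as np
from itertools import combinations

def count_checkerboards(M):
    """Count 2 x 2 submatrices of the form [[1, 0], [0, 1]] or [[0, 1], [1, 0]].

    Each such submatrix marks a pair of hosts and a pair of viruses whose
    infections could be swapped without changing any node's degree.
    """
    M = np.asarray(M, dtype=int)
    count = 0
    for i, i2 in combinations(range(M.shape[0]), 2):
        for j, j2 in combinations(range(M.shape[1]), 2):
            a, b = M[i, j], M[i, j2]
            c, d = M[i2, j], M[i2, j2]
            if a == d and b == c and a != b:
                count += 1
    return count

nested = np.triu(np.ones((5, 5), dtype=int))   # perfectly nested
one_to_one = np.eye(2, dtype=int)              # each virus infects one host
swaps_nested = count_checkerboards(nested)
swaps_one_to_one = count_checkerboards(one_to_one)
```

The perfectly nested matrix admits no degree-preserving swaps, whereas the one-to-one matrix admits exactly one, consistent with the claim that a strictly degree-constrained ensemble built from a perfectly nested network contains only that network.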
6.4.3 Features of infection networks within the marine environment
The study of phage-bacteria infection networks in the marine environment owes a debt to the pioneering work of Dr. Karl-Heinz Moebus. Moebus was based at the Helgoland marine station, located on the island of Helgoland in the North Sea off the northwest coast of Germany. There, he and colleagues examined the relationship between viruses and microbial hosts both from the marine station and on cruises in the North Atlantic. The protocols Moebus refined (Moebus 1980; Moebus and Nattkemper 1981) have become, with some updating, standard practice in environmental virus isolation (Wilhelm et al. 2010). In brief, Moebus isolated environmental bacteria using growth media that favored fast-growing heterotrophic microbes, such as members of the Vibrio clade. Then, he exposed populations of bacteria grown from a single culture to bacteria-free seawater that contained marine viruses and evaluated them for lysis. When clearing took place, the contents of the flask presumably held many of the viral progeny that resulted from infection and lysis of the target host. Moebus then plated these viruses back on a lawn of the target host population, picked individual plaques, and then grew them again on the target host to yield a virus isolate. With a growing number of microbial hosts and viruses in the Helgoland collections, it became possible to begin to evaluate a virus for the potential to infect other hosts.
The most striking example of Moebus’s cross-infection work was published in 1981 in Helgoland’s own research journal: Helgoländer Meeresuntersuchungen (Moebus and Nattkemper 1981). In it, “phage-host cross-reaction tests were performed with 774 bacterial strains and 298 bacteriophages.” Individual strains were collected at a series of 48 stations in the Atlantic Ocean. The total scope of the experiments represents more than 230,000 pairwise infection assays. Many of the hosts were resistant to all isolated phage. Further, the only way to distinguish a host strain in this assay was via its susceptibility profile. As a consequence, only a reduced subset of the interaction data was reported in a foldout table in the original journal publication. The computer-assisted digitized result focusing on the interactions between 286 host types and 215 phage types is depicted in Figure 6.9 (Flores et al. 2013). The hosts and viruses were predominantly ordered by station. There is evidently blockiness to the pattern of infections, perhaps associated with geography given the route of the expedition from east to west across the Atlantic, but any visual analysis is based on but one of the very large number of possible permutations of rows and columns. The authors did notice that “sensitivity marks are distributed unevenly” and that there appeared to be clusters of interacting types. They attributed these clusters to differences in infectivity arising from geographic separation.
The advantages of network-based analyses are evident when the same dataset is visualized in a layout highlighting the modularity inherent within the data (Figure 6.10). Here, the rows and columns have been reordered so that those phage that tend to infect the same hosts are grouped together. In addition, blocks have been drawn for each cluster corresponding to the modules identified by the BRIM algorithm (Flores et al. 2013). There are 49 blocks, and 94% of all interactions occurred within modules. The measured modularity of Q = 0.795 was significant (p < 10^−5) and of large effect. Although the composition of modules does exhibit a station effect, many hosts and viruses from different stations are included inside modules. Statistical analysis revealed that the number of stations represented inside a module was less than that expected by chance; hence, phage and host isolated at a subset of stations tend to constitute modules. These stations have a geographic signal, but it is not so clear as to demarcate geographically colocated stations. Individual viruses were found to infect hosts located thousands of kilometers away. Similar analyses of phage-host interactions from soil communities, for example, those of Pseudomonas aeruginosa isolates and associated phage, have found that viruses are better adapted to co-occurring bacteria than to bacteria isolated from other habitats (Koskella et al. 2011; Buckling and Brockhurst 2012). Extending such logic to the oceans requires consideration of the physical flow of water masses as well as environmental factors that shape the composition of hosts and viruses. For instance, does coevolution among viruses and hosts drive divergence in infectivity, for example, owing to specialization on those hosts in the community? Or is allopatric separation of hosts followed by specialization of viruses on locally adapted hosts the norm? Answers to these questions cannot be found in Moebus’s study, as it predated the standard use of genetic analysis of microbes and their viruses. Perhaps this is why this early work is often ignored, but it should not be.
Indeed, the scope of the Moebus study contrasts with the typical size of cross-infection assays. A meta-analysis of 38 distinct virus-host infection networks found that the typical assay included 19 host isolates and 11 viral isolates, with 65 of the roughly 200 possible interactions (19 × 11 = 209) being positive (Flores et al. 2011). The studies largely focused on a set of closely related bacteria; for example, given available information, the taxonomic classification of bacteria in any given study differed, at most, at the genus level. Given that scale, a significantly nested pattern was observed in 27 of the 38 studies. Two examples are included in Figure 6.11, both taken from studies that focused on marine systems: one involving Flavobacterium psychrophilum, a fish pathogen, and the other involving Cellulophaga baltica, a marine heterotroph. As shown in the previous chapter, nestedness may be the result of a coevolutionary process in which hosts increase their resistance in response to phage targeting and in which phage expand their host range to target these newly evolved hosts. Investigation of the internal structure of modules in the Moebus study reveals that 7/13 of the largest modules are significantly nested (Flores et al. 2013). Different processes involving both specialization (associated with enhanced modularity) and coevolutionary range expansions (associated with enhanced nestedness) may be operating concurrently. Or, as theoretical work has suggested, a single coevolutionary mechanism could be responsible for both patterns, with information on quantitative levels of infection seemingly crucial to providing further insight (Beckett and Williams 2013).
In addition to foundational questions regarding coevolution and ecology, the study of complex infection networks in natural systems raises a number of issues of practical concern: for example, which hosts do viruses infect, and can the target host of a virus be predicted from sequence alone? It is evident that viruses are, relatively speaking, specialists when compared with other sources of microbial mortality, such as grazing. As with many topics in phage biology and ecology, Salvador Luria seems to have anticipated many of the challenges and limits ahead. Luria (1945) speculated on the potential for widespread cross infection:
It is interesting, from the standpoint of bacterial taxonomy, that while bacterial viruses may be active on species belonging to different genera, chiefly within the family Enterobacteriaceae, no virus has ever been found to be active on members of different families.
Recently, Hyman and Abedon (2010) and Meaden and Koskella (2013) compiled examples of published studies in which a single phage isolate was shown to infect and lyse hosts from multiple genera. Three such examples are (i) Pseudomonas and Erwinia (Koskella and Meaden 2013); (ii) Salmonella, Klebsiella, and Escherichia (Bielke et al. 2007); and (iii) Pseudomonas, Sphaerotilus, and Escherichia (Jensen et al. 1998). To this list must be added individual cyanophage isolates that can infect individual cyanobacteria from either Prochlorococcus or Synechococcus (Sullivan et al. 2003). In this last case, phylogeny provided some information on the likelihood of cross infectivity and, perhaps, even the basis for host-range variation in the isolated phage. For instance, the morphotype of isolated viruses was correlated with the ecological niche in which the host is found: podoviruses were isolated from high-light-adapted Prochlorococcus, and myoviruses were predominantly isolated from Synechococcus strains (Sullivan et al. 2003). The myoviruses had a much broader host range than did the podoviruses: myoviruses were able to infect isolates from both genera, whereas the podoviruses were able to infect at most one additional host isolate beyond that on which they were isolated. It remains an open question whether similar principles relating morphology to cross-infectivity exist in other marine host-virus systems.
To move forward, new methods are needed to probe the statistical limits to cross-infectivity, keeping in mind the following principles: (i) even a single base pair change can modify the degree of permissiveness to viral infection from highly permissive to completely resistant; and (ii) there are almost no reported instances of cross infection of two hosts by a single virus isolate if those hosts differ taxonomically at the family level or above. The study of Jensen et al. (1998) is notable for being perhaps the only published claim of a phage whose host range includes bacterial isolates from distinct families. Culture-independent methods are likely to be essential to probing the network of cross infection in marine environments and elsewhere. In this same vein, the study of Moebus and Nattkemper (1981) quantified two types of infection levels, possibly attributable to lytic and lysogenic events. A recent study of Cellulophaga baltica, a ubiquitous ocean heterotroph, found that differences in infection strength may be the result of distinct environments (Holmfeldt et al. 2014). In nutrient-enriched environments, lysis was favored, whereas in nutrient-depleted environments, lysogeny was favored. This finding is consistent with large-scale environmental assays of the frequency of lysogeny (Laybourn-Parry et al. 2007; Payet and Suttle 2013; Hurwitz et al. 2014). A sharper focus on quantitative rates of lysis and alternative cell fates will be required to better understand cross infection.
Finally, the study of any complex network requires some consideration of the effect of sampling. In the study of food webs and mutualistic networks, the issue of sampling arises because not all links are measurable (Bersier et al. 1999; Martinez et al. 1999; Poisot et al. 2012). For example, a predator may not eat a particular prey while under observation, or the prey may not be identifiable from gut content analysis. In the case of virus-host networks, another type of sampling issue arises: the majority of viruses and hosts in the surface ocean cannot yet be cultured. This implies a sampling challenge of nodes and associated links in the network. Methods to directly assess virus-host interactions, including viral tagging (Deng et al. 2014) and sequencing of polonies, will be part of any effort to quantify infection networks among both culturable and nonculturable viruses and hosts.
• Viruses are extremely abundant in the oceans, with estimates of virus-like particle densities ranging from approximately 10^4 to 10^8/ml.
• Total virus abundance correlates with prokaryotic abundance and other abiotic proxies, including chlorophyll.
• Virus abundance is estimated to be at its highest in coastal environments, during blooms, and in sediments.
• Viral diversity remains elusive. Those features of viral diversity that are estimable include Shannon and Simpson diversity and should be utilized instead of attempting to estimate the total number of virus “species” in the community based on measurements of a small subsample.
• Viral diversity includes genotypic, genetic, and functional diversity.
• It is evident that individual viruses infect more than one host type, and individual hosts are infected by more than one virus.
• The cross-infection networks in natural systems include evidence of specialization, as measured by modularity, and hierarchical order, as measured by nestedness.
• The basis for emergent infection networks remains unresolved.
• A full accounting of viral diversity, whether genetic or functional, requires improvements in culture-independent methods to probe virus-host interactions.
(1) Thanks to Steven Wilhelm for suggesting this turn of phrase.
(2) Claude Shannon, after whom the Shannon diversity is named, is considered the father of information theory. This field originated largely from his pioneering master’s thesis. It is less well known that Shannon’s PhD dissertation was on the subject of theoretical population genetics.
(3) Black swans, of the species Cygnus atratus, do exist and are endemic to Australia and New Zealand.