Finally, genetic correlation estimates based on full sib pairs alone, in which most pairings are not intergenerational, are shown in SI Appendix, Figs. 8–10 as well as Dataset S2 and were generally consistent with analyses based on all family relations.


In this study, we present the Danish genealogy constructed from the Danish Civil Registration System, which holds information on all individuals born or with residence in Denmark since 1968. The genealogy extends back up to six generations, with the oldest connected individuals being born in 1872 and the youngest in 2017. We partitioned 6,691,426 Danish citizens into eight demographic cohorts based on year of birth. Notably, by cross-linking the Danish Civil Registration System with hospital discharge diagnoses from the public and egalitarian Danish health care system, we were able to estimate heritability and genetic correlations for 10 broad diagnostic categories encompassing all major organ systems and most ICD-8/ICD-10 codes while describing the epidemiological biases of truncation and right censoring in the oldest and youngest demographic cohorts, respectively.

The heritability of single diseases and genetic correlations between them have been studied extensively not only in family data but also thanks to the development and application of genome-wide association studies to thousands of human traits (17). In a few instances (e.g., for mental disorders), genetic risk variants shared across diagnoses with clearly distinct clinical characteristics and age of onset have been identified (18). However, neither the heritability nor the genetic correlations have been systematically studied for organ-defined disease categories as grouped by 10 chapters of ICD-10. In addition, such studies have never been carried out within a single population such as the Danish, serviced and monitored uniformly for decades by an egalitarian health care system.

We estimate the heritability to be high for several of the 10 disease categories. This is consistent with high genetic correlation between individual diagnoses within each category as reported for mental disorders (18) and more broadly for brain disorders (19) as well as with the broader notion that genetic liability is generally organ specific. For mental conditions in particular, heritability point estimates reach 0.7, which is higher than reported for the common and less heritable mental disorders such as depression (0.4) (20) and anxiety (0.3∼0.4) (21) and similar to those for highly heritable, rare illnesses such as schizophrenia (0.81) (22) and bipolar disorder (0.6∼0.8) (23).

Moreover, the lower heritability estimates in older cohorts and the higher heritability estimates in younger cohorts might be because disease risk is generally plateauing in the former, whereas accumulation of diagnoses in the latter is an ongoing process, interrupted by right censoring. Younger cohorts are therefore enriched for younger ages of onset, which in many instances go along with stronger genetic signals and higher heritability estimates as known for mental disorders in which early onset disorders such as autism and attention-deficit/hyperactivity disorder are commonplace. It could also be posited that the accumulation of environmental exposures throughout life renders nongenetic factors more important in aging-related conditions, thus resulting in overall lower heritability estimates in older cohorts. On the other hand, stronger genetic correlations in older cohorts might be due to the accumulation of comorbidities in older cohorts compared to younger cohorts.

The fact that genetic correlations were almost exclusively positive across all cohorts probably reflects how diseases, at least in the broad composite definitions we use in this work, are problems of the normal functioning of organs and systems, whereby the disorganization of one or more of them should be detrimental for others, ultimately resulting in further pathology. The positive genetic correlations match comorbidity observations in the clinical domain.

Notably, we observe that the rankings of heritability and average genetic correlation estimates are comparable for most of the 10 diagnostic groups, although there are also marked exceptions. Mental, gastrointestinal, and circulatory conditions rank high both for heritability and average genetic correlation, whereas neurological conditions, despite showing the lowest heritability estimates, are genetically highly correlated with the other diagnostic groups, implicating broadly the etiology of disease affecting the nervous system in disorders of most other organ systems. Conversely, other low-heritability groups, such as cancer and musculoskeletal conditions, have low genetic correlations suggestive of their etiologies being dominated by disease-specific, environmental exposures and somatic mutations for the former and accidents for the latter. Similarly, endocrine conditions, dominated by type 2 diabetes, have relatively low heritability, possibly reflecting behavioral causes.

While circulatory and gastrointestinal conditions are the most heritable and genetically correlated diagnostic categories, their patterns of genetic correlation with other diagnostic categories are nonetheless highly diverse. In fact, gastrointestinal conditions were clustered with neurological and mental disorders, and while the clustering of the two latter disease categories of the nervous system could be anticipated and possibly reflects organ-specific components of their heritability, their proximity to gastrointestinal conditions is notable and may stem from the extensive innervation that underlies the gut–brain axis and the proposed relevance of gut microbiota for brain functioning and mental health (24). In contrast to the proximal clustering of brain and gut disorders, which reflects shared organ specificity or functionalities, that of endocrine with circulatory conditions as well as that of cancers with hematological illnesses more likely reflects sequelae in which one illness is a consequence or complication of a prior and otherwise unrelated condition, in this case diabetes leading to circulatory complications and cancers leading to anemia because of bleeding from internal organs.

Although the reconstructed Danish genealogy is limited to six generations and thus dates back in time considerably less than the genealogy of Iceland (25), we note that most diagnostic categories include between a quarter- and a-half-million individuals, making this genealogy dataset highly apt for studies of heritability, genetic correlations, and the impact of behavioral and environmental changes over time. The relative shallowness of the reconstructed Danish genealogy, compared with, for instance, the much deeper Icelandic pedigree dating back to the 11th century (25), renders linking distant relatives a challenging task and supports the use of classical relative pair-based methods rather than linear mixed models. Furthermore, truncation and censoring biases in the oldest and youngest cohorts, as well as changes in the environment and clinical practices over time, favor the use of horizontal over vertical familial relationships and justify the stratification of the analysis by demographic cohort as opposed to a single analysis across the entire genealogy.

While this dataset is ideally poised for quantitative genetic analyses, it also presents limitations. As already discussed, truncation in the older demographic cohorts and right censoring in the younger ones can introduce bias to heritability and genetic correlation estimates. In order to explore the effects and biases of time, we split the available data into eight demographic cohorts and show that the four midmost cohorts—that is, the ones least affected by truncation and censoring—yield consistent estimates.

In addition, given the lack of genetic data, we have no means to safeguard our analysis from false paternities and adoptions. As a result, a small portion of the ascertained familial relationships may be overstated, affecting our heritability and genetic correlation estimates. Nevertheless, given the high abundance of relative pairs, we believe that the effect of these biases is limited. Similarly, the lack of parental links before the timeframe of the registries will lead to understating distant familial relationships, which could bias heritability estimates based on frameworks that utilize the entire relationship matrix such as linear mixed models. However, because our estimates are based on known family pairs, we believe that issues coming from an underestimation of familial relationships are limited in our analysis.

Furthermore, modifications in the diagnostic classification system, which changed from ICD-8 to ICD-10 in 1995, and the registration of outpatient contacts that began in the same year (9) may complicate precise tracking across demographic cohorts, although the focus on broad diagnostic categories in this study is expected to reduce this bias.

Finally, our analyses make no attempt to distinguish a priori between genetic correlation resulting from pleiotropy and co-occurrence of disease in relatives because of sequelae as discussed in the seventh paragraph of Discussion for cancers and anemia.

For mental disorders, the relatively high frequency in the younger cohorts coincides with the introduction of novel child and adolescent disorders in ICD-10, namely attention-deficit/hyperactivity disorder and autism. Similarly, pulmonary conditions show increasing frequencies in younger generations consistent with increasing worldwide prevalence of smoking and asthma in young people (26). While potentially biasing our findings, changes in disease frequency across time also open an entirely novel research field for the identification of nongenetic factors that influence risk of disease independently or through gene-environment interactions. In fact, as the habit of smoking spread during the middle of the 20th century (26) and the prevalence of pulmonary and circulatory conditions increased correspondingly, the heritability is expected to decrease; thus, modeling a shared environment in households will allow for studies seeking to identify nongenetic factors that impact disease. Such analyses can be empowered by the knowledge of geographical (co)location of the residence of Danish citizens from cradle to grave as a proxy for shared environment.

In conclusion, here we presented the Danish genealogy as a resource that, in combination with the National Health Registers, allows whole-population, quantitative genetic analysis with applications to health sciences. The presented resource and analytical framework will contribute to the advancement of precision medicine, allowing the systematic mapping of heritabilities and genetic correlations of comorbidity patterns and sub-diagnostic traits such as age of onset and treatment response and to inform on clinically relevant phenomena such as assortative mating, nonadditive genetics, and shared environment. While this and similar genealogies from the Nordic countries represent unique resources (14), the changes in biases, environment, and clinical practices necessitate the integration of time-dependent and survival analysis frameworks. Explicit modeling of biases is warranted to fully exploit the oldest and youngest generations.

Materials and Methods

Danish Civil Registration System.

The Danish Civil Registration System was established in 1968, registering all people alive and living in Denmark since then (7, 8). The Danish Civil Registration System includes personal identification number, sex, date of birth, and continuously updated information on vital status (e.g., migration or death). The personal identification number is virtually immutable, thus enabling accurate links across different registers. As of April 2017, the system contained 9,851,330 individuals born between January 1, 1858, and April 21, 2017.

Danish National Patient Register.

The Danish National Patient Register (6) includes the medical records of all patients treated in Danish general hospital inpatient departments since January 1, 1977, as well as in outpatient clinics since January 1, 1994 (or occasionally since 1995). Since 2002, the Register also includes Danish patients treated in hospitals outside Denmark and activities in specialist medical practices not paid by the health insurance agreement. As of April 2017, the register contained 287,593,154 records with diagnostic information for the 135,070,194 patient contacts available in the dataset.

Danish Psychiatric Central Research Register.

The Danish Psychiatric Central Research Register (5) was first computerized in 1969 and includes admissions to psychiatric inpatient facilities up to and including 1994. Since 1995, the Register also contains outpatient contacts with psychiatric departments. As of April 2017, the register contained 7,298,910 records with diagnostic information for the 4,826,984 psychiatric hospital contacts.

Ethics Approval.

This study was approved by the Danish Health Data Authority (project no. FSEID-00003339) and the Danish Data Protection Agency. By Danish law, informed consent is not required for register-based studies.

Data Cleaning.

The most important requirement for accurately establishing pairwise familial relationships is that any given individual has either no register links to their parents—that is, they are a founder—or both register links to their parents. This is to guarantee that familial relationships are not underestimated (e.g., incorrectly ascertaining half siblings instead of full siblings). Bearing this in mind, the 2017 Danish Civil Registration System data freeze includes 1) 198,892 individuals with only one parental link, 2) five individuals with two identical parental personal identification numbers, 3) 880 individuals that are adopted, 4) 3,000 individuals with same-sex parents, and 5) 123,331 individuals belonging to twin pairs/multiple births. There is overlap in the above five categories. In order to yield as many pairwise relationships as possible, instead of eliminating the aforementioned individuals, we converted them into founders—that is, we eliminated their parental links. Thus, if said individuals have descendants that meet our two-parent criterion, we can include their pairwise familial relationships in our analyses.
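The founder conversion described above can be illustrated with a minimal Python sketch. The record layout (a dictionary mapping each person to a father and mother identifier) and the `excluded_ids` argument, which stands in for individuals flagged as adopted, having same-sex parents, or belonging to multiple births, are assumptions made for illustration and do not reflect the actual register schema.

```python
def clean_parental_links(records, excluded_ids=frozenset()):
    """Return {person: (father, mother)} in which every person has either two
    distinct parental links or none (i.e., is treated as a founder).

    records      -- hypothetical layout: {person: (father_id, mother_id)}
    excluded_ids -- hypothetical set of persons whose links are dropped
    """
    cleaned = {}
    for person, (father, mother) in records.items():
        valid = (
            father is not None
            and mother is not None
            and father != mother          # two identical parental identifiers
            and person not in excluded_ids
        )
        # Invalid links are removed rather than removing the individual.
        cleaned[person] = (father, mother) if valid else (None, None)
    return cleaned


toy_records = {
    "a": ("f", "m"),     # two valid links: kept
    "b": ("f", None),    # only one parental link: becomes a founder
    "c": ("p", "p"),     # identical parental identifiers: becomes a founder
    "d": ("f", "m"),     # flagged externally: becomes a founder
}
cleaned = clean_parental_links(toy_records, excluded_ids={"d"})
```

Because the individuals stay in the data, any of their descendants with two valid parental links still contribute relative pairs.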


Genealogies can be analyzed as graphs—that is, a set of nodes (individuals) that are joined by edges representing parent–offspring relationships (27). Bearing this in mind, we used the networkx module in Python (28) to explore network connectivity in our data.

After eliminating invalid parental links, we converted the data into an edge list and loaded it as an undirected graph. Each edge in the graph represents a parent–offspring relationship between two nodes. If the parents of an individual are known, then two edges are added to the list (one for each parent). If no parental information is available—that is, in the case of founders—no edge is added to the list. Individuals can be entirely unconnected (singletons)—that is, they present no parental or offspring links.

The list consisted of 8,848,128 edges involving 6,801,107 individuals—that is, ∼69.04% of all available individuals in the Register. The remaining 3,050,223 individuals (30.96%) were singletons. A bit over half of those singletons (1,753,057 or 57.47%) were born in Denmark or Greenland, whereas the rest were born elsewhere. The distribution of the singletons by demographic cohort is shown in SI Appendix, Fig. 11. Overall, singletons born in Denmark belong to older demographic cohorts and represent childless individuals with no parental links, whereas singletons born outside of Denmark belong to younger demographic cohorts and represent immigrants without familial links in Denmark.

networkx computes the number and size of components—that is, the network subsets that are completely unconnected from all other subsets. This process returned one large component (n = 5,396,661) and 251,513 significantly smaller ones (n = 1,404,446), among which there were 100,400 trios and 58,804 quartets (Fig. 1 A and B). The single largest network component includes 54.78% of all registered individuals and 79.35% of the individuals with at least one relative (Fig. 1A). The overwhelming majority of the connected individuals (88.47%) were born in Denmark or Greenland. The distribution of the connected individuals by demographic cohort is shown in SI Appendix, Fig. 11.
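As a rough illustration of the edge list and component steps, the sketch below builds parent–offspring edges and finds connected components with a breadth-first search. It uses only the Python standard library as a stand-in for the networkx routines used in the study; the function names and the toy pedigree are hypothetical.

```python
from collections import defaultdict, deque


def parent_offspring_edges(pedigree):
    """pedigree: {person: (father, mother)}; founders map to (None, None)."""
    edges = []
    for child, (father, mother) in pedigree.items():
        if father is not None and mother is not None:
            edges.append((father, child))   # one edge per known parent
            edges.append((mother, child))
    return edges


def connected_components(nodes, edges):
    """Return the components of the undirected graph, largest first."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Toy pedigree: one trio plus one unconnected individual (a singleton).
toy_pedigree = {
    "f": (None, None),
    "m": (None, None),
    "c": ("f", "m"),
    "x": (None, None),
}
toy_edges = parent_offspring_edges(toy_pedigree)
toy_components = connected_components(list(toy_pedigree), toy_edges)
singletons = [comp for comp in toy_components if len(comp) == 1]
```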

Graph topology also indicated that the 6,801,107 connected individuals span only six generations; of those individuals, 2,377,043 (∼34.95%) are founders—that is, they have no parental links. The narrow generation span combined with the high number of founders has implications in the ascertainment of distant relative pairs.
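Founder counts and the generation span can be read off the directed pedigree in a few lines. The following is a minimal sketch under the assumption that the pedigree is stored as a dictionary from each person to a (father, mother) pair; the helper name `generation_depths` is hypothetical.

```python
def generation_depths(pedigree):
    """Depth 0 for founders; otherwise 1 + the deepest parent."""
    depths = {}

    def depth_of(person):
        if person not in depths:
            parents = [p for p in pedigree.get(person, (None, None)) if p is not None]
            depths[person] = 1 + max(map(depth_of, parents)) if parents else 0
        return depths[person]

    for person in pedigree:
        depth_of(person)
    return depths


toy_pedigree = {
    "g1": (None, None), "g2": (None, None), "s": (None, None),
    "p": ("g1", "g2"),
    "c": ("p", "s"),
}
founders = [who for who, (f, m) in toy_pedigree.items() if f is None and m is None]
depths = generation_depths(toy_pedigree)
generation_span = max(depths.values()) + 1   # number of generations covered
```

The recursion is safe here because the pedigree is shallow; a very deep pedigree would call for an iterative traversal instead.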


We used the pydigree module in Python (29) in order to estimate all nonzero pairwise coefficients of expected relatedness for the 6,801,107 connected individuals. pydigree reads a file in pedigree (PED) format as a directed acyclic graph and enumerates all legitimate paths connecting a given pair of individuals. From any given starting point, only paths toward previous generations are allowed as well as one change of direction at most. The lengths of the paths connecting a pair of individuals are used to estimate their kinship coefficient φ (30, 31). We note that the coefficient of expected relatedness is twice the kinship coefficient φ.
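The path-based calculation can be illustrated with the textbook path-counting rule, in which each path of l meioses through a common ancestor contributes (1/2)^(l+1) to φ. The sketch below is not pydigree's implementation. It assumes that common ancestors are not inbred, and the function names and toy pedigree are hypothetical.

```python
def upward_paths(pedigree, person):
    """All paths from `person` to itself and to each of its ancestors."""
    paths = [[person]]
    for parent in pedigree.get(person, (None, None)):
        if parent is not None:
            paths.extend([person] + tail for tail in upward_paths(pedigree, parent))
    return paths


def kinship(pedigree, a, b):
    """Kinship coefficient by path counting (assumes non-inbred ancestors)."""
    phi = 0.0
    for up_a in upward_paths(pedigree, a):
        for up_b in upward_paths(pedigree, b):
            if up_a[-1] != up_b[-1]:
                continue                      # paths must meet at one ancestor
            if set(up_a[:-1]) & set(up_b[:-1]):
                continue                      # no individual may appear twice
            meioses = (len(up_a) - 1) + (len(up_b) - 1)
            phi += 0.5 ** (meioses + 1)
    return phi


toy_pedigree = {
    "G1": (None, None), "G2": (None, None),
    "S1": (None, None), "S2": (None, None),
    "P1": ("G1", "G2"), "P2": ("G1", "G2"),   # full siblings
    "C1": ("P1", "S1"), "C2": ("P2", "S2"),   # first cousins
}
phi_siblings = kinship(toy_pedigree, "P1", "P2")
phi_cousins = kinship(toy_pedigree, "C1", "C2")
relatedness_siblings = 2 * phi_siblings       # expected relatedness is 2 * phi
```

Each path goes up from one individual to a common ancestor and down to the other, which corresponds to allowing at most one change of direction.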

To avoid looping over unconnected individuals, we applied the procedure only to each of the 2,377,043 founders with their corresponding descendants (easily identified with pydigree). Because different founders can share descendants, we removed duplicate estimates with a Python script. Kinship coefficients for unreported pairs were assumed to be 0.

As a result of the above procedure, we obtained 44,099,130 pairs of familial relationships from the large pedigree and 4,522,710 from the rest of the smaller pedigrees, totaling 48,621,840.

Apart from the value of the coefficient of expected relatedness for any given pair of individuals, we registered the number of all possible connecting paths and their corresponding lengths as well as the node depth of each individual in the path. Combined with the relatedness coefficient, this additional topological information allowed us to annotate the familial relationships with great accuracy (SI Appendix, Table 5).

The distribution of the number of relatives per individual is heavily right skewed with a long tail (mean = 12.3; median = 9; mode = 6; Fig. 1C). Moreover, the distribution of the number of meioses between connected individuals, considering the shortest path per pair, is also right skewed with mean = 2.7, median = 3, and mode = 2. This implies that the ascertained relative pairs in the Danish genealogy are dominated by close familial relationships.

Only a negligible fraction (0.03%) of the annotated familial relationships were connected by more than two paths, consistent with very few consanguineous relationships or marriage loops in the population, and these pairs were discarded from further analyses.


In this work, we focused on 10 broad diagnostic categories that correspond to the definitions used in a recent publication (15). These were conditions of the 1) circulatory system, 2) endocrine system, 3) pulmonary system including allergies, 4) gastrointestinal system, 5) urogenital system, 6) musculoskeletal system, 7) hematological system, and 8) neurological system as well as 9) cancers and 10) mental conditions. Each of these broad diagnostic categories is a composite measure of presence or absence of any disease falling within the specific diagnostic category (SI Appendix, Table 1).

In general, if an individual has an in- or outpatient hospital admission or contact for one of the above medical conditions in the Danish National Patient Register and/or the Danish Psychiatric Central Research Register, we ascertain said individual as a case for said condition, regardless of contact frequency or comorbidities; that is, diagnostic categories were not mutually exclusive. We considered both ICD-8 and ICD-10 codes for the ascertainment of a given phenotype, even though it is important to note that there is not always a 1-to-1 correspondence between the two coding systems. Only diagnoses coded as “main” or “auxiliary” were considered for the phenotyping (as opposed to “basic,” “referral,” “temporary,” and “complication”).

In general, this study considered all diagnoses assigned in relation to an in- or outpatient hospital admission or contact as recorded systematically in the Danish National Patient Register and/or the Danish Psychiatric Central Research Register.

Individuals with no entries for a given condition were treated as controls for said condition. However, this strategy is vulnerable to truncation and censoring biases because health records are not quantitatively or qualitatively homogeneous across demographic cohorts. To minimize the risk of including too many false controls in the control group, we only studied individuals who were alive and living in Denmark after January 1, 1977 (date on which the Danish National Patient Register was established) or born in the interval (January 1, 1977, to January 1, 2017). As a result, we ended up with a subset of 6,691,426 individuals for all our quantitative analyses.
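The case and control ascertainment rule can be sketched as follows. The contact tuple layout and the `category_of` lookup are hypothetical. The toy mapping by ICD-10 chapter letter is for illustration only and is not the category definition of SI Appendix, Table 1.

```python
VALID_DIAGNOSIS_TYPES = {"main", "auxiliary"}


def ascertain_cases(contacts, category_of):
    """Collect cases per diagnostic category.

    contacts    -- iterable of (person_id, code, diagnosis_type) tuples
    category_of -- callable mapping a code to a category name or None
    """
    cases = {}
    for person, code, diagnosis_type in contacts:
        if diagnosis_type not in VALID_DIAGNOSIS_TYPES:
            continue                        # e.g., "referral" or "temporary"
        category = category_of(code)
        if category is not None:
            # Categories are not mutually exclusive: a person can be a case
            # in several of them, and repeated contacts do not matter.
            cases.setdefault(category, set()).add(person)
    return cases


def phenotype(person, category, cases):
    """1 for a case; 0 for a control (no qualifying entry in the category)."""
    return 1 if person in cases.get(category, set()) else 0


def toy_category(code):
    # Toy lookup by ICD-10 chapter letter; purely illustrative.
    return {"F": "mental", "I": "circulatory"}.get(code[0])


toy_contacts = [
    (1, "F20", "main"),
    (1, "I21", "auxiliary"),
    (2, "I21", "referral"),    # ignored: not a main or auxiliary diagnosis
    (3, "X99", "main"),        # ignored: outside the toy categories
]
toy_cases = ascertain_cases(toy_contacts, toy_category)
```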

Heritability and Genetic Correlations.

We used a classical approach for the estimation of total narrow-sense heritability and genetic correlations (16). For phenotype x, and given a familial relationship R (e.g., parent–offspring, full siblings, etc.), if $\rho_R$ is the correlation coefficient between two paired variables (x1, x2) holding the phenotypic observations for pairs of related individuals, the corresponding heritability is:

$$h^2_R = \frac{\rho_R}{2\varphi_R},$$

where $\varphi_R$ is the kinship coefficient of relationship R.
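Similarly, the genetic correlation between phenotypes x and y, for a given familial relationship R, is obtained by scaling the between-phenotype correlations among relatives by the corresponding within-phenotype correlations.
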
Because disease phenotypes are binary—that is, case control—we applied the latent correlation coefficient (also known as tetrachoric correlation coefficient), which measures agreement between two raters. In its simplest form, latent trait modeling assumes that the observed binary variables result from the discretization (at a given threshold) of unobserved (latent) variables that are normally distributed. The correspondence to the liability threshold model (32, 33) is apparent. In the case-control context, raters are vectors of binary phenotypes corresponding either to within- [(x1, x2) and (y1, y2)] or between-phenotype [(x1, y2) and (y1, x2)] paired data. We note that one rater corresponds to the genealogically older member of a familial relationship (e.g., father), whereas the other rater corresponds to the genealogically younger one (e.g., daughter). In the case of “genealogically contemporary” relationships such as siblings or cousins, relatives in the raters are sorted by age.

For the estimation of latent correlation coefficients, we used a standard maximum likelihood procedure from the polycor package in R.
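To make the latent correlation concrete, the sketch below estimates a tetrachoric correlation from a 2 × 2 table of relative pairs. It is a standard-library stand-in for illustration, not the polycor routine: thresholds are taken from the marginal prevalences, and the correlation is found by bisection so that the model reproduces the observed proportion of concordant case pairs, which for a single 2 × 2 table should coincide with the maximum likelihood solution. The function names, the integration settings, and the final heritability helper (which assumes the correlation is divided by twice the kinship coefficient) are assumptions.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def upper_orthant(a, b, rho, steps=1000):
    """P(X > a, Y > b) for a standard bivariate normal, by Simpson's rule."""
    upper = 9.0                               # density beyond 9 SD is negligible
    scale = sqrt(1.0 - rho * rho)

    def integrand(x):
        # Density of X times P(Y > b | X = x).
        return STD_NORMAL.pdf(x) * (1.0 - STD_NORMAL.cdf((b - rho * x) / scale))

    width = (upper - a) / steps
    total = integrand(a) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(a + k * width)
    return total * width / 3.0


def tetrachoric(n11, n10, n01, n00, iterations=50):
    """n11: both relatives are cases; n10/n01: one is; n00: neither is.

    Both marginal prevalences must lie strictly between 0 and 1.
    """
    n = n11 + n10 + n01 + n00
    a = STD_NORMAL.inv_cdf(1.0 - (n11 + n10) / n)   # liability threshold, rater 1
    b = STD_NORMAL.inv_cdf(1.0 - (n11 + n01) / n)   # liability threshold, rater 2
    target = n11 / n
    low, high = -0.999, 0.999
    for _ in range(iterations):                     # orthant probability rises with rho
        mid = (low + high) / 2.0
        if upper_orthant(a, b, mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def heritability(rho, phi):
    """Assumed classical scaling: correlation divided by twice the kinship."""
    return rho / (2.0 * phi)


rho_example = tetrachoric(200, 100, 100, 200)
```

With 50% prevalence in both raters, the concordance probability has the closed form 1/4 + arcsin(ρ)/(2π), which gives a convenient check of the numerical routine.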

In the case of heritability, valid estimates were those 1) with a positive value and 2) significantly different from zero. Moreover, when heritability estimates from multiple familial relationships were available, we combined them by computing their weighted average and weighted SE.

We computed average heritability values ($h^2$) and SE ($s$) weighted by sampling variance:

$$h^2 = \frac{\sum_{i=1}^{n} h_i^2 / s_i^2}{\sum_{i=1}^{n} 1 / s_i^2}, \qquad s = \sqrt{\frac{1}{\sum_{i=1}^{n} 1 / s_i^2}}.$$
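Inverse-variance weighting of this kind can be written in a few lines. The sketch below assumes that each relationship-specific estimate comes with its own SE; the function name is hypothetical.

```python
from math import sqrt


def inverse_variance_average(estimates, standard_errors):
    """Combine estimates with weights 1 / SE^2; return (mean, combined SE)."""
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    mean = sum(w * h for w, h in zip(weights, estimates)) / total_weight
    return mean, sqrt(1.0 / total_weight)


# Two equally precise estimates average to their midpoint, and the
# combined SE shrinks by a factor of sqrt(2).
combined_h2, combined_se = inverse_variance_average([0.4, 0.6], [0.1, 0.1])
```

More precise estimates, which typically come from the most abundant relative pairs, dominate the combined value.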

We also used weighted least squares to estimate the slope β (corresponding to h²) of the model


where R is a vector of correlation coefficients, Φ is a vector of kinship coefficients, μ is the intercept vector, and ε is the error vector with σ²(ε) = W⁻¹. W is a diagonal matrix of weights used in the regression.
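A weighted least-squares fit with an intercept and a single slope can be sketched as below. Two assumptions are made here and are not taken from the text: the regressor is the expected relatedness 2Φ, so that the slope can be read directly as h², and the weights are inverse sampling variances. The function name and the toy numbers are hypothetical.

```python
def weighted_least_squares(x, y, weights):
    """Return (intercept, slope) minimizing sum of w * (y - mu - beta * x)^2."""
    total = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / total
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / total
    s_xy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    s_xx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Toy data: correlations generated exactly as 0.02 + 0.6 * (2 * phi).
kinship_values = [0.25, 0.125, 0.0625]            # e.g., sibs, half sibs, cousins
relatedness = [2.0 * phi for phi in kinship_values]
correlations = [0.02 + 0.6 * r for r in relatedness]
toy_weights = [1.0 / se ** 2 for se in (0.01, 0.02, 0.04)]
mu_hat, beta_hat = weighted_least_squares(relatedness, correlations, toy_weights)
```

A nonzero intercept absorbs resemblance that does not scale with relatedness, which is one reason to fit it instead of forcing the line through the origin.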

We carried out the analysis for all available pairs with no regard to sex. For estimates from horizontal familial relationships—that is, siblings and cousins—both individuals had to be from the same generation. For estimates from the rest of the relationships, only relatives from previous generations were considered. We did not consider heritability estimates when the correlation coefficient was negative or when the CIs fell outside [0, 1].

In the case of genetic correlations, valid estimates were those whose 95% CIs were contained within [−1, 1]. When genetic correlation estimates from multiple familial relationships were available, we combined them by computing their weighted average and weighted SE as above.

We note that estimates of heritability and genetic correlations depend on the definitions of the traits under study and that heritability of broadly defined traits will also reflect genetic correlations between the narrowly defined traits included in each broad trait category.
