Presentation rubric

Grader Name__________________ Presentation week & group _______________

Cut —————————————————————————————————————

Instructions / Guidelines

You will choose a group of 3-4 peers with whom you will study one of the seminal topics we cover in this course. Your job, as a group, is to research the literature and distill this material into a 20-minute PowerPoint presentation, with all group members presenting equally, delivered to the class during the discussion period devoted to that topic. Your grade for this assignment will be the average of the instructor’s grade and the peer assessment from the rest of the class, both based on the rubric below. Use this rubric to guide your research. For example, the last line states “Competing explanations or theories considered and dealt with”; this means your research should include alternative theories and the evidence for or against them. Your last slide should list the references you used; make sure there are at least 3 good references on the topic (the more the better). Your instructor will provide an example presentation on neutral theory for you to grade, to get the ball rolling.

Rubric                 Poor … Excellent

Presentation Skills    1   2   3   4   5

Main ideas presented in logical and clear manner?

Presentation filled the time allotted?

Slides helpful to audience?

Did the presentation maintain interest?

Were there clear take-home messages?

Were the presenters responsive to audience questions?

Knowledge base         1   2   3   4   5

Appropriate background given?

Selected material appropriate to the topic?

Enough essential information given to allow effective evaluation?

Was irrelevant information (“busy work”) excluded?

Did the presenters understand the material?

Critical thinking      1   2   3   4   5

Main issues clearly identified?

Both theoretical positions and empirical evidence presented?

Strengths and weaknesses of theory / data adequately explained?

Recommendations for future work included?

Conclusions followed from material presented?

Competing explanations or theories considered and dealt with?

Overall Impression ________/85

Comments

Vol 444 | 23 November 2006 | doi:10.1038/nature05329

ARTICLES

Global variation in copy number in the human genome

Richard Redon1, Shumpei Ishikawa2,3, Karen R. Fitch4, Lars Feuk5,6, George H. Perry7, T. Daniel Andrews1, Heike Fiegler1, Michael H. Shapero4, Andrew R. Carson5,6, Wenwei Chen4, Eun Kyung Cho7, Stephanie Dallaire7, Jennifer L. Freeman7, Juan R. González8, Mònica Gratacòs8, Jing Huang4, Dimitrios Kalaitzopoulos1, Daisuke Komura3, Jeffrey R. MacDonald5, Christian R. Marshall5,6, Rui Mei4, Lyndal Montgomery1, Kunihiro Nishimura2, Kohji Okamura5,6, Fan Shen4, Martin J. Somerville9, Joelle Tchinda7, Armand Valsesia1, Cara Woodwark1, Fengtang Yang1, Junjun Zhang5, Tatiana Zerjal1, Jane Zhang4, Lluis Armengol8, Donald F. Conrad10, Xavier Estivill8,11, Chris Tyler-Smith1, Nigel P. Carter1, Hiroyuki Aburatani2,12, Charles Lee7,13, Keith W. Jones4, Stephen W. Scherer5,6 & Matthew E. Hurles1
Copy number variation (CNV) of DNA sequences is functionally significant but has yet to be fully ascertained. We have
constructed a first-generation CNV map of the human genome through the study of 270 individuals from four populations
with ancestry in Europe, Africa or Asia (the HapMap collection). DNA from these individuals was screened for CNV using two
complementary technologies: single-nucleotide polymorphism (SNP) genotyping arrays, and clone-based comparative
genomic hybridization. A total of 1,447 copy number variable regions (CNVRs), which can encompass overlapping or
adjacent gains or losses, covering 360 megabases (12% of the genome) were identified in these populations. These CNVRs
contained hundreds of genes, disease loci, functional elements and segmental duplications. Notably, the CNVRs
encompassed more nucleotide content per genome than SNPs, underscoring the importance of CNV in genetic diversity and
evolution. The data obtained delineate linkage disequilibrium patterns for many CNVs, and reveal marked variation in copy
number among populations. We also demonstrate the utility of this resource for genetic disease studies.

Genetic variation in the human genome takes many forms, ranging from large, microscopically visible chromosome anomalies to single-nucleotide changes. Recently, multiple studies have discovered an abundance of submicroscopic copy number variation of DNA segments ranging from kilobases (kb) to megabases (Mb) in size1–8. Deletions, insertions, duplications and complex multi-site variants9, collectively termed copy number variations (CNVs) or copy number polymorphisms (CNPs), are found in all humans10 and other mammals examined11. We defined a CNV as a DNA segment that is 1 kb or larger and present at variable copy number in comparison with a reference genome10. A CNV can be simple in structure, such as tandem duplication, or may involve complex gains or losses of homologous sequences at multiple sites in the genome (Supplementary Fig. 1).

An early association of CNV with a phenotype was described 70 yr ago, with the duplication of the Bar gene in Drosophila melanogaster being shown to cause the Bar eye phenotype12. CNVs influence gene expression, phenotypic variation and adaptation by disrupting genes and altering gene dosage7,13–15, and can cause disease, as in microdeletion or microduplication disorders16–18, or confer risk to complex disease traits such as HIV-1 infection and glomerulonephritis19,20. CNVs often represent an appreciable minority of causative alleles at genes at which other types of mutation are strongly associated with specific diseases: CHARGE syndrome21 and Parkinson’s and Alzheimer’s disease22,23. Furthermore, CNVs can influence gene expression indirectly through position effects, predispose to deleterious genetic changes, or provide substrates for chromosomal change in evolution10,11,17,24.

In this study, we investigated genome-wide characteristics of CNV in four populations with different ancestry, and classified CNVs into different types according to their complexity and whether copies have been gained or lost (Supplementary Fig. 1). To maximize the utility of these data and the potential for integration of CNVs with SNPs for genetic studies, we performed experiments with the International HapMap DNA and cell-line collection25 derived from apparently healthy individuals. The result is the first comprehensive map of copy number variation in the human genome, which provides an important resource for studies of genome structure and human disease.

Two platforms for assessing genome-wide CNV

The HapMap collection comprises four populations: 30 parent–offspring trios of the Yoruba from Nigeria (YRI), 30 parent–offspring trios of European descent from Utah, USA (CEU), 45 unrelated

1The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK. 2Genome Science, and 3Dependable and High Performance Computing, Research Center for Advanced Science and Technology, University of Tokyo, 4-6-1 Komaba Meguro, Tokyo 153-8904, Japan. 4Affymetrix, Inc., Santa Clara, California 95051, USA. 5The Centre for Applied Genomics and Program in Genetics and Genomic Biology, The Hospital for Sick Children, MaRS Centre–East Tower, 101 College Street, Room 14-701, Toronto, Ontario M5G 1L7, Canada. 6Department of Molecular and Medical Genetics, Faculty of Medicine, University of Toronto M5S 1A8, Canada. 7Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts 02115, USA. 8Genes and Disease Program, Center for Genomic Regulation, Charles Darwin s/n, Barcelona Biomedical Research Park, 08003 Barcelona, Catalonia, Spain. 9Departments of Medical Genetics and Pediatrics, University of Alberta, Edmonton, Alberta T6G 2H7, Canada. 10Department of Human Genetics, University of Chicago, 920 East 58th Street, Chicago, Illinois 60637, USA. 11Pompeu Fabra University, Charles Darwin s/n, and National Genotyping Centre (CeGen), Passeig Marítim 37-49, Barcelona Biomedical Research Park, 08003 Barcelona, Catalonia, Spain. 12Japan Science and Technology Agency, Kawaguchi, Saitama 332-0012, Japan. 13Harvard Medical School, Boston, Massachusetts 02115, USA.

©2006 Nature Publishing Group

Japanese from Tokyo, Japan (JPT) and 45 unrelated Han Chinese
from Beijing, China (CHB). Genomic DNA from Epstein–Barr-
virus-transformed lymphoblastoid cell-lines was used.

Two technology platforms were used to assess CNV (Fig. 1): (1)
comparative analysis of hybridization intensities on Affymetrix
GeneChip Human Mapping 500K early access arrays (500K EA), in
which 474,642 SNPs were analysed; and (2) comparative genomic
hybridization with a Whole Genome TilePath (WGTP) array that
comprises 26,574 large-insert clones representing 93.7% of the
euchromatic portion of the human genome26.

Stringent quality control criteria were set for each platform and
experiments were repeated for 82 individuals on the WGTP and 15
individuals on the 500K EA platforms. The quality of the final data
sets was assessed by the standard deviation among log2 ratios of
autosomal probes (after normalization and filtering for cell-line arte-
facts), which for the WGTP platform was 0.047 (Supplementary
Fig. 2) and for the 500K EA platform was 0.220, both of which are
improvements on published data8,27.
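
As a rough illustration of this quality metric, the sketch below computes log2(test/reference) ratios for a handful of probes and takes their standard deviation. The function name `log2_ratio_sd`, the toy intensities and the choice of the population standard deviation are assumptions made for illustration; the normalization and cell-line artefact filtering described above are not reproduced.

```python
import math
import statistics

def log2_ratio_sd(test, reference):
    """Population standard deviation of log2(test/reference) across probes."""
    ratios = [math.log2(t / r) for t, r in zip(test, reference)]
    return statistics.pstdev(ratios)

# Toy autosomal probe intensities (hypothetical values, not from the study).
test = [1.00, 1.03, 0.97, 1.01, 0.99]
reference = [1.00, 1.00, 1.00, 1.00, 1.00]

sd = log2_ratio_sd(test, reference)
print(f"SD of log2 ratios: {sd:.3f}")
```

A lower value indicates a less noisy experiment, which is how the figures of 0.047 (WGTP) and 0.220 (500K EA) quoted above can be read.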

The different nature of the two data sets required the development of distinct algorithms to identify CNVs. In essence, these algorithms segment a continuous distribution of intensity ratios into discrete regions of CNV. To train the threshold parameters, we attempted to validate experimentally 203 CNVs that had been defined with varying degrees of confidence in two well-characterized genomes4,5,7 (NA10851 and NA15510). By performing technical replicate experiments on both platforms we assessed the proportion of CNV calls that were false positives for different algorithm parameters across a set of experiments representing the spectrum of data quality. The threshold parameters for both algorithms were set to achieve an average false-positive rate per experiment beneath 5% (Methods; see also Supplementary Methods, Supplementary Tables 1–4 and refs 26, 28).
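
The idea of segmenting a continuous series of ratios into discrete regions can be illustrated with a deliberately naive threshold caller. This is not either of the algorithms used in the study (those are described in the Supplementary Methods); the function `call_segments` and its `threshold` and `min_probes` parameters are hypothetical, and the log2 ratios are invented.

```python
def call_segments(log2_ratios, threshold, min_probes=2):
    """Return (start, end, state) runs of consecutive probes beyond the threshold.

    start and end are inclusive probe indices; state is "gain" or "loss".
    Runs shorter than min_probes are discarded.
    """
    def state_of(x):
        if x >= threshold:
            return "gain"
        if x <= -threshold:
            return "loss"
        return None

    segments = []
    start, current = None, None
    for i, x in enumerate(log2_ratios):
        s = state_of(x)
        if s != current:
            # Close the run that just ended, if it was a gain or loss.
            if current is not None and i - start >= min_probes:
                segments.append((start, i - 1, current))
            start, current = i, s
    # Close a run that extends to the last probe.
    if current is not None and len(log2_ratios) - start >= min_probes:
        segments.append((start, len(log2_ratios) - 1, current))
    return segments

ratios = [0.02, -0.01, 0.45, 0.52, 0.48, 0.03, -0.60, -0.55, 0.01, 0.40]
calls = call_segments(ratios, threshold=0.3)
print(calls)
```

Raising `threshold` or `min_probes` trades sensitivity for fewer false positives, which is the kind of trade-off the replicate experiments above were used to tune.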

Because all DNAs were derived from lymphoblastoid cell lines, we
differentiated somatic artefacts (such as culture-induced rearrange-
ments and aneuploidies) from germline CNVs. We karyotyped all
available 268 HapMap cell lines (Supplementary Table 5) and sought
evidence for chromosomal abnormalities in the WGTP and 500K EA
intensity data. We identified 30 cell lines with unusual chromosomal
constitutions (Supplementary Table 5 and Supplementary Fig. 3),
and removed the aberrant chromosomes from further analyses.
Chromosomes 9, 12 and X seemed to be particularly prone to tris-
omy. For a cell line with mosaic trisomy of chromosome 12, we
confirmed by array comparative genomic hybridization that this
trisomy was not apparent in blood DNA from the same individual
(Supplementary Fig. 4). Furthermore, we sought signals of somatic
deletions within the SNP genotypes of HapMap trios. A somatic
deletion in a parental genome manifests as a cluster of SNPs at which
alleles present in the offspring are not found in either parent5. We
assessed all of our preliminary CNV calls in 120 trio parents and
found that 17 (of 4,758) fell in genomic regions that harbour highly
significant clusters of HapMap Phase II SNP genotypes compatible
with a somatic deletion in a parental genome (Supplementary Table
5A, Supplementary Fig. 5 and Supplementary Note). These putative
cell-line artefacts were removed from further analyses. Extrapolating this analysis to the entire HapMap collection suggests that less than 0.5% of the deletions we observed were likely to have been somatic artefacts.

[Figure 1 image not reproduced. Panels: comparative genome hybridization on the Whole Genome TilePath array (reference and test DNA; combine dye-swaps) and comparative intensity analysis on the Affymetrix 500K early access SNP chip (NspI and StyI arrays; combine chips; compare samples). Plots: genome profile, chromosome profile (chromosome 8) and 10 Mb window, each with log2(test/reference) on the vertical axis.]

Figure 1 | Protocol outline for two CNV detection platforms. The experimental procedures for comparative genome hybridization on the WGTP array and comparative intensity analysis on the 500K EA platform are shown schematically (see Supplementary Methods for details), for a comparison of two male genomes (NA10851 and NA19007). The genome profile shows the log2 ratio of copy number in these two genomes chromosome-by-chromosome. The 500K EA data are smoothed over a five-probe window. Below the genome profiles are expanded plots of chromosome 8, and a 10-Mb window containing a large duplication in NA19007 identified on both platforms (indicated by the red bracket).
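
The trio-based signal for somatic deletions described above (offspring alleles found in neither parent, clustered along the genome) can be sketched as follows. The helper names and genotypes are hypothetical, and the longest-run summary stands in for the statistical test of cluster significance, which is described in the Supplementary Note and not reproduced here.

```python
def has_unexplained_allele(child, mother, father):
    """True if the child carries an allele seen in neither parent's genotype."""
    parental = set(mother) | set(father)
    return any(allele not in parental for allele in child)

def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best

# Hypothetical genotypes at six consecutive SNPs: (child, mother, father).
trio_snps = [
    ("AG", "AG", "GG"),
    ("CT", "CC", "CC"),
    ("AG", "GG", "GG"),
    ("CT", "TT", "TT"),
    ("AA", "AG", "AG"),
    ("GT", "GG", "GT"),
]

flags = [has_unexplained_allele(c, m, f) for c, m, f in trio_snps]
print(flags, longest_run(flags))
```

Isolated inconsistencies are more plausibly genotyping errors; it is the clustering of such SNPs within a called deletion that points to a somatic event in a parental cell line.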

The quality of resultant CNV calls was assessed in additional ways26,28. Technical replicate experiments (triplicates for ten individuals) demonstrated that CNV calls are highly replicable (Supplementary Table 6), and that noisier experiments are characterized by higher false-negative rates, rather than higher false-positive rates (Supplementary Fig. 2). Heritability of CNVs within trios was investigated at 67 biallelic CNVs at which CNV genotypes could be inferred (Fig. 2; see also Supplementary Table 7). Of 12,060 biallelic CNV genotypes, only ~0.2% exhibited mendelian discordance, which probably reflects the genotyping error rate rather than the rate of de novo events at these loci. Additional locus-specific experimental validation was performed on subsets of CNVs (Supplementary Table 4). CNVs called in only a single individual (singleton CNVs) are more likely to be false positives compared with CNVs identified in several individuals. We attempted to validate 50 singleton CNVs called on only one platform (25 from each platform) and 14 singleton CNVs called on both platforms. All 14 singleton CNVs replicated by both platforms were verified as true positives, whereas 38 out of the 50 CNVs called by only one platform were similarly confirmed (false-positive rate of 24%). Extrapolating these validation rates across the entire data set suggests that only 8% (24% multiplied by the frequency of singleton CNVs called on only one platform) of the CNV regions we identify (see below) are likely to be false positives.
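
The arithmetic behind these two percentages can be checked directly. The counts (38 of 50) and the 8% estimate are taken from the text; the fraction of CNV regions that are singletons called on only one platform is not stated, so the value below is back-calculated from the reported figures and should be read as an inference, not a quoted number.

```python
# Validation counts quoted in the text.
single_platform_tested = 50
single_platform_confirmed = 38

false_positive_rate = 1 - single_platform_confirmed / single_platform_tested

# The text reports an overall estimate of 8%. The singleton, single-platform
# fraction implied by that figure is derived here, not quoted from the source.
reported_overall = 0.08
implied_fraction = reported_overall / false_positive_rate

print(f"false-positive rate: {false_positive_rate:.0%}")
print(f"implied singleton single-platform fraction: {implied_fraction:.2f}")
```

Under this reading, roughly one third of the CNV regions would be singletons detected on a single platform.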

[Figure 2 image not reproduced. Panel a: histograms of frequency against log2 ratio for WGTP clones Chr8tp-17E9, Chr1tp-31C8, Chr5tp-22E4, Chr6tp-5C12 and Chr6tp-11A11. Panel b: pedigree samples NA12144, NA12145 and NA10846; NA06994, NA07000 and NA07029; NA18504, NA18505 and NA18503; NA18501, NA18502 and NA18500.]

A genome-wide map of copy number variation

The average number of CNVs detected per experiment was 70 and 24
for the WGTP and 500K EA platforms, respectively (Supplementary
Tables 8–10). Owing to the nature of the comparative analysis, each
WGTP experiment detects CNVs in both test and reference genomes,
whereas each 500K EA experiment detects CNV in a single genome.
The median size of CNVs from the two platforms was 228 kb
(WGTP) and 81 kb (500K EA), and the mean size was 341 kb and
206 kb, respectively. Consequently, the average length of the genome
shown to be copy number variable in a single experiment is 24 Mb
and 5 Mb on the WGTP and 500K EA platforms, respectively. The
larger median size of the WGTP CNVs partially reflects inevitable
overestimation of CNV boundaries on a platform comprising large-
insert clones, as CNV encompassing only a fraction of a clone can be
detected, but will be reported as if the whole clone was involved.
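
The per-experiment totals follow from multiplying the average number of CNVs by the mean CNV size, as this short check shows. All input numbers are those quoted in the paragraph above; only the dictionary layout is an illustrative choice.

```python
platforms = {
    "WGTP": {"cnvs_per_experiment": 70, "mean_size_kb": 341},
    "500K EA": {"cnvs_per_experiment": 24, "mean_size_kb": 206},
}

# Average copy-number-variable length per experiment, in megabases.
variable_mb = {
    name: p["cnvs_per_experiment"] * p["mean_size_kb"] / 1000
    for name, p in platforms.items()
}

for name, mb in variable_mb.items():
    print(f"{name}: {mb:.1f} Mb")
```

The products, about 23.9 Mb and 4.9 Mb, round to the 24 Mb and 5 Mb reported.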

By merging overlapping CNVs identified in each individual, we
delineated a minimal set of discrete copy number variable regions
(CNVRs) among the 270 samples (Fig. 3; see also Supplementary
Table 11). We identified 913 CNVRs on the WGTP platform and
980 CNVRs on the 500K EA platform and mapped their genomic
distribution (Fig. 4). Approximately half of these CNVRs were called
in more than one individual and 43% of all CNVs identified on one
platform were replicated on the other. Combining the data resulted
in a total of 1,447 discrete CNVRs, covering 12% (~360 Mb) of the
human genome. Using locus-specific quantitative assays on a subset
of regions we validated 173 (12%) of these CNVRs (Supplementary
Tables 4 and 12). A minority (30%) of these 1,447 CNVRs overlapped those identified in previous studies1–3,5–8,29. Combining different classes of experimental replication revealed that 957 (66%) of the 1,447 CNVRs detected here have been replicated on both WGTP and 500K EA platforms, or with a locus-specific assay, or in another individual, or in a previous study (Supplementary Table 12). Whole-genome views of CNV show that although common, large-scale CNV is distributed in a heterogeneous manner throughout the genome (Supplementary Fig. 6), no large stretches of the genome are exempt from CNV (Fig. 4), and the proportion of any given chromosome susceptible to CNV varies from 6% to 19% (Supplementary Fig. 7).

Gaps within the reference human genome assembly have an extremely high likelihood of being associated with CNVs; out of the 345 gaps in the build 35 assembly, 48% (164 out of 345) are flanked or overlapped by CNVRs. This finding highlights the complexity in generating a reference sequence in regions of structural dynamism and emphasizes the need for ongoing characterization of these genomic regions.

Figure 2 | Heritability of five CNVs in four HapMap trios. a, The distribution of WGTP log2 ratios at five CNVs with genotype information. Each histogram of log2 ratios in 270 HapMap individuals exhibits three clusters, each corresponding to a genotype of a biallelic CNV, with the two alleles depicted by broken and complete bars, representing lower and higher copy number alleles, respectively. Red lines above each histogram denote log2 ratios in the 12 individuals represented in b. b, Mendelian inheritance of five CNVs in four parent–offspring trios. The individual CNVs were genotyped from WGTP clones: green, Chr8tp-17E9; yellow, Chr1tp-31C8; blue, Chr5tp-22E4; red, Chr6tp-5C12; black, Chr6tp-11A11.

[Figure 3 image not reproduced. Schematic labels: individuals A to E; loci with both overlaps <threshold, one overlap >threshold, and both overlaps >threshold; CNV regions (CNVR); CNVs (both overlaps >threshold); CNV ends (enriched for breakpoints). Thresholds: WGTP, 40% of length; 500K EA, 30% of SNPs.]

Figure 3 | Defining CNVRs, CNVs and CNV ends. Overlapping CNVs called in five individuals are shown schematically for four loci (in blue); dashed lines indicate overlap. Copy number variable regions (CNVRs) represent the union of overlapping CNVs (in green). Independent juxtaposed CNVs (in black) are identified by requiring that only individual-specific CNVs that overlap by more than a threshold proportion be merged. Intervals encompassing CNV breakpoints (in red) are defined using platform-dependent criteria (Supplementary Methods), and contain a significant paucity of recombination hotspots76,77 (Supplementary Table 13), which results from the enrichment of segmental duplications within which fewer inferred recombination hotspots reside.
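
The definition of a CNVR as the union of overlapping CNVs (Fig. 3) corresponds to a standard interval merge. The sketch below is a minimal version for calls on a single chromosome; the function name `merge_into_cnvrs` and the coordinates are hypothetical, and any handling of merely adjacent calls or platform-specific rules in the study is not modelled.

```python
def merge_into_cnvrs(cnvs):
    """Union overlapping (start, end) intervals on one chromosome into regions."""
    regions = []
    for start, end in sorted(cnvs):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it if this call reaches further.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]

# Hypothetical CNV calls pooled across individuals (coordinates in kb).
calls = [(100, 250), (200, 400), (380, 420), (900, 1200), (950, 1000)]
cnvrs = merge_into_cnvrs(calls)
print(cnvrs)
```

Because any overlap triggers a merge, a chain of partially overlapping calls collapses into one region, which is why a single CNVR can contain distinct events.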

Comparing the CNVRs identified on the two platforms reveals
that the WGTP and 500K EA platforms largely complement one
another. The 500K EA platform is better at detecting smaller CNVs
(Supplementary Fig. 8), whereas the WGTP platform has more power
to detect CNVs in duplicated genomic regions (Supplementary Table
13) where 500K EA coverage is poorer30.

Some CNVRs encompass two or more independent juxtaposed
CNVs. For example, a small deletion found in one individual over-
lapping a much larger duplication in another individual was merged
into a single CNVR, despite these representing distinct events. To
delineate independent CNVs (CNV events) we applied more strin-
gent merging criteria to separate juxtaposed CNVs (Fig. 3), and
identified 1,116 and 1,203 CNVs on the WGTP and 500K EA platforms, respectively (Fig. 5; see also Supplementary Table 11). We classified these CNVs into five types: (1) deletions; (2) duplications; (3) deletions and duplications at the same locus; (4) multi-allelic loci; and (5) complex loci whose precise nature was difficult to discern. Owing to the inherently relative nature of these comparative data, it was impossible to determine unambiguously the ancestral state for most CNVs, and hence whether they are deletions or duplications. Here we adopted the convention of assuming that the minor allele is the derived allele31, thus deletions have a minor allele of lower copy number and duplications have a minor allele of higher copy number. Approximately equal numbers of deletions and duplications were identified on the WGTP platform, whereas deletions outnumbered duplications by approximately 2:1 on the 500K EA platform. In addition, 33 homozygous deletions (relative to the reference sequence) identified on the 500K EA platform were experimentally validated with locus-specific assays (Supplementary Table 14). Most (27 out of 33) of these have not been observed in a previous genome-wide survey of deletions7.

[Figure 4 image not reproduced. Ideograms of chromosomes 1 to 22, X and Y, with a key for CNVR length, call frequency, and CNVRs associated or not associated with segmental duplications.]

Figure 4 | Genomic distribution of CNVRs. The chromosomal locations of 1,447 CNVRs are indicated by lines to either side of ideograms. Green lines denote CNVRs associated with segmental duplications; blue lines denote CNVRs not associated with segmental duplications. The length of right-hand side lines represents the size of each CNVR. The length of left-hand side lines indicates the frequency that a CNVR is detected (minor call frequency among 270 HapMap samples). When both platforms identify a CNVR, the maximum call frequency of the two is shown. For clarity, the dynamic range of length and frequency are log transformed (see scale bars). All data can be viewed at the Database of Genomic Variants (http://projects.tcag.ca/variation/).
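
The more stringent merging used to separate juxtaposed CNVs can be illustrated with a reciprocal-overlap test. The sketch assumes that "both overlaps >threshold" in Fig. 3 means each call must be covered by the other for more than the threshold fraction of its own length, and it uses the 40%-of-length figure given there for the WGTP platform. The function names and coordinates are hypothetical, and the SNP-count criterion used for the 500K EA platform is not modelled.

```python
def overlap_fractions(a, b):
    """Fraction of each interval's length that is covered by the other."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return shared / (a[1] - a[0]), shared / (b[1] - b[0])

def same_cnv_event(a, b, threshold=0.40):
    """Treat two calls as one CNV only when BOTH overlap fractions exceed the threshold."""
    frac_a, frac_b = overlap_fractions(a, b)
    return frac_a > threshold and frac_b > threshold

# Hypothetical calls (coordinates in kb).
small_deletion = (1000, 1100)
large_duplication = (900, 2000)
similar_call = (950, 1900)

print(same_cnv_event(small_deletion, large_duplication))  # kept separate
print(same_cnv_event(large_duplication, similar_call))    # merged
```

This reproduces the example in the text: a small deletion lying inside a much larger duplication falls within the same CNVR but is kept as an independent CNV.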

To investigate mechanisms of CNV formation, we studied the
sequence context of sites of CNV. Non-allelic homologous recom-
bination can generate rearrangements as a result of recombination
between highly similar duplicated sequences32,33. Segmental duplications are defined as sequences in the reference genome assembly sharing >90% sequence similarity over >1 kb with another genomic location34,35. We found that 24% of the 1,447 CNVRs were associated with segmental duplications, a significant enrichment (P < 0.05). This association results from two factors: (1) rearrangements generated by non-allelic homologous recombination; and (2) not all annotated segmental duplications are fixed in humans, but are, in fact, CNVs. This latter point highlights the essentially arbitrary nature of defining segmental duplications on the basis of a single genome sequence (albeit derived from several individuals).

[Figure 5 image not reproduced. For each CNV class, an example histogram of frequency against log2 ratio, with the counts below.]

CNV class                 WGTP, n (% SegDup associated)   500K EA, n (% SegDup associated)
Deletion                  445 (23.6)                      676 (14.9)
Duplication               423 (41.4)                      406 (37.2)
Deletion & duplication    98 (81.6)                       65 (66.2)
Multi-allelic             19 (94.7)                       12 (91.7)
Complex                   131 (70.2)                      44 (79.5)
Total                     1,116                           1,203

Figure 5 | Classes of CNVs. CNVs identified from WGTP and 500K EA platforms can be classified from the population distribution of log2 ratios (exemplified with WGTP data) into five different types (see text). Biallelic CNVs (deletions and duplications) can be genotyped if the clusters representing different genotypes are sufficiently distinct. The numbers of each class of CNV identified on WGTP and 500K EA platforms are given, along with the proportion of those CNVs that overlap segmental duplications. The overall proportion of CNVRs overlapping segmental duplications was 20% and 34% on the 500K EA and WGTP platforms, respectively.

The likelihood of a CNV being associated with segmental duplica-
tions depended on its length and its classification: multi-allelic
CNVs, complex CNVs and loci at which both deletions and duplica-
tions occurred were markedly enriched for segmental duplications
(Fig. 5; see also Supplementary Fig. 9). This is not surprising given
the role that non-allelic homologous recombination has been shown
to have in generating complex structural variation36, arrays of tan-
dem duplications that vary in size37, and reciprocal deletions and
duplications38.

The likelihood of a segmental duplication being associated with a
CNV was greater for intrachromosomal duplications than for inter-
chromosomal duplications, and was highly correlated with increas-
ing sequence similarity to its duplicated copy (Supplementary Fig.
10). Non-allelic homologous recombination is known to operate
mainly on intrachromosomal segmental duplications and to require
97–100% sequence similarity between duplicated copies33,39.

This role for non-allelic homologous recombination in generating
CNVs in duplicated regions of the genome is supported by the
enrichment of segmental duplications within intervals that probably
contain the breakpoints of the CNV (Fig. 3). We identified 88 CNVs
from the 500K EA platform and 53 CNVs from the WGTP platform
that contain a pair of segmental duplications, one at either end. These
pairs of segmental duplications were biased towards high (>97%)
sequence similarity, and were more frequently associated with the
longest CNVs (Supplementary Fig. 11). In addition to segmental
duplications, there are other types of sequence homologies that can
promote non-allelic homologous recombination, for example, dis-
persed repetitive elements, such as Alu elements40. We performed an
exhaustive search for sequence homology of all kinds41 and identified
121 CNVs from the 500K EA platform