Raw whole-genome shotgun sequencing reads from the 1001 Genomes Project were downloaded from the NCBI SRA repository (BioProject PRJNA273563, https://www.ncbi.nlm.nih.gov/bioproject/PRJNA273563). Reads were mapped to the TAIR10 reference genome assembly using BWA-MEM v. 0.7.15 with default parameters. Picard Tools v. 2.7.1 was used for sorting the alignments, and duplicates were marked with SAMtools v. 1.3.1. Alignments of 1,064 samples were used as input for Genome STRiP (v. 2.00.1774).
Genome STRiP requires precomputed “reference metadata” based on the A. thaliana TAIR10 genome sequence, as described in the software documentation (http://software.broadinstitute.org/software/genomestrip/node_ReferenceMetadata.html). All required files were generated according to this documentation except for the lcmask.fasta file (low-complexity mask), for which the regions marked as “Low complexity”, “Satellites” and “Simple repeat” were taken from RepeatMasker results. Additionally, the TAIR10 reference sequence contains ambiguous nucleotides, which are not permitted by the CNVDiscovery pipeline. Therefore, positions with nucleotides other than A, C, G, T or N were changed to N and masked in the genome alignability mask (svmask file) using our own scripts.
The Genome STRiP SVGenotyper module was used for genotyping genes in the worldwide population. Gene coordinates were taken from the Araport11 annotation: for each gene, the outermost left and right coordinates were found by comparing all models of that gene, and these coordinates were used for genotyping. Prior to genotyping, nonunique segments in the reference genome were identified by generating subsequences with a 40-bp sliding window and a 1-bp step and aligning them to the reference genome; the nonunique segments were masked. All variants in the input VCF files were marked with an SVTYPE tag specifying a general copy number variant ("CNV"). We ultimately obtained genotyping data for 26,845 genes and 1,060 accessions.
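The three preprocessing steps described above (replacing ambiguous nucleotides, finding the outermost coordinates of each gene, and flagging nonunique 40-bp windows) can be illustrated with a minimal Python sketch. This is not the authors' script: all function names are hypothetical, the gene identifiers in the usage note are made up, and the uniqueness check is a simplified stand-in that counts exact forward-strand matches instead of aligning the subsequences to the reference as was done in the original analysis.

```python
import re
from collections import Counter

# Anything other than A, C, G, T or N (either case) counts as ambiguous.
AMBIGUOUS = re.compile(r"[^ACGTNacgtn]")


def mask_ambiguous(seq):
    """Replace ambiguous nucleotides with N.

    Returns the cleaned sequence and the 0-based positions that were
    changed, so the same positions can also be added to a mask file.
    """
    positions = [m.start() for m in AMBIGUOUS.finditer(seq)]
    return AMBIGUOUS.sub("N", seq), positions


def gene_spans(models):
    """Collapse gene models into one outermost (start, end) span per gene.

    `models` is an iterable of (gene_id, start, end) tuples, one per model.
    """
    spans = {}
    for gene_id, start, end in models:
        lo, hi = spans.get(gene_id, (start, end))
        spans[gene_id] = (min(lo, start), max(hi, end))
    return spans


def nonunique_mask(seq, k=40):
    """Flag bases covered by any k-bp window that occurs more than once.

    Assumption: exact string matching on the forward strand only, which
    simplifies the alignment-based procedure described in the text.
    """
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_windows))
    masked = [False] * len(seq)
    for i in range(n_windows):
        if counts[seq[i:i + k]] > 1:
            for j in range(i, i + k):
                masked[j] = True
    return masked
```

For example, `gene_spans([("geneA", 100, 500), ("geneA", 80, 450)])` gives the span `(80, 500)` for the hypothetical gene `geneA`. Counting every window in memory is only practical for short sequences; a genome-scale run would need an aligner or a disk-backed k-mer counter.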
Comparison of the unrounded copy numbers and the integer copy number genotypes with the results of MLPA assays for a subset of CNV-genes indicated that the copy number genotypes were frequently assigned incorrectly by SVGenotyper. Therefore, we did not use the genotype confidence filter integrated into the Genome STRiP software. Instead, a custom filter based on the distribution of unrounded copy numbers in the Col-0 accession was used for marking outliers, defined as genes falling below the lower quartile or above the upper quartile of the copy number distribution in this accession (IQR filter). The threshold values were calculated separately for CNV-genes, genes overlapped by low-confidence CNVs and genes not overlapped by any CNV (see the associated publication for details about CNV identification).
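A quartile-based filter of this kind could be sketched as follows. This is an illustration under stated assumptions, not the filter actually used: the function name, the class labels and the `k` parameter are hypothetical. With `k = 0` the bounds are the lower and upper quartiles themselves, which matches the wording above; a nonzero `k` widens the bounds by `k` times the interquartile range, as in the common Tukey fence, and is included only as an option.

```python
from collections import defaultdict
from statistics import quantiles


def iqr_outliers(copy_numbers, gene_class, k=0.0):
    """Return the set of genes whose copy number lies outside the bounds.

    copy_numbers: gene ID -> unrounded copy number in the reference accession
    gene_class:   gene ID -> class label (thresholds are computed per class)
    k:            hypothetical widening factor; 0 uses the quartiles directly
    """
    # Group copy numbers by gene class.
    by_class = defaultdict(list)
    for gene, cn in copy_numbers.items():
        by_class[gene_class[gene]].append(cn)

    # Compute lower and upper bounds separately for each class.
    # statistics.quantiles needs at least two values per class.
    bounds = {}
    for cls, values in by_class.items():
        q1, _median, q3 = quantiles(values, n=4)
        iqr = q3 - q1
        bounds[cls] = (q1 - k * iqr, q3 + k * iqr)

    # Mark genes that fall outside the bounds of their own class.
    outliers = set()
    for gene, cn in copy_numbers.items():
        low, high = bounds[gene_class[gene]]
        if cn < low or cn > high:
            outliers.add(gene)
    return outliers
```

The function takes two dictionaries keyed by gene ID, so the three gene classes named above can be passed as three distinct labels and each receives its own thresholds.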

More information:
Agnieszka Zmienko, Malgorzata Marszalek-Zenczak, Pawel Wojciechowski, Anna Samelak-Czajka, Magdalena Luczak, Piotr Kozlowski, Wojciech M. Karlowski, Marek Figlerowicz
AthCNV - a map of DNA copy number variations in the Arabidopsis thaliana genome