Hi,

I'm trying to determine which of my SNPs from WG-NGS are independent of LD, i.e. tagging SNPs. The downstream use is testing genotypes against a trait: once the number of independent tests is known, significance thresholds can be set accordingly.

I've tried different window sizes and step sizes, and they give quite a range of results, both in the number of independent SNPs and in the apparent population structure (not surprisingly).

How does one select the correct window and step sizes?

I guess larger window sizes are preferred if one has the RAM, and that a step size spanning more SNPs than the window will result in much of the genome being ignored. Large window sizes with small step sizes would maybe give a more random selection of SNPs.
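To make my guess about window and step sizes concrete, here is a toy sketch of greedy sliding-window pruning. It is only an illustration under my own assumptions, not the algorithm of any particular tool: the function names `r_squared` and `prune` are made up, both window and step are counted in SNPs, the r2 threshold of 0.5 is arbitrary, and the genotype matrix is hand-made.

```python
import numpy as np


def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    r = np.corrcoef(a, b)[0, 1]
    return r * r


def prune(genotypes, window, step, threshold=0.5):
    """Greedy sliding-window pruning (toy version, hypothetical helper).

    genotypes: array of shape (n_samples, n_snps), coded 0/1/2.
    window, step: both counted in SNPs here, for simplicity.
    Returns (keep, visited) boolean arrays of length n_snps.
    """
    n_snps = genotypes.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    visited = np.zeros(n_snps, dtype=bool)
    for start in range(0, n_snps, step):
        stop = min(start + window, n_snps)
        visited[start:stop] = True
        for i in range(start, stop):
            if not keep[i]:
                continue
            # Drop every later SNP in the window that is in high LD with SNP i.
            for j in range(i + 1, stop):
                if keep[j] and r_squared(genotypes[:, i], genotypes[:, j]) > threshold:
                    keep[j] = False
    return keep, visited


# Hand-made data: columns 0, 1 and 4 are identical, as are columns 2 and 3;
# the two groups are uncorrelated with each other.
a = np.array([0, 0, 2, 2])
b = np.array([0, 2, 0, 2])
genotypes = np.column_stack([a, a, b, b, a])

for window, step in [(2, 2), (5, 1), (2, 3)]:
    keep, visited = prune(genotypes, window, step)
    print(f"window={window} step={step} "
          f"kept={keep.sum()} never_visited={(~visited).sum()}")
```

In this toy data the small window never compares column 0 with its duplicate at column 4, while the large window does, and a step larger than the window leaves column 2 unexamined.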

At an intermediate step size of 100 SNPs, the number of independent SNPs drops from 246k to 26k between window sizes of 10 kb and 50 kb. This is a nearly ten-fold difference, which would be important when counting independent tests.
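To show why this matters, here is a small sketch of the thresholds the two counts would imply. It assumes a plain Bonferroni correction at alpha = 0.05, which is only one possible choice and is my assumption for illustration; the function name `bonferroni_threshold` is made up.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate, for illustration only

# Independent SNP counts from the two window sizes reported above.
counts = {"10 kb window": 246_000, "50 kb window": 26_000}

for label, n_tests in counts.items():
    print(f"{label}: {n_tests} tests -> p < {bonferroni_threshold(ALPHA, n_tests):.2e}")

# Ratio between the two counts, i.e. how much the threshold shifts.
print(f"ratio: {counts['10 kb window'] / counts['50 kb window']:.2f}")
```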

The sample population is 220 Drosophila melanogaster; the points on the scatter plots are coloured by the order in which they were sequenced. On average, the pairwise r2 levels out after 250 bp, and there is one SNP every 150 bp. The separation of the population into four groups is supported by Fst analysis, which shows two distinct regions causing this.
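As a back-of-envelope check, the SNP density can be used to put windows and steps on the same scale. This sketch assumes SNPs are evenly spaced at the average of one per 150 bp and that 1 kb = 1000 bp; the helper names are made up, and it says nothing about how any particular tool treats the step.

```python
SNP_SPACING_BP = 150  # average spacing: one SNP every 150 bp


def snps_per_window(window_bp, spacing_bp=SNP_SPACING_BP):
    """Expected number of SNPs in a window, assuming even spacing."""
    return window_bp / spacing_bp


def step_in_bp(step_snps, spacing_bp=SNP_SPACING_BP):
    """Approximate physical distance covered by a step given in SNPs."""
    return step_snps * spacing_bp


for window_kb in (10, 50):
    n = snps_per_window(window_kb * 1000)
    print(f"{window_kb} kb window holds about {n:.0f} SNPs")

print(f"a step of 100 SNPs spans about {step_in_bp(100) / 1000:.0f} kb")
```

Under these assumptions a 100-SNP step covers more ground than a 10 kb window but less than a 50 kb window.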

The plots are at https://zenodo.org/record/814933/files/lhm_ibd_tests.png and the code is at https://doi.org/10.5281/zenodo.814932

Any nice comments, especially useful ones, are most welcome. Questions are welcome too.

Sincerely,

Will

