Statistical analysis tail, Genetic support level, Genetic prevalence workflow, and PRS
Statistical analysis (table tail)
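| Statistical analysis | Rare variants are by definition very infrequent, so comparing their effects statistically requires a very large sample. To overcome this limitation, methods were proposed that aggregate the rare variants within the same gene (or a defined region) and analyze them as a unit. This is the gene-level aggregation test. In other words, whereas GWAS analyzes the effect of each SNP individually (single variant association test), rare-variant analysis groups multiple variants per gene and compares the effect of the gene (multiple variant association test). The key question is which criteria to use when grouping variants by gene; in general the analyst chooses a variant frequency threshold (minor allele frequency, 1%), a functional classification of the variants, and similar filters. In addition, because rare variants do not all have the same effect, the rare-variant allele frequency, in-silico prediction scores, and similar measures are used to adjust for each variant's effect. The table below summarizes the advantages, disadvantages, and analysis software of these rare-variant methods. To summarize briefly: the burden test applies when the variant effects share a consistent direction; the Sequence Kernel Association Test (SKAT) is the more powerful test when the variant effects point in different directions or when only a small proportion of the variants is causal; SKAT-O ... computes an optimized rho value, so that both approaches are taken into account. It is a combined method: SKAT-O statistically optimizes the results of the two … |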
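| What next? | 1. Replication: confirm in a larger patient group (replication cohort) → 2. Validation: biological confirmation in animal and cell experiments |
| Limitation | (Still) multiple-testing error. The two-dimensional approach cannot account for structural variants, which are common in individuals, and it ignores the three-dimensional spatial effects that actually exist. It also ignores the four-dimensional temporal effects of gene expression. Because the design is a cross-sectional study, it is hard to apply the resulting odds ratio directly as a disease risk factor. |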
| output | Calculates which variants tend to be found more frequently in groups of people with a given disease. The score serves as the best prediction for the trait that can be made when taking into account variation in multiple genetic variants.[3][4][5][6] But polygenic scores … |
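The gene-level aggregation described in the Statistical analysis row can be sketched as follows. This is a minimal illustration, not a full burden test: it only collapses the rare variants of one gene into a per-sample burden score and compares the mean score between cases and controls. The function names, the toy genotypes, and the default weight of 1.0 per variant are assumptions made for illustration; a real analysis would regress the phenotype on the score with covariates, or use dedicated SKAT / SKAT-O software.

```python
def burden_scores(genotypes, mafs, weights=None, maf_threshold=0.01):
    """Collapse rare variants of one gene into one score per sample.

    genotypes: one list per sample with alt-allele counts (0/1/2) per variant.
    mafs:      minor allele frequency of each variant.
    weights:   optional per-variant weights (e.g. from an in-silico score);
               defaults to 1.0 for every variant (an assumption).
    """
    n_variants = len(mafs)
    if weights is None:
        weights = [1.0] * n_variants
    # Keep only variants below the frequency threshold (1% by default).
    keep = [k for k in range(n_variants) if mafs[k] < maf_threshold]
    return [sum(weights[k] * sample[k] for k in keep) for sample in genotypes]


def mean_burden_difference(scores, is_case):
    """Mean burden in cases minus mean burden in controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


# Toy data (invented): 3 samples, 3 variants; the second variant is common.
toy_genotypes = [[1, 0, 2], [0, 0, 1], [0, 1, 0]]
toy_mafs = [0.005, 0.2, 0.001]
toy_scores = burden_scores(toy_genotypes, toy_mafs)
toy_diff = mean_burden_difference(toy_scores, [True, False, False])
```

The common variant (MAF 0.2) is excluded by the threshold, so it does not contribute to any sample's score.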
Genetic support Level
- No scoring system is perfect: these levels are meant as guidelines and rely on analyst judgment, as there may be nuances and exceptions
Genetic support level
| Mendelian genetics support | Rare variant support | GWAS support | gene-level support |
|---|---|---|---|
| Mutations in gene are linked to a disorder with relevant clinical features / underlying biology | Gene-based test: P < Bonferroni threshold | (criteria 1 to 6) | |
| 1 | 1 | 1 | very strong |
| 1 | 1 | 0 | very strong |
| 1 | 0 | 1 | very strong |
| 1 | 0 | 0 | very strong / strong |
| 0 | 1 | 1 | very strong |
| 0 | 1 | 0 | very strong / strong |
| 0 | 0 | 1 | (refer to GWAS support criteria) |
| 0 | 0 | 0 | No support |
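The table above can be read as a simple lookup from the three evidence flags to a gene-level support label. The sketch below encodes the rows exactly as listed; the function and constant names are hypothetical, and the ambiguous "very strong / strong" rows are returned as-is because the final call is left to analyst judgment.

```python
# (Mendelian support, rare variant support, GWAS support) -> gene-level support.
# Labels are copied from the table rows; names are illustrative only.
SUPPORT_TABLE = {
    (1, 1, 1): "very strong",
    (1, 1, 0): "very strong",
    (1, 0, 1): "very strong",
    (1, 0, 0): "very strong / strong",
    (0, 1, 1): "very strong",
    (0, 1, 0): "very strong / strong",
    (0, 0, 1): "refer to GWAS support criteria",
    (0, 0, 0): "no support",
}


def gene_level_support(mendelian: bool, rare_variant: bool, gwas: bool) -> str:
    """Return the gene-level support label for one gene."""
    return SUPPORT_TABLE[(int(mendelian), int(rare_variant), int(gwas))]
```

For example, `gene_level_support(True, False, False)` returns `"very strong / strong"`, which still needs an analyst decision.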
→ ‘Genetic support level’ dropdown selection:
- very strong
- strong (very strong and strong could be merged)
- medium
- low
- some / no deep dive (if gene was identified in GWAS analysis but full evaluation of the gene has NOT been conducted)
- none (if gene was identified in OMICs analysis and evaluation of genetic evidence has been conducted)
- UNK (if gene was identified in OMICs analysis and NO evaluation of genetic evidence has been conducted)
GWAS criteria
| GWAS criteria | level of support |
|---|---|
| Prerequisite: robust association with phenotype(s) relevant to indication | |
| 1. fine-mapped variant with support from functional study (in vitro assay or in vivo model) | very strong |
| 2. fine-mapped variant = protein coding | strong |
| 3. fine-mapped variant = non-coding with potential functional link to gene via PHICe QTL | medium |
| 4. GWAS signal colocalizes with eQTL or pQTL in relevant tissue/cell type | medium |
| 3. + 4. (both criteria met) | strong |
| 5. GWAS signal colocalizes with eQTL or pQTL, but tissue relevance is unclear | low |
| 6. Closest gene to GWAS signal | low |
| 7. No GWAS support | None |
¹ Consider evidence with respect to other genes at locus
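A minimal sketch of how the GWAS criteria could be turned into a support level. Two points are assumptions, not stated in the table: when several criteria are met, the highest resulting level is reported, and a failed prerequisite yields "none". The function name and the `robust_association` parameter are hypothetical.

```python
LEVEL_ORDER = ["none", "low", "medium", "strong", "very strong"]


def gwas_support(criteria_met, robust_association=True):
    """Map the set of satisfied GWAS criteria (numbers 1 to 6) to a level.

    Assumption: the highest level among the satisfied criteria wins.
    """
    if not robust_association:  # prerequisite from the table
        return "none"
    criteria_met = set(criteria_met)
    levels = ["none"]  # criterion 7: no GWAS support
    if 1 in criteria_met:
        levels.append("very strong")
    if 2 in criteria_met:
        levels.append("strong")
    if {3, 4} <= criteria_met:  # row "3. + 4."
        levels.append("strong")
    if criteria_met & {3, 4}:
        levels.append("medium")
    if criteria_met & {5, 6}:
        levels.append("low")
    return max(levels, key=LEVEL_ORDER.index)
```

The footnote still applies: the evidence for other genes at the same locus has to be weighed by the analyst and is not captured here.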
Add separate column to indicate OMICs support?
Discussions with CompBio
Genetic prevalence
Workflow to predict genetic prevalence
1. Identify pathogenic mutations
- PRKN short variants from gnomAD v3.1 database* (76,156 genomes, hg38)
  - based on in silico prediction (VEP LoF, SpliceAI, CADD, Helix) (③)
  - reported in databases (MDSgene, ClinVar, OMIM, UniProt) (①)
- PRKN structural variants from gnomAD v2 database* (10,847 genomes, lifted over to hg38) (②)
2. Infer heterozygous carrier frequency (probability of carrying a pathogenic allele)
For each variant v, calculate the allele frequency based on heterozygotes only:

$$\mathrm{AF}_v = \frac{\mathrm{AC} - 2 \cdot \mathrm{nhomalt}}{\mathrm{AN}}$$

where AC = number of alt alleles, nhomalt = number of hom alt genotypes, and AN = total number of alleles = 2 · number of individuals.

The carrier frequency then follows, with N = number of individuals:

$$\mathrm{CF}_v = \frac{\mathrm{AF}_v \cdot \mathrm{AN}}{N} = \frac{\mathrm{AF}_v \cdot 2N}{N} = 2 \cdot \mathrm{AF}_v$$
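Step 2 can be sketched in a few lines. The sketch takes AC as the count of alternate alleles, nhomalt as the number of homozygous-alternate genotypes, and AN as the total number of alleles; the function names and the example counts are invented for illustration.

```python
def het_allele_frequency(ac, nhomalt, an):
    """Allele frequency based on heterozygotes only: (AC - 2*nhomalt) / AN."""
    if an <= 0:
        raise ValueError("AN must be positive")
    return (ac - 2 * nhomalt) / an


def carrier_frequency(ac, nhomalt, an):
    """Heterozygous carrier frequency: CF = 2 * AF (het-only)."""
    return 2 * het_allele_frequency(ac, nhomalt, an)


# Invented counts: 10 alt alleles, 1 homozygous genotype, 1000 alleles in total.
example_cf = carrier_frequency(ac=10, nhomalt=1, an=1000)
```

With these counts, 8 of the 10 alternate alleles sit in heterozygotes, so the het-only allele frequency is 8/1000 and the carrier frequency is twice that.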
3. Predict the genetic prevalence (how common the disease is in the general population — homozygous and compound heterozygous)
For each pair of variants $i$ and $j$, multiply their CFs: $\mathrm{CF}_i \cdot \mathrm{CF}_j$
| | 6:100 A>T | 6:201 C>T | 6:305 G>T | Exon 3 del |
|---|---|---|---|---|
| 6:100 A>T | (homo) | (het) | (het) | (het) |
| 6:201 C>T | | (homo) | (het) | (het) |
| 6:305 G>T | | | (homo) | (het) |
| Exon 3 del | | | | (homo) |

Diagonal (blue): homozygous. Off-diagonal (yellow): compound heterozygous.
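Add up all the products of CFs and divide the sum by 4 (for illustration):

$$\text{Predicted genetic prevalence} = \frac{1}{4} \sum_{i,j} \mathrm{CF}_i \cdot \mathrm{CF}_j$$

This equals the probability of having an offspring with biallelic pathogenic mutations.

Pedigree note: $\mathrm{CF}_i$ is the probability of carrying a pathogenic allele $i$. The probability of passing allele $i$ to the offspring is $\tfrac{1}{2}\mathrm{CF}_i$. Thus, the probability of the offspring carrying alleles $i$ and $j$ is $(\tfrac{1}{2}\mathrm{CF}_i) \cdot (\tfrac{1}{2}\mathrm{CF}_j) = \tfrac{1}{4}\,\mathrm{CF}_i \cdot \mathrm{CF}_j$.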
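Step 3 can be sketched as a sum over variant pairs. The text does not say whether each compound-het pair is counted once (as in the cells shown in the matrix) or in both parental orders, so the sketch exposes this as an `ordered` flag. The flag, the function name, and the example carrier frequencies are assumptions for illustration. With `ordered=True` the sum equals (Σ CF)² / 4.

```python
from itertools import combinations_with_replacement, product


def predicted_prevalence(carrier_freqs, ordered=True):
    """Sum CF_i * CF_j over variant pairs and divide by 4.

    ordered=True  counts (i, j) and (j, i) separately (either parent may
                  transmit either allele); this is an assumption.
    ordered=False counts each pair once, i.e. diagonal plus one triangle.
    """
    if ordered:
        pairs = product(carrier_freqs, repeat=2)
    else:
        pairs = combinations_with_replacement(carrier_freqs, 2)
    return sum(cf_i * cf_j for cf_i, cf_j in pairs) / 4


# Invented carrier frequencies for two pathogenic variants.
example_cfs = [0.02, 0.01]
prevalence_ordered = predicted_prevalence(example_cfs)
prevalence_unordered = predicted_prevalence(example_cfs, ordered=False)
```

The two conventions differ only in how often the compound-het terms are counted; the homozygous (diagonal) terms are identical in both.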
Emily Wong and Dorothée Diogo
*Additional sources under consideration for future application
polygenic risk score
GRS (=polygenic score, also called a polygenic risk score, genetic risk score, or genome-wide score)
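| goal | Predict effect and impact from the combination of many SNPs |
| database / statistical analysis method | - The score is built on the effect sizes (β values) estimated by GWAS association analysis; it is constructed from the "weights" derived from a genome-wide association study (GWAS). - The effect sizes of the many contributing SNPs are combined by linear regression, and a prediction model is built that corrects for the influence of LD blocks. - More recently, alongside various statistical corrections and approaches, a range of machine-learning approaches has appeared that can exploit non-linear effects. |
| Example | |
| Results / Method / What next? | In the end, the result is reported as an odds ratio (OR). |
| limitation | Only shows correlation, not causation; does not provide a baseline or timeframe for the progression of a disease |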
| how to overcome limitation | |
| output |
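The linear combination of GWAS effect sizes described in the table can be sketched as a weighted sum of allele dosages. This is a minimal illustration without LD correction or shrinkage; the function name, the SNP identifiers, and the numbers are invented.

```python
def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele dosages: sum of beta_i * dosage_i.

    dosages: SNP id -> effect-allele count (0, 1, or 2) for one individual.
    betas:   SNP id -> GWAS effect size (beta, or log odds ratio).
    Only SNPs present in both inputs contribute.
    """
    shared = betas.keys() & dosages.keys()
    return sum(betas[snp] * dosages[snp] for snp in shared)


# Invented example: two overlapping SNPs, one missing on each side.
example_score = polygenic_score(
    dosages={"snpA": 2, "snpB": 1, "snpX": 1},
    betas={"snpA": 0.2, "snpB": -0.1, "snpC": 0.5},
)
```

When the GWAS reports odds ratios rather than betas, the natural logarithm of the odds ratio is the usual weight.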
Steps of PRS STUDY
- BASE DATA ↔ Independent Samples ↔ TARGET DATA
  - BASE DATA:
    - Summary statistics
    - Betas/ORs as weights in PRS calculation
  - TARGET DATA:
    - Individual-level genotype and phenotype data
    - Often small sample size
- Quality Control
  - Both data sets QC’ed as standard in GWAS
  - Some QC requires special care in PRS, e.g. sample overlap, relatedness, population structure (see Section 2)
  - Retain set of SNPs that overlap between base and target data
- LD adjustment (e.g. clumping)
- Beta shrinkage (e.g. lasso/ridge)
- P-value thresholding (PRS at multiple P)
- Generate PRS + Perform Association Testing
- Out-of-sample PRS testing
  - K-fold cross-validation
  - Test in data separate from base/target
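The LD adjustment and P-value thresholding steps can be sketched together. This is a simplified illustration under stated assumptions: pairwise r² values are given as a precomputed dictionary, clumping ignores physical distance windows, and all names, thresholds, and numbers are invented.

```python
def clump(snps, pvalues, r2, r2_threshold=0.1):
    """Greedy clumping: walk SNPs from most to least significant and keep a
    SNP only if it is not in LD (r2 >= threshold) with an already kept SNP."""
    kept = []
    for snp in sorted(snps, key=lambda s: pvalues[s]):
        if all(r2.get(frozenset((snp, k)), 0.0) < r2_threshold for k in kept):
            kept.append(snp)
    return kept


def prs_at_thresholds(dosages, betas, pvalues, snps, thresholds):
    """Compute one score per P-value threshold from the given SNP set."""
    scores = {}
    for t in thresholds:
        selected = [s for s in snps if pvalues[s] <= t]
        scores[t] = sum(betas[s] * dosages[s] for s in selected)
    return scores


# Invented base-data summary statistics and one target individual.
pvals = {"snpA": 1e-8, "snpB": 1e-6, "snpC": 0.03}
ld = {frozenset(("snpA", "snpB")): 0.8}  # snpA and snpB are in strong LD
index_snps = clump(list(pvals), pvals, ld)
scores = prs_at_thresholds(
    dosages={"snpA": 1, "snpB": 2, "snpC": 2},
    betas={"snpA": 0.3, "snpB": 0.25, "snpC": 0.1},
    pvalues=pvals,
    snps=index_snps,
    thresholds=[5e-8, 0.05],
)
```

Here snpB is dropped because it is in LD with the more significant snpA. The scores at each threshold would then go into association testing in the target data.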
Results of PRS
Uncertain Spans
| location | transcription | uncertainty |
|---|---|---|
| Statistical analysis text | trailing tail of SKAT-O 는 이 둘의 결과를 통계식으로 최적화 … ("SKAT-O statistically optimizes the results of the two …") | the long Korean prose wraps across multiple crop columns and the trailing tokens at the right edge of the body cell are partly cut; reconstructed where unambiguous and elided with … where the trailing fragment is illegible. |
| GWAS criteria row 3 | non-coding with potential functional link to gene via PHICe QTL | reads as PHICe QTL (likely PHIC eQTL / pHi-C eQTL); transcribed verbatim because the OCR of the small abbreviation is uncertain. |
| GWAS criteria rows 4, 5 | with eQTL or pQTL (vs OCR wQTLor pQTL) | reads as eQTL or pQTL based on context (eQTL is the section heading in the document); preserved as eQTL or pQTL. |
| Compound heterozygous matrix | 4×4 yellow/blue grid for variants 6:100 A>T / 6:201 C>T / 6:305 G>T / Exon 3 del | the source figure shows a yellow/blue color matrix; the diagonal (blue) marks homozygous pairs and the off-diagonal (yellow) marks compound-het pairs; reconstructed as a labeled HTML-style table where the cell text indicates (homo) or (het). |