Project Documentation: Predictive Tool for Genetic Mutation Effects

1. Project Objective

The primary goal of this project is to develop the best computational tool (or approach) for predicting the functional effects of genetic mutations for a specific, assigned genetic disorder. This project will challenge you to integrate knowledge from genetics, bioinformatics, and data science to address a real-world problem in personalized medicine. You will conduct a deep dive into the molecular basis of a disease, curate relevant data, and apply computational methods to build and evaluate a predictive model.

2. Project Description

Each student will be assigned a monogenic disorder and its associated gene(s). Your task is to design a workflow or tool that predicts the pathogenicity of novel variants in that gene. This involves:

  1. Literature Review: A comprehensive review of the assigned genetic disorder, including its clinical features, inheritance patterns, and the function of the associated gene(s). You must also research existing computational methods used to predict mutation effects.
  2. Data Curation: Gathering and cleaning variant data from established genetic databases. You will need to collect a set of known pathogenic (disease-causing) and benign (harmless) mutations to use for training and testing your tool.
  3. Feature Engineering: Identifying and calculating relevant predictive features for the mutations. These can include sequence conservation scores, biochemical properties of amino acid changes, structural information, and other annotations.
  4. Tool Development/Selection: Implement a predictive model from scratch. This could be a rule-based system, a statistical model, or a machine learning classifier (e.g., Logistic Regression, Random Forest, Support Vector Machine). Alternatively, you can build on existing prediction tools (e.g., SIFT, PolyPhen-2, CADD, REVEL). In that case, your task is to systematically evaluate their performance on your disease-specific dataset and determine which tool, or combination of tools, provides the most accurate predictions for your specific gene.
  5. Evaluation: Rigorously assessing the performance of your developed or selected tool(s) using appropriate metrics such as accuracy, sensitivity, specificity, and the Matthews correlation coefficient (MCC).
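
The metrics named in the evaluation step can all be derived from the four confusion-matrix counts. The following is a minimal sketch, assuming labels are encoded as 1 (pathogenic) and 0 (benign). The function names and the example labels are illustrative only, and Scikit-learn offers equivalent functions if you prefer a library.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = pathogenic, 0 = benign)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and MCC from binary predictions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),  # true positive rate
        "specificity": safe_div(tn, tn + fp),  # true negative rate
        "mcc": safe_div(tp * tn - fp * fn, mcc_denominator),
    }


# Made-up labels for illustration: 4 pathogenic and 4 benign variants.
example_truth = [1, 1, 1, 1, 0, 0, 0, 0]
example_calls = [1, 1, 1, 0, 0, 0, 0, 1]
example_metrics = binary_metrics(example_truth, example_calls)
```

MCC is worth reporting alongside accuracy because pathogenic and benign variants are often unequally represented, and accuracy alone can look good on an imbalanced dataset.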

3. Key Steps & Suggested Timeline 🗓️

  • Milestone 1: Disease Selection & Project Proposal (Due Week 5)
    • Receive your assigned genetic disorder.
    • Conduct preliminary research on the disease and associated gene(s).
    • Submit a one-page Project Proposal outlining your initial understanding of the problem, the gene(s) involved, potential data sources, and your chosen approach (Development vs. Comparison).
  • Milestone 2: Deep Dive & Data Collection (Due Week 7)
    • Perform an in-depth literature review.
    • Identify and download variant data from databases like ClinVar, HGMD, and gnomAD.
    • Clean and label your data, creating a high-quality dataset of pathogenic and benign variants.
  • Milestone 3: Feature Engineering & Implementation
    • Calculate predictive features for each variant in your dataset.
    • Begin coding your predictive model.
    • Run your variant list through several existing web servers or local tools and gather their predictions.
  • Milestone 4: Evaluation & Analysis
    • Implement a cross-validation strategy to test your model’s performance robustly.
    • Calculate key performance metrics and create visualizations (e.g., ROC curves, confusion matrices).
    • Critically analyze your results. Why does your tool perform well or poorly? Which features are most informative? How do existing tools compare for this specific gene?
  • Milestone 5: Final Report & Presentation (Due Week 13)
    • Synthesize your work into a formal scientific report.
    • Prepare a 3-minute presentation summarizing your project.
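
One possible cross-validation strategy for Milestone 4 is stratified k-fold splitting, which keeps the ratio of pathogenic to benign variants roughly equal in every fold. The sketch below uses only the standard library and is one way to do it, not a required method. The function names, the default of 5 folds, and the fixed seed are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0  # rotate the starting fold so leftovers are spread out
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = (position + offset) % k
        offset += len(indices)
    return fold_of


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    fold_of = stratified_folds(labels, k, seed)
    for fold in range(k):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test
```

Each variant appears in exactly one test fold, so averaging a metric over the folds uses every variant for testing once while never testing on data the model was trained on.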

4. Deliverables ✍️

  1. Project Proposal (Milestone 1): A 1-page document detailing your chosen disease, gene(s), and planned methodology.
  2. Final Report: A comprehensive report formatted like a scientific paper, including the following sections:
    • Abstract: A concise summary of your project.
    • Introduction: Background on the disease, gene function, and the problem of variant interpretation.
    • Methods: Detailed description of your data sources, feature engineering, model implementation, and evaluation strategy. This section must be clear enough for someone to replicate your work.
    • Results: Presentation of your findings, including performance metrics and visualizations.
    • Discussion: Interpretation of your results, limitations of your approach, and potential future work.
    • References: Properly formatted citations.
  3. Source Code/Documentation:
    • Submit a well-commented script or a link to a GitHub repository with a README.md file explaining how to run it.
  4. Final Presentation (Finals Week): A 3-minute slide presentation summarizing your project’s objective, methods, key results, and conclusions.

5. Evaluation Criteria 💯

Your project will be graded based on the following rubric:

Category | Weight | Description
Code organization | 20% | Organization in GitHub repository
Novelty in Methodology & Data Curation | 30% | Rigor in data collection and cleaning. Novelty of the methods.
Results & Analysis | 10% | Correctness of the performance evaluation. Depth of interpretation and critical analysis of the results. Quality of figures and tables.
Final Report & Presentation | 40% | Clarity, organization, and scientific writing of the final report. Professionalism and effectiveness of the oral presentation.

Databases:

  • ClinVar: A public archive of reports of the relationships among human variations and phenotypes.
  • OMIM (Online Mendelian Inheritance in Man): A comprehensive catalog of human genes and genetic disorders.
  • HGMD (Human Gene Mutation Database): A collection of known (published) gene lesions responsible for human inherited disease.
  • dbSNP: A general catalog of short genetic variations.
  • gnomAD (Genome Aggregation Database): A database of exome and genome sequencing data from a large number of individuals.
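
Records exported from a database such as ClinVar need to be reduced to a clean binary label before they can be used for training and testing. The sketch below shows one way this could look. It assumes each record is a dictionary with hypothetical "variant" and "significance" keys and that the significance strings resemble the terms listed in the code. Check the actual column names and vocabulary in your own export. The example variants are made up.

```python
# Assumed significance vocabulary; verify it against your own data export.
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}
BENIGN_TERMS = {"benign", "likely benign", "benign/likely benign"}


def label_variant(significance):
    """Return 1 (pathogenic), 0 (benign), or None for anything ambiguous."""
    term = significance.strip().lower()
    if term in PATHOGENIC_TERMS:
        return 1
    if term in BENIGN_TERMS:
        return 0
    return None  # e.g. uncertain significance or conflicting reports


def build_dataset(records):
    """Map each variant to one label, dropping variants with contradictory labels."""
    labels = {}
    contradictory = set()
    for record in records:
        label = label_variant(record["significance"])
        if label is None:
            continue
        variant = record["variant"]
        if variant in labels and labels[variant] != label:
            contradictory.add(variant)
        labels[variant] = label
    return {v: lab for v, lab in labels.items() if v not in contradictory}


# Made-up records for illustration only.
example_records = [
    {"variant": "p.Ala10Val", "significance": "Pathogenic"},
    {"variant": "p.Gly25Ser", "significance": "Likely benign"},
    {"variant": "p.Arg40Gln", "significance": "Uncertain significance"},
    {"variant": "p.Leu55Pro", "significance": "Benign"},
    {"variant": "p.Leu55Pro", "significance": "Pathogenic"},
]
dataset = build_dataset(example_records)
```

Dropping ambiguous and contradictory entries shrinks the dataset, so report how many variants were removed at this step in your Methods section.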

Common Prediction Tools (for comparison):

  • SIFT (Sorting Intolerant From Tolerant): Predicts whether an amino acid substitution affects protein function based on sequence homology.
  • PolyPhen-2 (Polymorphism Phenotyping v2): Predicts the possible impact of an amino acid substitution on the structure and function of a human protein.
  • CADD (Combined Annotation Dependent Depletion): Scores the deleteriousness of single nucleotide variants as well as insertions/deletions.
  • REVEL (Rare Exome Variant Ensemble Learner): An ensemble method for predicting the pathogenicity of missense variants.
  • VEP (Variant Effect Predictor): A toolset from Ensembl to analyze and annotate your genomic variants.
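
When comparing tools like these on your own dataset, a threshold-free measure such as the area under the ROC curve (AUC) avoids having to pick a cutoff for each tool. The sketch below computes AUC by pairwise comparison and ranks tools by it. It assumes every tool returned a score for every variant, and it treats lower SIFT scores as more damaging, which is how SIFT scores are oriented. The function names and all scores shown are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the chance a pathogenic variant outscores a benign one (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def rank_tools(labels, tool_scores, lower_is_damaging=("SIFT",)):
    """Return (tool, AUC) pairs sorted from best to worst."""
    results = {}
    for tool, scores in tool_scores.items():
        if tool in lower_is_damaging:
            # Flip the sign so that a higher value always means more damaging.
            scores = [-s for s in scores]
        results[tool] = roc_auc(labels, scores)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Made-up labels and scores for illustration only.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = {
    "SIFT": [0.0, 0.02, 0.30, 0.10, 0.60, 1.0],
    "REVEL": [0.9, 0.8, 0.7, 0.4, 0.2, 0.1],
}
ranking = rank_tools(example_labels, example_scores)
```

The pairwise loop is quadratic in the number of variants, which is fine for a single-gene dataset. In practice some tools will not score every variant, so decide explicitly how to handle missing predictions before comparing tools.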

Programming:

  • Python: Recommended language. Libraries like Pandas (data manipulation), NumPy (numerical operations), and Scikit-learn (machine learning) will be invaluable.
  • R: An excellent alternative for statistical analysis and data visualization.

GitHub Repository

Accept an empty assignment here