


Project Overview
In this project, I developed machine learning models to predict breast cancer stages from a dataset of 4,024 records, each with 16 clinical and demographic attributes. The main goal was to support timely identification and management of breast cancer stages, with the aim of improving patient care and outcomes.
Technical Approach & Skills
To ensure robust and reliable predictions, I implemented the following steps:
- Data Collection & Wrangling: Utilized Python (Pandas, NumPy) to merge, clean, and preprocess data from the SEER dataset.
- Feature Engineering: Created derived features and used stratified sampling so that each stage kept the same proportion in the training and test sets, making per-stage evaluation more reliable.
- Model Development: Trained and tuned Random Forest and Support Vector Machine (SVM) models using scikit-learn.
- Model Evaluation: Measured accuracy, precision, recall, and F1-scores to validate performance across multiple stages.
- Visualization: Employed Matplotlib/Seaborn to generate insightful charts for data exploration and final reporting.
- Version Control & Collaboration: Used Git and GitHub to manage and share code, ensuring reproducibility.
Key Skills Demonstrated: Python, scikit-learn, Pandas, NumPy, Data Cleaning, Exploratory Data Analysis (EDA), Random Forest, SVM, Model Evaluation, Visualization, Version Control.
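The split-and-train steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with make_classification, and the number of stages (five), the class weights, the 80/20 split, and all hyperparameters are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cleaned SEER table (assumption: the real
# features and stage labels are not reproduced here).
X, y = make_classification(
    n_samples=4024,
    n_features=15,
    n_informative=8,
    n_classes=5,               # hypothetical number of stages
    n_clusters_per_class=1,
    weights=[0.35, 0.30, 0.20, 0.10, 0.05],  # deliberately imbalanced
    random_state=42,
)

# stratify=y keeps each class's share the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Class proportions in the full data and in the test split, for comparison.
full_share = np.bincount(y) / len(y)
test_share = np.bincount(y_test) / len(y_test)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # SVMs are sensitive to feature scale, so scale inside a pipeline.
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test set reaches the SVM during training.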
Key Results
- Random Forest (100% Accuracy): Classified every record in the test set correctly, a promising result for clinical decision support that still needs confirmation on independent data.
- SVM (99.75% Accuracy): Highly effective at capturing complex patterns within the data.
- Precision & Recall: Both models achieved high precision and recall across stages, keeping false positives and false negatives low.
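To make the per-stage metrics concrete, here is a small sketch of how precision, recall, and F1 can be computed for each class from true and predicted labels. The helper name per_class_metrics and the stage labels are hypothetical and for illustration only; the project itself reports these metrics through scikit-learn.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this stage, but it was wrong
            fn[truth] += 1  # the true stage was missed
    metrics = {}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical stage labels, not taken from the project's results.
y_true = ["IIA", "IIA", "IIB", "IIB", "IIIA", "IIIA"]
y_pred = ["IIA", "IIB", "IIB", "IIB", "IIIA", "IIA"]

for stage, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"{stage}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting these values per stage, rather than overall accuracy alone, shows whether a rare stage is being misclassified even when the headline accuracy is high.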
Impact & Future Direction
By accurately predicting cancer stages, these models could support oncologists in making data-driven decisions, enabling more personalized treatment plans. Future iterations may include larger, more diverse datasets and advanced deep learning techniques to test how well the results generalize and scale.
Explore the Project
For the complete code and technical details, visit my GitHub repository.
You can also read the full project report for further insights.