Python Flask Machine Learning Cybersecurity scikit-learn

Neural-Based Static Malware Analysis Engine

End-to-end machine learning solution that statically analyzes Windows PE files and classifies them as goodware or malware, using a Random Forest model served through a Flask API.

Overview

An end-to-end machine learning solution that statically analyzes Windows Portable Executable (PE) files and classifies them as goodware or malware. The project pairs the trained model with a custom interactive web UI that supports both individual file analysis and bulk processing.

Key Features

  • Robust Machine Learning Pipeline:
    • Trained on the Brazilian Malware Dataset (50,731 samples).
    • Evaluated 7 distinct models (including XGBoost, LightGBM, CatBoost, PyTorch MLP, and Random Forest) using 10-fold stratified cross-validation.
    • Selected Random Forest for production because it delivered the strongest and most consistent performance of the models evaluated.
    • Performance Benchmarks: 99.82% AUC and 98.76% accuracy on an unseen hold-out test set.
  • Interactive Web Platform:
    • Developed a Flask application that lets users enter numeric PE features manually or upload batch CSV files for processing.
    • Computes evaluation metrics in real time (confusion matrix, overall accuracy, and AUC) when the uploaded batch includes labels.
  • Premium User Interface:
    • Engineered a cohesive, immersive “Cyber-Intelligence” theme entirely from scratch.
    • Features dynamic CSS grid backgrounds, staggered load animations, and tailored typography (Share Tech Mono and Rajdhani) to avoid a generic template look.
  • Production Ready:
    • Added unit tests (pytest), modular configuration, and a CI/CD pipeline built on GitHub Actions to govern deployment.
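
The batch-evaluation behavior listed under the web platform (metrics appear only when the uploaded CSV carries labels) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: the `evaluate_batch` helper, the `label` column name, the 0.5 decision threshold, and the toy model trained on synthetic data are all hypothetical.

```python
import io

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score


def evaluate_batch(model, csv_text, label_col="label"):
    """Score every CSV row; add metrics only if a label column is present."""
    frame = pd.read_csv(io.StringIO(csv_text))
    has_labels = label_col in frame.columns
    features = frame.drop(columns=[label_col]) if has_labels else frame

    # Probability of the positive class (assumed here: 1 = malware).
    scores = model.predict_proba(features.to_numpy())[:, 1]
    predictions = (scores >= 0.5).astype(int)
    result = {"predictions": predictions.tolist(), "scores": scores.tolist()}

    if has_labels:
        y_true = frame[label_col].to_numpy()
        result["accuracy"] = accuracy_score(y_true, predictions)
        result["confusion_matrix"] = confusion_matrix(
            y_true, predictions, labels=[0, 1]
        ).tolist()
        # AUC is undefined when the batch contains a single class.
        if len(np.unique(y_true)) == 2:
            result["auc"] = roc_auc_score(y_true, scores)
    return result


# Toy stand-in for the production model, trained on synthetic data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
toy_model = RandomForestClassifier(n_estimators=20, random_state=0)
toy_model.fit(X_train, y_train)

labelled_csv = (
    "f1,f2,f3,label\n"
    "2.0,0.1,0.3,1\n"
    "-2.0,0.4,-0.2,0\n"
    "1.5,-0.3,0.0,1\n"
    "-1.5,0.2,0.1,0\n"
)
report = evaluate_batch(toy_model, labelled_csv)
print(report["confusion_matrix"], report["accuracy"])
```

In a Flask view, the uploaded file's text would be passed in place of `labelled_csv` and the returned dictionary rendered in the results page; an unlabelled upload simply yields predictions and scores without the metric keys.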

Technical Stack

  • Machine Learning: scikit-learn, PyTorch, XGBoost, LightGBM, CatBoost, Pandas, NumPy
  • Backend Infrastructure: Python, Flask, Gunicorn
  • Frontend Design: HTML5, Custom CSS3 Variables, CSS Keyframe Animations
  • DevOps & Tooling: pytest, GitHub Actions (CI/CD)

Challenges & Solutions

  • Challenge: Avoiding data leakage during cross-validation, especially concerning feature scaling.
  • Solution: Wrapped preprocessing and model in scikit-learn Pipeline objects so that StandardScaler is fit only on each training fold and then applied to the matching validation fold.
  • Challenge: Designing an interface that matches the serious nature of a cybersecurity tool without looking like a generic Bootstrap template.
  • Solution: Built a custom visual system using 3D CSS perspective transforms for background grids, scanline overlays for depth, and targeted neon color variables that interactively distinguish threats from safe files.
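
The data-leakage fix from the first challenge above can be sketched as follows. This is a minimal example, assuming synthetic data from `make_classification` in place of the real PE feature matrix and arbitrary hyperparameters; it shows the general pattern of placing `StandardScaler` inside a `Pipeline` and scoring with 10-fold stratified cross-validation, not the project's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the PE feature matrix (assumption, not the real dataset).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Because the scaler is a pipeline step, every CV split refits it on the
# training fold only and then applies it to the validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc_per_fold = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over {len(auc_per_fold)} folds: {auc_per_fold.mean():.4f}")
```

Scaling has little effect on tree ensembles such as Random Forest, but it does matter for scale-sensitive models like an MLP, so keeping the scaler inside the pipeline lets the same cross-validation harness compare every candidate model without leaking validation statistics.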