🧠 Motivation
My initial attempt at facial expression recognition, completed during my university studies, involved a convolutional neural network (CNN). Despite limited resources (a modest GTX 1050 GPU and sparse datasets), I managed to reach an accuracy of 65-75%. Although promising, this wasn't sufficient for real-world applications, particularly for something as sensitive as mental health diagnostics.
Fast forward to today: armed with greater experience, advanced hardware (RTX 3060), and modern ML frameworks, I revisited this challenge using Vision Transformers (ViT). The goal was clear: surpass 80% accuracy and deepen my expertise in state-of-the-art deep learning techniques.
📊 Objective
Build and deploy a robust 7-class facial expression recognition system capable of accurately identifying fundamental human emotions:
Anger
Disgust
Fear
Happiness
Neutral
Sadness
Surprise
🖼️ Image Processing & Embedding Systems
To enable high-performing ViT-based emotion classification, the system depends heavily on rigorous image processing and efficient visual embedding strategies.
📦 Preprocessing Pipeline
Each image undergoes a well-designed set of operations:
- Face Region Isolation: All datasets are either pre-aligned (RAF-DB, AffectNet) or assume center-cropped faces.
📌 Transformations Applied (PyTorch)
Training augmentations were applied via `torchvision.transforms`, using the following logic:
- Resize: All images resized to 224x224 (ViT input)
- Random Horizontal Flip: Helps the model generalize across left/right facial symmetry
- Random Rotation (±10°): Adds rotation invariance, useful for slightly tilted faces
- Color Jitter: Adjusts brightness, contrast, and saturation for lighting diversity
- Normalization: Converts RGB values to the standardized range [-1, 1]
```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),               # ViT expects 224x224 inputs
    transforms.ColorJitter(...),                 # brightness/contrast/saturation (strengths elided)
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),               # ±10° tilt tolerance, as described above
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # maps RGB channels to [-1, 1]
])
```
✂️ CutMix Augmentation
Beyond traditional transforms, I integrated CutMix, a powerful augmentation technique that:
- Cuts a patch from one image and pastes it into another
- Mixes the labels proportionally
- Helps the model learn spatial invariance and improves generalization under occlusion or noise
This was particularly effective for small classes like fear and disgust.
CutMix is especially useful when training ViTs, as they handle global patterns better than CNNs and benefit more from mixed-structure inputs.
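For reference, here is a minimal sketch of a batch-level CutMix step (not the project's exact implementation): the mixing ratio is drawn from a Beta distribution, a rectangular patch is swapped between shuffled batch elements, and the one-hot labels are blended by the patch's area share.

```python
import torch

def cutmix(images, labels, num_classes, alpha=1.0):
    """Apply CutMix to a batch: paste a random patch from a shuffled copy
    of the batch and mix the one-hot labels by the patch area ratio."""
    images = images.clone()
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))

    _, _, h, w = images.shape
    cut_h, cut_w = int(h * (1 - lam) ** 0.5), int(w * (1 - lam) ** 0.5)
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)

    images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
    # Recompute lambda from the actual patch area (clipping may shrink it)
    lam = 1 - ((y2 - y1) * (x2 - x1) / (h * w))

    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return images, mixed_labels
```

With soft labels like these, the loss can be computed with `torch.nn.CrossEntropyLoss`, which accepts class-probability targets in recent PyTorch versions.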
🧠 Embedding System: Vision Transformer
The input image is embedded into a patch-level sequence using `patch_size=16`. Each patch is linearly projected, and positional embeddings are added before the sequence is fed into the transformer encoder.
Outputs are globally pooled and passed into a linear classification head.
ViT thus converts `(B, 3, 224, 224)` into a sequence `(B, N_patches, D)`, which is finally pooled into `(B, D)` before classification.
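As a concrete illustration, the sketch below assumes the `vit_tiny_patch16_224` backbone from timm (the one introduced in the training section): a 224x224 input yields 14 x 14 = 196 patches, each embedded into a 192-dimensional token.

```python
import timm
import torch

# Sketch of the shape flow through a ViT-Tiny backbone (timm's vit_tiny_patch16_224).
model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=7)

x = torch.randn(2, 3, 224, 224)         # (B, 3, 224, 224)
patches = model.patch_embed(x)           # (B, 196, 192): patch-level token sequence
features = model.forward_features(x)     # encoder output (includes the class token)
logits = model(x)                        # (B, 7): pooled features -> linear head
print(patches.shape, logits.shape)
```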
🏗️ Dataset Engineering
Effective data handling was crucial to the project’s success. Typical facial expression recognition (FER) datasets suffer from noise, inconsistencies, and imbalance. I tackled these issues directly:
| Dataset | Link | Notes |
|---|---|---|
| FER+ | ferplus-7cls | Reduced and standardized to seven fundamental emotions |
| AffectNet | affectnet_no_contempt | Removed the less relevant 'contempt' class to align with the project's emotion set |
| RAF-DB | raf-db-7emotions | Addressed multiple data issues by manually augmenting neutral expressions from FER+ and standardizing labels |
These enhanced datasets collectively provided 75,398 training and 8,377 validation samples, offering sufficient variety and balance to effectively train the model.
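A minimal sketch of how the three curated datasets can be combined for training, assuming each is exported in an `ImageFolder`-style layout with identical class sub-folders (the paths below are placeholders) and reusing the `train_transform` defined in the preprocessing section:

```python
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets

# Paths and folder layout are assumptions: each curated dataset has the same
# seven class sub-folders, so label indices line up when sources are concatenated.
roots = [
    "data/ferplus-7cls/train",
    "data/affectnet_no_contempt/train",
    "data/raf-db-7emotions/train",
]
train_set = ConcatDataset(
    [datasets.ImageFolder(r, transform=train_transform) for r in roots]
)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True,
                          num_workers=4, pin_memory=True)
```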
🧪 Advanced Model Training
The backbone for this project was ViT-Tiny (patch16_224) provided by timm, offering powerful performance even under hardware constraints. Key elements of my optimized training workflow included:
- Optimizer: AdamW for adaptive gradient handling.
- Scheduler: Cosine annealing learning rate with warmup phases, aiding convergence and stability.
- Augmentation: CutMix and horizontal flipping, significantly improving generalization.
- Mixed Precision Training (AMP): Faster computations and a reduced memory footprint, crucial for efficient GPU utilization.
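The core training step can be sketched as follows; the hyperparameters and structure here are illustrative assumptions rather than the exact contents of `train.py`:

```python
import timm
import torch

# Illustrative sketch: AdamW + cosine LR decay + automatic mixed precision (AMP).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=7).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)  # warmup omitted for brevity
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

def train_step(images, targets):
    images, targets = images.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):  # half-precision forward pass
        loss = criterion(model(images), targets)
    scaler.scale(loss).backward()   # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()              # scheduler.step() is then called once per epoch
```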
Sample Training Workflow:
```bash
python train.py
```
Interactive Model Deployment:
```bash
uvicorn app:app --host localhost --port 8000 --reload
```
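For context, a minimal `app.py` that the uvicorn command above could serve might look like this; the endpoint name, checkpoint path, and class ordering are assumptions, not the project's actual API:

```python
import io
import timm
import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()
CLASSES = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

# Assumed checkpoint path; the backbone matches the one used for training.
model = timm.create_model("vit_tiny_patch16_224", num_classes=7)
model.load_state_dict(torch.load("checkpoints/vit_tiny_fer.pt", map_location="cpu"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),
])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        probs = model(preprocess(image).unsqueeze(0)).softmax(dim=1)[0]
    return {"emotion": CLASSES[int(probs.argmax())], "confidence": float(probs.max())}
```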
Ⓜ️ Model
📊 Metrics
📈 Performance & Insights
The model achieved a remarkable 82.2% validation accuracy, with balanced precision and recall across emotions, notably excelling in recognizing happiness (93.4%) and neutrality (91.8%). This demonstrates ViT’s effectiveness in handling complex visual patterns compared to traditional CNN-based methods.
Key insights:
- ViT-Tiny is excellent for constrained environments, providing near state-of-the-art accuracy.
- Meticulous dataset preparation significantly enhances model performance.
- Data augmentation strategies like CutMix, paired with cosine LR scheduling, lead to robust learning.
- Publicly sharing datasets enhances reproducibility and fosters community engagement.
🔮 Future Directions
To further push the boundaries of this project, the following advancements are planned:
- Scaling Up: Experiment with the ViT-Base architecture combined with LoRA fine-tuning to capture more intricate visual patterns.
- PHQ Score Integration: Incorporate a regression head to predict Patient Health Questionnaire (PHQ) scores from facial expressions, directly targeting depression detection.
- Interactive Demos: Create user-friendly interfaces using Streamlit or Gradio for broader accessibility.
🚧 Challenges & Solutions
- Dataset Complexity: Integrating multiple disparate datasets was non-trivial; custom preprocessing scripts resolved inconsistencies and improved data quality.
- Resource Constraints: Opting for ViT-Tiny balanced model complexity against hardware limitations (RTX 3060).
- Regression Task (PHQ): The originally planned regression task required extensive multimodal video data; limited access led to its postponement, but the groundwork has been laid for future integration.
🙌 Final Thoughts
This project symbolizes my journey from academic exploration to professional-grade ML engineering. It highlights the iterative nature of machine learning—each step refines understanding and enhances technical capability. For those transitioning into AI, embracing incremental improvement is essential.
Explore the complete project source on GitHub (face-vit-phq) and follow my continuing journey in AI on Deanhub.
📬 Connect with Me
Dean Ng Kwan Lung
Blog: Portfolio
LinkedIn: LinkedIn
GitHub: GitHub
Email: kwanlung123@gmail.com