This directory implements AWQ (Activation-aware Weight Quantization), a state-of-the-art technique for compressing neural networks while maintaining performance. The implementation demonstrates how to reduce model size significantly with minimal accuracy degradation.
Model quantization reduces the precision of neural network weights and activations, enabling deployment on resource-constrained devices while maintaining competitive performance. AWQ focuses on preserving important weights based on activation patterns, which gives a better accuracy-compression trade-off than quantizing all weights uniformly.
Complete Weight Quantization Pipeline
Demonstrates end-to-end quantization of a neural network using AWQ methodology:
Original Full-Precision Network
A straightforward feedforward network for testing quantization:
- Architecture: 64 → 128 → 256 → 128 → 10 (configurable dimensions)
- Activation: ReLU between hidden layers
- Output: Raw logits for classification
- No Bias: Simplified design focusing on weight quantization
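The architecture described above can be sketched in a few lines. This is a minimal NumPy illustration, not the code in awq.py (which targets CUDA and presumably uses a deep-learning framework); `init_weights` and `forward` are hypothetical names chosen for this example.

```python
import numpy as np

def init_weights(dims, seed=0):
    """Create one bias-free weight matrix per pair of adjacent layer sizes."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(x, weights):
    """Apply ReLU after every layer except the last, which returns raw logits."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ weights[-1]

dims = [64, 128, 256, 128, 10]          # configurable layer sizes
weights = init_weights(dims)
batch = np.random.default_rng(1).standard_normal((8, dims[0]))
logits = forward(batch, weights)
print(logits.shape)  # (8, 10)
```

Keeping the layers bias-free means each layer is fully described by one weight matrix, which is the only thing the quantizer has to touch.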
Post-Quantization Network Wrapper
Manages quantized model components:
- Quantized Layers: Converted low-precision weight matrices
- Scale Information: Metadata for activation scaling
- Identical Forward: Same computation graph as original network
Key Functions:
- compute_activation_stats() (from helper.py):
  - Collects activation statistics during forward passes
  - Measures channel-wise activation magnitudes
  - Identifies important activation channels for weight preservation
- search_module_scale() (from helper.py):
  - Finds optimal scaling factors for each layer
  - Balances quantization error with activation importance
  - Returns quantized weights and scaling metadata
- quantize_model():
  - Orchestrates the complete quantization pipeline
  - Applies AWQ to each linear layer sequentially
  - Preserves activation functions and network structure
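As an illustration of the statistics-collection step, the sketch below records the mean absolute activation of every input channel for each layer while calibration data flows through a bias-free ReLU network. It is an assumption about what such a helper might look like, not the actual compute_activation_stats() in helper.py; `collect_channel_stats` is a hypothetical name, and mean absolute value is one of several reasonable magnitude measures.

```python
import numpy as np

def collect_channel_stats(calib_x, weights):
    """Return, for each layer, the mean |activation| of every input channel."""
    stats = []
    x = calib_x
    for i, w in enumerate(weights):
        stats.append(np.abs(x).mean(axis=0))   # shape: (in_features,)
        x = x @ w
        if i < len(weights) - 1:               # ReLU on hidden layers only
            x = np.maximum(x, 0.0)
    return stats

rng = np.random.default_rng(0)
dims = [64, 128, 256, 128, 10]
weights = [rng.standard_normal((a, b)) / np.sqrt(a)
           for a, b in zip(dims[:-1], dims[1:])]
stats = collect_channel_stats(rng.standard_normal((32, 64)), weights)
print([s.shape for s in stats])  # [(64,), (128,), (256,), (128,)]
```

Channels with large entries in `stats` are the ones whose weights matter most to the layer output.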
Unlike naive quantization approaches that treat all weights equally, AWQ uses activation statistics to decide which weights to protect.
Core Insight: Weights corresponding to large activation channels contribute more to model output and should be preserved with higher precision.
Mathematical Foundation:
For layer output Y = XW:
- Measure activation magnitudes ||X[:, i]|| for each channel i
- Scale quantization based on activation importance
- Preserve high-impact weights, aggressively quantize low-impact ones
- Calibration: Run representative data through the network
- Statistics Collection: Measure per-channel activation magnitudes
- Scale Search: Find optimal scaling factors for weight quantization
- Weight Conversion: Convert weights to low-precision representation
- Metadata Storage: Save scaling information for inference
- Synthetic Data: Generates realistic test data for validation
- Weight Comparison: Detailed analysis of quantization effects per layer
- Output Verification: Compares model outputs before/after quantization
- Size Analysis: Calculates compression ratios achieved
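The calibration, scale-search, and weight-conversion steps listed above can be sketched for a single layer. The idea rests on the identity XW = (X / s)(s * W) for any positive per-channel scale s: scaling up the weight rows that meet large activations makes them lose less to rounding. The code below is a simplified NumPy sketch under stated assumptions, not the search_module_scale() in helper.py: `quantize_rtn` and `search_scale` are hypothetical names, the scale is taken as the channel statistic raised to a power alpha found by grid search, and the weights are "fake-quantized" (rounded, then stored back as floats).

```python
import numpy as np

def quantize_rtn(w, bits):
    """Symmetric round-to-nearest with one step size per output column."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(w).max(axis=0, keepdims=True) / qmax
    step = np.where(step == 0, 1.0, step)      # guard all-zero columns
    return np.clip(np.round(w / step), -qmax, qmax) * step

def search_scale(x, w, bits=4, grid=20):
    """Grid-search alpha in [0, 1); per-input-channel scale is stat ** alpha."""
    stat = np.abs(x).mean(axis=0) + 1e-8       # channel-wise magnitude
    y_ref = x @ w                              # full-precision output
    best = (np.inf, None, None)
    for k in range(grid):
        s = stat ** (k / grid)
        s = s / np.sqrt(s.max() * s.min())     # keep scales centred around 1
        # Quantize the scaled weights, then fold the scale back in.
        w_q = quantize_rtn(w * s[:, None], bits) / s[:, None]
        err = float(np.mean((x @ w_q - y_ref) ** 2))
        if err < best[0]:
            best = (err, s, w_q)
    return best

rng = np.random.default_rng(0)
# Synthetic calibration data with deliberately uneven channel magnitudes.
x = rng.standard_normal((256, 64)) * rng.uniform(0.1, 5.0, size=64)
w = rng.standard_normal((64, 128)) / 8.0
err, s, w_q = search_scale(x, w)
plain = float(np.mean((x @ quantize_rtn(w, 4) - x @ w) ** 2))
print(err <= plain)  # True: alpha = 0 reproduces plain rounding
```

Because alpha = 0 is part of the grid, the search can never do worse than plain rounding on the calibration data; how much it improves depends on how uneven the channel magnitudes are.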
The implementation provides detailed metrics:
Weight Error Analysis:
- Layer-by-layer weight difference computation
- Accumulation of quantization errors across the network
- Identification of layers most affected by quantization
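A per-layer error report of this kind might be computed as follows. This is a hedged sketch rather than the repository's code: `layer_errors` is a hypothetical name, mean absolute difference is an assumed error metric, and rounding to one decimal place stands in for a real quantizer.

```python
import numpy as np

def layer_errors(originals, quantized):
    """Mean |W - W_q| per layer, its running total, and the worst layer index."""
    per_layer = [float(np.abs(w - q).mean()) for w, q in zip(originals, quantized)]
    cumulative = np.cumsum(per_layer).tolist()
    worst = int(np.argmax(per_layer))
    return per_layer, cumulative, worst

rng = np.random.default_rng(0)
dims = [64, 128, 256, 128, 10]
originals = [rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
quantized = [np.round(w, 1) for w in originals]   # stand-in for real quantization
per_layer, cumulative, worst = layer_errors(originals, quantized)
for i, (e, c) in enumerate(zip(per_layer, cumulative)):
    print(f"layer {i}: error={e:.4f} cumulative={c:.4f}")
print("most affected layer:", worst)
```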
Model Size Reduction:
- Calculates exact memory savings from quantization
- Reports compression ratios (8-bit weights are about 75% smaller than FP32 and 50% smaller than FP16, before scale metadata)
- Accounts for different bit-widths (4-bit, 8-bit options)
- GPU Compatibility: Designed for CUDA execution
- Configurable Precision: Supports different quantization bit-widths
- Preservation of Architecture: Maintains original network structure
- Inference Efficiency: Quantized models suitable for deployment
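The size-reduction figures reported above can be checked by hand for the default 64 → 128 → 256 → 128 → 10 network. The sketch below counts only weight storage and ignores the small overhead of scale metadata, which is an assumption of this example; `model_size_bytes` is a hypothetical helper.

```python
def model_size_bytes(dims, bits):
    """Weight storage for a bias-free network with the given layer sizes."""
    params = sum(a * b for a, b in zip(dims[:-1], dims[1:]))
    return params * bits // 8

dims = [64, 128, 256, 128, 10]
fp32 = model_size_bytes(dims, 32)
for bits in (8, 4):
    q = model_size_bytes(dims, bits)
    print(f"{bits}-bit: {q} bytes, {100 * (1 - q / fp32):.1f}% smaller than FP32")
# 8-bit: 75008 bytes, 75.0% smaller than FP32
# 4-bit: 37504 bytes, 87.5% smaller than FP32
```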
    python awq.py

Expected Output Flow:
- Original Weights: Display full-precision weight matrices
- Quantization Process: AWQ algorithm execution
- Quantized Weights: Show post-quantization weight values
- Error Analysis: Per-layer quantization error accumulation
- Output Comparison: Before/after quantization inference results
- Compression Stats: Model size reduction percentage
    # Modify quantization parameters
    quantization_bits = 4               # More aggressive compression
    input_dim = 128                     # Larger network
    hidden_dims = [256, 512, 256, 10]   # Different architecture

- Uniform Scaling: Applies same quantization to all weights
- Limited Accuracy: Can suffer significant performance degradation
- Simple Implementation: Straightforward to apply
- Training Integration: Simulates quantization during training
- High Accuracy: Can achieve excellent results
- Computational Cost: Requires full retraining
- Selective Quantization: Preserves important weights based on activations
- Post-Training: No retraining required
- Favorable Trade-off: Better accuracy than naive PTQ, far cheaper than QAT
- Memory Reduction: 2-8x smaller model sizes
- Inference Speed: Faster computation with low-precision arithmetic
- Energy Efficiency: Reduced power consumption on mobile devices
- Hardware Support: Better utilization of specialized inference chips
- Mobile Deployment: Fitting large models on smartphones/tablets
- Edge Computing: IoT devices with limited resources
- Server Optimization: Higher throughput in data centers
- Real-time Applications: Reduced latency for time-critical systems
- Activation Awareness: Leveraging activation patterns leads to better quantization decisions
- Selective Preservation: Not all weights are equally important for model performance
- Calibration Data: Quality of calibration dataset significantly impacts results
- Layer Sensitivity: Different layers show varying sensitivity to quantization
This implementation demonstrates how modern quantization techniques can dramatically reduce model size while maintaining competitive accuracy, making deployment of large neural networks feasible across diverse computing environments.