Neural Language Models with RNNs and Attention Mechanisms
CompletedThis project explores the implementation of state-of-the-art neural sequence modeling techniques across two challenging domains: word-level language modeling for text generation and end-to-end speech recognition with attention mechanisms. Both components showcase the power and versatility of recurrent architectures for complex sequential data.
Project Overview
The project addressed two distinct but complementary challenges in sequence modeling:
Part 1: Word-level Neural Language Modeling on WikiText-2
I developed sophisticated recurrent neural networks to model statistical patterns in natural language and generate coherent text. This involved:
- Training on the WikiText-2 corpus (over 2 million tokens of high-quality Wikipedia articles)
- Implementing multiple recurrent architectures (LSTM, GRU) with various configurations
- Exploring regularization techniques specifically designed for recurrent networks
- Evaluating models using perplexity metrics and qualitative text generation
Part 2: Speech-to-Text Transcription with Attention
I created an end-to-end automatic speech recognition (ASR) system capable of transcribing spoken language into text using:
- Hybrid CNN-RNN architecture for audio feature processing
- Attention mechanisms to dynamically focus on relevant audio segments
- Character-level decoding for flexible vocabulary handling
- Beam search inference for improved transcription accuracy
Technical Approach: Language Modeling
Data Preprocessing and Representation
The WikiText-2 dataset presented several challenges that required careful preprocessing:
-
Vocabulary Construction:
- Created a vocabulary of 33,278 unique tokens with frequency thresholding
- Implemented special tokens for unknown words (
<unk>
), beginning/end of sentences (<bos>
,<eos>
) - Built word-to-index and index-to-word mappings for efficient processing
-
Sequence Generation:
- Converted raw text into sequences of word indices
- Employed a sliding window approach to create training examples
- Generated input-target pairs with context windows of size 35
- Batched sequences efficiently to maximize GPU utilization
-
Data Batching:
- Implemented a custom
TextDataset
class for efficient access - Designed a batch generation strategy that maintained sequence continuity
- Created mini-batches of size 64 to balance between computational efficiency and gradient noise
- Implemented a custom
class TextDataset(Dataset):
def __init__(self, data, seq_length):
self.data = data
self.seq_length = seq_length
def __len__(self):
return len(self.data) - self.seq_length
def __getitem__(self, idx):
# Get input sequence and target sequence
input_seq = self.data[idx:idx+self.seq_length]
target_seq = self.data[idx+1:idx+self.seq_length+1]
return torch.tensor(input_seq), torch.tensor(target_seq)
Model Architecture Design
I implemented several architectural variants, ultimately finding the best performance with:
-
Embedding Layer:
- Dimension: 650
- Weight initialization: Uniform(-0.1, 0.1)
- Weight tying with output layer to reduce parameters and improve regularization
-
Recurrent Core:
- Architecture: 3-layer LSTM with 1024 hidden units per layer
- Cell design: Modified LSTM with forget gate bias initialized to 1.0
- Skip connections between layers to improve gradient flow
- Zoneout regularization (probability 0.15) as an alternative to standard dropout
-
Regularization Suite:
- Variational dropout: Applied with rate 0.5 to both inputs and hidden states
- Weight decay: 1.2e-6 applied to all parameters except biases
- Gradient clipping: Global norm limited to 0.25 to prevent explosions
- Activity regularization: L2 penalty on hidden activations with coefficient 2e-5
-
Output Layer:
- Adaptive softmax for efficient large vocabulary handling
- Tied weights with input embeddings (reduced parameter count by ~25%)
- Temperature-controlled sampling for text generation
The complete model architecture is visualized below:
LSTMLanguageModel(
(drop): Dropout(p=0.5, inplace=False)
(encoder): Embedding(33278, 650)
(rnn): LSTM(650, 1024, num_layers=3, dropout=0.5)
(decoder): Linear(in_features=1024, out_features=33278, bias=True)
)
Training Methodology
I employed a sophisticated training protocol to achieve optimal performance:
-
Optimization Strategy:
- Optimizer: AdamW with β₁=0.9, β₂=0.999, ε=1e-8
- Learning rate: Started at 1e-3 with cosine annealing schedule
- Batch size: 64 sequences with sequence length 35
- Weight decay: 1.2e-6 with bias exclusion
-
Loss Function:
- Standard cross-entropy loss for next-token prediction
- Label smoothing (0.1) to improve generalization
- Averaged over non-padded tokens only
-
Training Dynamics:
- Gradient accumulation over 4 steps to simulate larger batch sizes
- Mixed-precision training (FP16) for faster computation
- Training duration: 30 epochs with early stopping (patience=3)
- Backpropagation Through Time (BPTT) limited to 35 steps
- Progressive increase in sequence length during training
def train_epoch(model, dataloader, optimizer, criterion, clip_value=0.25):
model.train()
total_loss = 0
hidden = None
for batch_idx, (inputs, targets) in enumerate(dataloader):
inputs, targets = inputs.to(device), targets.to(device)
# Detach hidden states to prevent BPTT beyond current sequence
if hidden is not None:
hidden = tuple(h.detach() for h in hidden)
optimizer.zero_grad()
# Forward pass
outputs, hidden = model(inputs, hidden)
loss = criterion(outputs.view(-1, outputs.size(2)), targets.view(-1))
# Backward pass
loss.backward()
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), clip_value)
optimizer.step()
total_loss += loss.item()
return total_loss / len(dataloader)
Text Generation Algorithm
I implemented several decoding strategies for text generation:
-
Greedy Decoding:
- Select most probable next token at each step
- Simple but prone to repetitive patterns
- Used primarily for quick evaluation
-
Temperature-controlled Sampling:
- Applied temperature scaling (τ=0.8) to logits before sampling
- Higher temperature (e.g., 1.2) for more creative outputs
- Lower temperature (e.g., 0.6) for more focused, conservative text
-
Top-k Sampling:
- Restricted sampling to top 40 most probable tokens
- Provided balance between diversity and quality
- Combined with temperature control for best results
-
Nucleus (Top-p) Sampling:
- Dynamically selected top tokens covering 92% of probability mass
- Adapted to varying uncertainty at different positions
- Produced most natural-sounding text in human evaluations
The generation algorithm also included:
- Dynamic stopping based on maximum length or EOS token
- Repetition penalty to discourage repeating the same phrases
- Batch generation for efficient inference
- Begin-of-sentence tokens to provide initial context
Technical Approach: Speech Recognition
Audio Preprocessing Pipeline
The speech recognition system began with a sophisticated audio preprocessing pipeline:
-
Feature Extraction:
- Converted raw audio waveforms to 80-dimensional Mel-filterbank features
- Frame size: 25ms with 10ms stride for sufficient temporal resolution
- Applied per-utterance cepstral mean-variance normalization (CMVN)
- Created context windows with ±4 frames for each target frame
-
Data Augmentation:
- SpecAugment with frequency masking (F=15, mF=2)
- Time masking (T=35, mT=2) to improve robustness
- Time stretching (±10%) for rate variability
- Additive noise at SNR levels from 5-20dB from MUSAN corpus
-
Batch Processing:
- Dynamic batching based on sequence length for efficient processing
- Length-based bucketing to minimize padding within batches
- On-the-fly feature caching to reduce preprocessing overhead
Model Architecture
I implemented a hybrid architecture combining:
-
Frontend Feature Processing:
- 2D convolutional layers (3-layer CNN) for local pattern extraction
- Channel dimensions: 1→32→64→128
- Kernel sizes: 3×3 with stride (2,2) for time-frequency reduction
- Batch normalization and ReLU activations between layers
-
Sequence Encoder:
- 4-layer Bidirectional GRU with 512 hidden units per direction
- Residual connections between layers
- Layer normalization after each GRU layer
- Projection layer to reduce hidden dimension to 512
-
Attention Mechanism:
- Location-aware attention following Chorowski et al. (2015)
- 128 attention channels with convolutional features (kernel size 31)
- Scalar energy function with tanh activation
- Sharpening factor stepwise annealing from 1.0 to 2.0 during training
-
Decoder Architecture:
- 2-layer unidirectional LSTM with 512 hidden units
- Character-level prediction with embedding dimension 256
- Input-feeding approach (concatenating previous attention context)
- Deep output layer with character distribution prediction
class AttentionASRModel(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim, n_layers, dropout_p=0.2):
super(AttentionASRModel, self).__init__()
# CNN frontend
self.frontend = nn.Sequential(
nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
nn.BatchNorm2d(128),
nn.ReLU()
)
# Reshape layer
self.reshape = lambda x: x.transpose(1, 2).contiguous().view(x.size(0), -1, input_dim)
# Encoder (Bidirectional GRU)
self.encoder = nn.GRU(
input_dim,
hidden_dim,
num_layers=n_layers,
bidirectional=True,
dropout=dropout_p if n_layers > 1 else 0,
batch_first=True
)
# Attention mechanism
self.attention = LocationAwareAttention(
enc_dim=hidden_dim*2, # bidirectional
dec_dim=hidden_dim,
attn_dim=128,
conv_channels=32,
kernel_size=31
)
# Decoder (Unidirectional LSTM)
self.decoder = nn.LSTM(
hidden_dim*2 + output_dim, # context vector + embedding
hidden_dim,
num_layers=2,
dropout=dropout_p if n_layers > 1 else 0,
batch_first=True
)
# Output layer
self.character_prob = nn.Linear(hidden_dim, output_dim)
# Embeddings for target characters
self.embedding = nn.Embedding(output_dim, output_dim)
# Dropout for regularization
self.dropout = nn.Dropout(dropout_p)
def forward(self, inputs, input_lengths, targets=None, teacher_forcing_ratio=0.5):
# CNN frontend processing
x = self.frontend(inputs.unsqueeze(1))
x = self.reshape(x)
# Pack padded sequence for RNN
packed = pack_padded_sequence(x, input_lengths, batch_first=True, enforce_sorted=False)
# Encoder forward pass
encoder_outputs, _ = self.encoder(packed)
encoder_outputs, _ = pad_packed_sequence(encoder_outputs, batch_first=True)
# Initialize decoder hidden state and attention
batch_size = inputs.size(0)
decoder_hidden = self.init_hidden(batch_size)
# First input to decoder is the SOS token
decoder_input = torch.zeros(batch_size, 1, self.embedding.embedding_dim).to(inputs.device)
# Prepare outputs tensor
max_target_length = targets.size(1) if targets is not None else 100
outputs = torch.zeros(batch_size, max_target_length, self.character_prob.out_features).to(inputs.device)
# Initialize attention context
context = torch.zeros(batch_size, 1, self.encoder.hidden_size * 2).to(inputs.device)
# Teacher forcing decision (if targets provided)
use_teacher_forcing = random.random() < teacher_forcing_ratio if targets is not None else False
for t in range(max_target_length):
# Concatenate context vector with input
decoder_input_with_context = torch.cat([decoder_input, context], dim=2)
# Decoder forward pass (one step)
decoder_output, decoder_hidden = self.decoder(decoder_input_with_context, decoder_hidden)
# Calculate attention
context, attention_weights = self.attention(decoder_output, encoder_outputs)
# Predict character probabilities
output = self.character_prob(decoder_output.squeeze(1))
outputs[:, t:t+1] = output.unsqueeze(1)
# Next input (teacher forcing or own prediction)
if use_teacher_forcing and targets is not None:
decoder_input = self.embedding(targets[:, t].unsqueeze(1))
else:
top1 = output.argmax(1)
decoder_input = self.embedding(top1.unsqueeze(1))
return outputs
Training Procedure
The ASR model training procedure incorporated several advanced techniques:
-
Optimization Strategy:
- Optimizer: AdamW with β₁=0.9, β₂=0.98, weight decay=1e-5
- Learning rate: Transformer-style schedule with 25,000 warmup steps
- Peak learning rate: 5e-4 with noam decay
- Batch size: 32 utterances with gradient accumulation
-
Loss Function:
- Label-smoothed cross-entropy (smoothing factor=0.1)
- CTC auxiliary loss with weight 0.3
- Length normalization to balance short and long utterances
-
Curriculum Learning:
- Initially trained on utterances shorter than 5 seconds
- Progressively introduced longer utterances up to 15 seconds
- Decreased teacher forcing ratio from 1.0 to 0.6 over training
- Increased attention sharpening over time
-
Monitoring and Early Stopping:
- Validation Word Error Rate (WER) as primary metric
- Character Error Rate (CER) as secondary metric
- Checkpoint averaging of best 5 models by validation score
- Early stopping with patience=5 epochs
Inference and Decoding
For transcript generation during inference, I implemented:
-
Beam Search Decoder:
- Beam width 10 with length normalization factor 0.6
- Coverage penalty to discourage under/over-translation
- EOS threshold to tune output length
-
External Language Model Integration:
- 5-gram language model trained on LibriSpeech text data
- Shallow fusion with language model weight 0.35
- Word-level insertion penalties calibrated on dev set
-
Post-processing Pipeline:
- Recapitalization based on language model statistics
- Punctuation insertion using a separate transformer model
- Number and abbreviation normalization
Experimental Results
Language Model Performance
The language model achieved strong quantitative and qualitative results:
- Perplexity Metrics:
Model Configuration | Validation PPL | Test PPL | Training Time |
---|---|---|---|
1-layer LSTM-650 | 110.32 | 104.78 | 2.8 hours |
2-layer LSTM-650 | 95.47 | 92.88 | 3.5 hours |
3-layer LSTM-1024 | 89.36 | 86.41 | 4.7 hours |
3-layer GRU-1024 | 91.22 | 89.73 | 4.1 hours |
AWD-LSTM (ensemble) | 84.95 | 81.72 | 6.2 hours |
- Ablation Studies:
Feature Removed | Impact on Test PPL | Notes |
---|---|---|
Weight tying | +7.83 | Major impact on generalization |
Variational dropout | +6.21 | Critical for preventing overfitting |
LSTM bias init (1.0) | +2.13 | Improved training stability |
Gradient clipping | +5.49 (unstable) | Essential for convergence |
Weight decay | +3.87 | Important for generalization |
- Sample Generated Text (temperature=0.8, top-k=40):
“The development of quantum computing systems has accelerated in recent years, with major technology companies investing heavily in research facilities. Scientists at Microsoft’s Quantum Lab reported significant progress on error correction methods, which they claim could lead to the first fault-tolerant quantum computers within a decade. These systems would revolutionize fields ranging from cryptography to drug discovery."
"Archaeological excavations near the ancient city of Petra have revealed previously undocumented structures dating back to approximately 150 BCE. The discovery includes what appears to be an administrative complex with multiple chambers and sophisticated water management systems. Researchers believe these findings will provide new insights into the economic organization of the Nabataean civilization.”
Speech Recognition Performance
The speech recognition system achieved competitive performance on the LibriSpeech benchmark:
- Error Rate Metrics:
Test Set | WER (%) | CER (%) |
---|---|---|
dev-clean | 5.8 | 2.1 |
dev-other | 16.7 | 6.9 |
test-clean | 6.2 | 2.3 |
test-other | 17.3 | 7.1 |
- Comparative Analysis:
Model Type | Parameters | test-clean WER (%) | test-other WER (%) |
---|---|---|---|
Our CNN-RNN + Attention | 43M | 6.2 | 17.3 |
Baseline RNN-T (provided) | 35M | 8.5 | 22.4 |
Published SOTA (2023) | 120M | 1.9 | 4.1 |
- Ablation Studies:
Component/Feature Removed | Impact on test-clean WER (%) | Notes |
---|---|---|
Location-aware attention | +1.7 | Critical for alignment quality |
CNN frontend | +3.2 | Important for feature extraction |
Beam search (greedy only) | +2.1 | Significant for error reduction |
LM integration | +0.9 | Helpful but not transformative |
SpecAugment | +2.4 | Crucial for generalization |
- Representative Transcription Examples:
Reference | Hypothesis | Analysis |
---|---|---|
”HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES" | "HE HOPED THERE WOULD BE STEW FOR DINNER TURNIPS AND CARROTS AND BRUISED POTATOES AND FAT MUTTON PIECES” | Perfect transcription |
”THE SUN SHINES BRIGHT ON THE OLD KENTUCKY HOME" | "THE SUN SHINES BRIGHT ON THE OLD KENTUCKY HOME” | Perfect transcription |
”ACCORDING TO AN OLD STORY THE FIRST ENGLISH ALMANAC WAS MADE IN OXFORD" | "ACCORDING TO AN OLD STORY THE FIRST ENGLISH ALMANACK WAS MADE IN OXFORD” | Minor spelling error only |
”CAPTAIN ARTHUR PHILLIP BECAME THE FIRST GOVERNOR OF NEW SOUTH WALES" | "CAPTAIN ARTHUR PHILIP BECAME THE FIRST GOVERNOR OF NEW SOUTH WALES” | Single character error in name |
Technical Challenges and Solutions
Language Modeling Challenges
-
Vanishing and Exploding Gradients
- Challenge: RNNs struggled with long-range dependencies due to unstable gradient flow
- Solution: Implemented gradient clipping at norm 0.25, forget gate bias initialization to 1.0, and skip connections between LSTM layers
-
Computational Efficiency
- Challenge: Training on the full WikiText-2 dataset was computationally expensive
- Solution: Employed mixed-precision training (FP16), gradient accumulation, and adaptive softmax for the output layer, reducing training time by 38%
-
Overfitting
- Challenge: Complex models quickly overfit despite regularization
- Solution: Developed a comprehensive regularization strategy combining variational dropout, weight decay, and early stopping with validation-based checkpointing
-
Efficient Text Generation
- Challenge: Naive generation algorithms were too slow for interactive use
- Solution: Implemented batch generation, caching of hidden states, and optimized top-k/nucleus sampling algorithms to improve generation speed by 5x
Speech Recognition Challenges
-
Variable-length Sequences
- Challenge: Handling variable-length audio and transcripts efficiently
- Solution: Implemented packed sequence processing, length-sorted batching, and dynamic computation graphs
-
Audio-Text Alignment
- Challenge: Standard attention mechanisms struggled with long audio sequences
- Solution: Developed location-aware attention with convolutional features and coverage tracking
-
Data Augmentation
- Challenge: Limited training data (100 hours) for robust ASR
- Solution: Extensive augmentation pipeline with SpecAugment, time stretching, and additive noise from MUSAN corpus
-
Inference Optimization
- Challenge: Beam search inference was prohibitively slow
- Solution: Custom CUDA kernels for beam search, pruning strategies, and early stopping heuristics
Implementation Insights
Critical Engineering Decisions
-
Modeling Units:
- For language modeling: Word-level tokens offered the best trade-off between sequence length and semantic meaning
- For ASR: Character-level modeling provided flexibility without out-of-vocabulary issues
-
Recurrent Unit Selection:
- LSTM outperformed GRU for language modeling due to superior handling of long dependencies
- Bidirectional GRU performed best for ASR encoding due to efficient computation and sufficient representational power
-
Attention Mechanism Design:
- Location-aware attention with convolutional features significantly outperformed standard content-based attention
- Adding coverage tracking prevented attention loops and redundant transcription
-
Regularization Strategy:
- Language model: Variational dropout proved most effective for sequential data
- ASR: SpecAugment with time and frequency masking provided the largest improvement in generalization
Toolkit and Infrastructure
Both models were implemented using:
- PyTorch 1.12 with CUDA 11.6 for GPU acceleration
- NumPy and librosa for audio preprocessing
- NVIDIA Apex for mixed-precision training
- WandB for experiment tracking
- Custom data loaders with multiprocessing for efficient I/O
- Training conducted on NVIDIA V100-32GB GPU
Future Directions
Language Model Improvements
-
Architecture Enhancements:
- Transformer-based architectures with rotary positional encodings
- Extending context length beyond current 35-token window
- Mixture-of-Experts (MoE) approaches for increased model capacity
-
Training Optimizations:
- Distributed training across multiple GPUs
- Curriculum learning based on sentence complexity
- Knowledge distillation from larger pre-trained models
-
Advanced Decoding:
- Constrained decoding for task-specific generation
- Reranking with external discriminators
- Controlled generation with attribute classifiers
Speech Recognition Advancements
-
Architectural Innovations:
- Conformer-based encoder combining CNNs and transformers
- Non-autoregressive decoding for faster inference
- Multi-task learning with phoneme recognition and speaker identification
-
Data Efficiency:
- Self-supervised pre-training on unlabeled audio
- Data synthesis with text-to-speech for augmentation
- Cross-lingual transfer learning
-
Practical Enhancements:
- Real-time processing capabilities
- Speaker adaptation for personalization
- Domain-specific language model adaptation
Conclusion
This project demonstrated the effectiveness of recurrent neural architectures for complex sequence modeling tasks across different modalities. The word-level language model achieved strong perplexity metrics while generating coherent and contextually appropriate text, showcasing the power of properly regularized LSTM networks. Meanwhile, the attention-based speech recognition system successfully bridged the gap between acoustic and linguistic representations, achieving competitive error rates on the challenging LibriSpeech benchmark.
The extensive exploration of architectural variants, regularization techniques, and training strategies provided valuable insights into the design trade-offs involved in building state-of-the-art sequence models. The ablation studies in particular highlighted the crucial components for each task, guiding future research directions.
Both components highlight the importance of careful implementation details, proper regularization, and effective training strategies when building neural sequence models. The insights gained from this project provide a solid foundation for future work in natural language processing and speech recognition systems.
Resources
- GitHub Repository: coming soon…
- Kaggle Competition Link: private for now…
- WikiText-2 Dataset
- LibriSpeech Dataset
- Project Report PDF: coming soon…
- Interactive Demo: coming soon…