
Full Visibility into AI Training Costs: How PwC Optimized a Custom LLM Pipeline Spanning 8 AWS Regions

PwC used PointFive to map and optimize a 32-billion parameter LLM training pipeline across 8 AWS regions and 5 NVIDIA GPU architectures, identifying 11-19% cost reduction opportunities.


  • ~$180K in annual identified savings
  • ~19% cost reduction
  • 5 NVIDIA GPU architectures optimized

Overview

Client: PwC

Industry: Professional Services / AI Research

Cloud Provider: AWS

Challenge: PwC's AI research team is building a custom 32-billion parameter LLM across a complex, multi-region training pipeline using cutting-edge GPU infrastructure — from NVIDIA H200s to preview-stage B300 Blackwell GPUs. With workloads spread across 8 AWS regions, traditional FinOps tools couldn't map the full pipeline or pinpoint where waste was hiding.

Solution: PointFive's DeepWaste™ Detection Engine mapped PwC's entire LLM training pipeline end-to-end, surfacing actionable inefficiencies across compute, storage, and data transfer that traditional tools cannot see.

Results at a Glance

  • $78K/month in AI/ML infrastructure fully mapped and attributed across 8 regions
  • $9K–$15K/month in savings identified (11–19% cost reduction)
  • 5 NVIDIA GPU architectures optimized, spanning Blackwell through Turing
  • Continuous monitoring established for NVIDIA Blackwell GA pricing transition
  • Prioritized remediation plan delivered in actionable, phased engineering steps

Background

PwC, one of the world's leading professional services firms, is investing heavily in custom AI capabilities. The firm's AI research team operates a platform for training a custom 32-billion parameter large language model built on NVIDIA Megatron-LM 2.0 and Amazon SageMaker HyperPod.

The training platform is a full ML pipeline: high-memory CPU instances handle data preparation and tokenization, GPU instances power instruction fine-tuning and model evaluation, and SageMaker HyperPod clusters run distributed training — all coordinated across 8 AWS regions with FSx for Lustre providing high-throughput storage and S3 managing checkpoint distribution.
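
Mapping a pipeline like this starts with something deceptively mundane: an inventory of what is actually running where. The sketch below shows a minimal version of that first step in boto3; the region list covers only the four regions named in this study, the GPU instance-family prefixes are assumptions for illustration, and PointFive's engine goes far beyond a raw listing.

```python
import boto3

# Regions named in this case study; the remaining four are not disclosed.
REGIONS = ["us-east-1", "us-east-2", "us-west-2", "ap-south-1"]
# Instance-family prefixes covering the GPU generations mentioned (assumed).
GPU_PREFIXES = ("p6", "p5", "p4", "g5", "g4")

for region in REGIONS:
    ec2 = boto3.client("ec2", region_name=region)
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["InstanceType"].startswith(GPU_PREFIXES):
                    print(region, instance["InstanceId"], instance["InstanceType"])
```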

At approximately $78K/month, with 99.6% of spend directly tied to AI/ML workloads, even modest percentage improvements translate into meaningful savings. The team needed confidence that optimizations wouldn't disrupt a training pipeline where a single misconfiguration could waste days of GPU time.
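
The headline figures are easy to sanity-check with back-of-envelope arithmetic; every number below comes from this study.

```python
# Back-of-envelope check on the headline figures.
monthly_spend = 78_000                      # total AI/ML spend, $/month
savings_low, savings_high = 9_000, 15_000   # identified savings, $/month

print(f"{savings_low / monthly_spend:.1%} to {savings_high / monthly_spend:.1%}")
# -> 11.5% to 19.2%, consistent with the quoted 11-19% reduction

print(f"${savings_low * 12:,} to ${savings_high * 12:,} per year")
# -> $108,000 to $180,000, matching the $108K-$180K annual range
```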

Objectives

  • Map the full AI training pipeline across all compute, storage, and data transfer components spanning 8 AWS regions
  • Identify waste in GPU and ML infrastructure across SageMaker, EC2 GPU instances, FSx storage tiers, and cross-region data flows
  • Prepare for Blackwell cost impact before NVIDIA Blackwell (P6) instances transition from AWS Preview to GA pricing
  • Deliver actionable, prioritized recommendations with engineering-ready remediation steps

Challenges

Multi-region pipeline complexity: The training pipeline spans us-east-1 (checkpoint hub), ap-south-1 (primary GPU training), us-east-2 (HyperPod + data prep), us-west-2 (Blackwell testing), and 4 additional regions. A checkpoint transfer cost in us-east-1 is meaningless without understanding that it feeds a training job in ap-south-1.

Mixed compute paradigms: GPU workloads run across standalone EC2 instances, SageMaker HyperPod clusters, SageMaker Training Plans, Capacity Block reservations, and AWS Preview instances — each with different pricing models. No single AWS tool provides a unified view.

Cutting-edge hardware with no pricing history: PwC is among the earliest adopters of NVIDIA Blackwell GPUs, currently running 689 hours/month at $0 during AWS Preview. When GA pricing takes effect, this becomes a significant new cost center with no historical data to plan around.

High stakes, low tolerance for disruption: Training a 32B-parameter model is a multi-week process. The team needed optimization recommendations they could trust.

Solution

PwC adopted PointFive to bring structure and visibility to its LLM training infrastructure.

End-to-End Pipeline Mapping

The DeepWaste™ Detection Engine identified and mapped every component of the training pipeline, attributing costs and data flows across all 8 regions.

Multi-Layer Cost Analysis

Pipeline Stage | Annual Cost | Key Resources
Data Preparation & Tokenization | $139K | r6a.48xlarge, r8i.metal-96xl, c6id
GPU Training & Fine-Tuning | $123K | p5.48xlarge (H100), p4de (A100), g5 (A10G)
SageMaker HyperPod | $193K | ml.g5.12xlarge, ml.m5.12xlarge/16xlarge
High-Performance Storage | $156K | FSx for Lustre (1000 and 250 MB/s/TiB throughput tiers)
Checkpoint Storage & Distribution | $78K | S3 + cross-region transfer
Development Environments | $58K | SageMaker notebooks, Studio
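
For readers curious what the raw inputs to this kind of attribution look like, the sketch below runs a plain Cost Explorer query grouped by service and region via boto3. The stage-level mapping above is PointFive's own analysis layered on top of data like this, and the date window here is a placeholder.

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer uses a single global endpoint

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2025-01-01", "End": "2025-02-01"},  # placeholder month
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[
        {"Type": "DIMENSION", "Key": "SERVICE"},
        {"Type": "DIMENSION", "Key": "REGION"},
    ],
)
for group in response["ResultsByTime"][0]["Groups"]:
    service, region = group["Keys"]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service} / {region}: ${amount:,.2f}")
```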

Key Discoveries

  • $33K/year in dormant snapshot storage, an ideal fit for the EBS archive tier at 75% savings (see the sketch after this list)
  • GPU notebooks running 24/7 at ~35% utilization, a straightforward lifecycle-policy fix
  • Over-provisioned FSx throughput where actual I/O patterns could be served by a lower tier
  • Cross-region checkpoint transfer waste in the absence of S3 Cross-Region Replication
  • Development instances running outside working hours
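
The first discovery is also the simplest to act on. Here is a minimal sketch of snapshot archival in boto3, assuming a 90-day dormancy threshold and a single region; neither the actual threshold nor the region is stated in the study.

```python
from datetime import datetime, timedelta, timezone

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed
cutoff = datetime.now(timezone.utc) - timedelta(days=90)  # dormancy threshold assumed

for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
    for snapshot in page["Snapshots"]:
        if snapshot["StartTime"] < cutoff and snapshot.get("StorageTier") != "archive":
            # The archive tier stores snapshots at a much lower per-GB rate,
            # but restores can take up to 72 hours, so only move dormant data.
            ec2.modify_snapshot_tier(
                SnapshotId=snapshot["SnapshotId"], StorageTier="archive"
            )
```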

Results

$108K–$180K/year in Identified Savings (11–19% of Total Spend)

Priority | Optimization | Annual Savings | Effort
Critical | EBS snapshot archival | $24K–$33K | 1 hour
High | SageMaker notebook lifecycle policies | $18K–$30K | 2 hours
High | S3 Intelligent-Tiering for checkpoints | $10K–$14K | 1 hour
High | Data transfer optimization (VPC endpoints + CRR) | $10K–$18K | 4 hours
Medium | EBS volume type modernization (gp2 → gp3) | $10K–$18K | 2 hours
Medium | Instance scheduling for dev/eval resources | $12K–$24K | 4 hours
Strategic | Regional consolidation assessment | $18K–$30K | 2 weeks
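
The effort estimates are plausible because several of these items are single-API-call changes. For instance, the gp2 → gp3 row could reduce, in sketch form, to something like the following; the region is assumed, and a production version would also carry over any provisioned IOPS and throughput settings.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed

for page in ec2.get_paginator("describe_volumes").paginate(
    Filters=[{"Name": "volume-type", "Values": ["gp2"]}]
):
    for volume in page["Volumes"]:
        # gp3 matches gp2 baseline performance at roughly 20% lower per-GB
        # cost, and the modification happens online with no detach or downtime.
        ec2.modify_volume(VolumeId=volume["VolumeId"], VolumeType="gp3")
```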

Validated existing good practices: PointFive confirmed that PwC's use of SageMaker Training Plans and EC2 Capacity Blocks for H100 reservations was already well optimized.

Blackwell cost preparedness: With 689 hours/month of P6 Blackwell GPU usage currently at $0, PointFive established a monitoring baseline for when GA pricing takes effect — estimated at $50K–$100K/month.
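
The GA estimate follows directly from the usage baseline once an hourly rate is assumed. P6 GA pricing was unpublished at the time of the study, so the rates below are purely illustrative.

```python
hours_per_month = 689              # P6 Blackwell usage from the study
rate_low, rate_high = 73.0, 145.0  # assumed $/instance-hour at GA (hypothetical)

print(f"${hours_per_month * rate_low:,.0f} to "
      f"${hours_per_month * rate_high:,.0f} per month")
# -> $50,297 to $99,905, consistent with the $50K-$100K/month estimate
```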

Full pipeline visibility: For the first time, PwC's AI research team had a single view connecting data preparation costs to training costs to checkpoint distribution costs across regions.

Conclusion

PwC's LLM training platform represents a new class of cloud workload: multi-region, multi-architecture, rapidly evolving, and mission-critical. Traditional FinOps tools see services and line items, not training pipelines and data flows.

PointFive mapped a full LLM training pipeline across 8 AWS regions, 5 NVIDIA GPU generations, and multiple compute paradigms into a coherent cost picture with prioritized, engineering-ready optimizations.

With $9K–$15K/month in savings identified and continuous monitoring in place for the coming Blackwell cost transition, PwC is positioned to scale its custom AI capabilities with cost efficiency built into the foundation.

About PointFive

PointFive redefines how enterprises continuously optimize cloud, infrastructure, and AI environments. By combining a real-time cloud and infrastructure data fabric with AI-driven detection and guided remediation, PointFive transforms efficiency from a reporting exercise into an operational discipline. Customers achieve sustained improvements in cost, performance, reliability, and engineering accountability at scale.

To learn more, book a demo.

Savings by Service

  • EBS Snapshot Archival: ~$33K/yr
  • SageMaker Notebook Lifecycle Policies: ~$30K/yr
  • Data Transfer Optimization: ~$18K/yr
  • Regional Consolidation: ~$30K/yr

Ready to find your hidden savings?

Get a quantified savings report in 48 hours. No agents, no risk.

Book a Demo