Meet your training timelines and budgets with new Amazon SageMaker HyperPod flexible training plans

Today, we’re announcing the general availability of Amazon SageMaker HyperPod flexible training plans, which help data scientists train large foundation models (FMs) within their timelines and budgets, saving them weeks of effort otherwise spent managing the training process around compute availability.

At AWS re:Invent 2023, we introduced SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and to scale across thousands of compute resources in parallel with preconfigured distributed training libraries and built-in resiliency. Most generative AI model development tasks need large amounts of accelerated compute in parallel, and our customers often struggle to get timely access to that compute so they can complete training within their timeline and budget constraints.

With today’s announcement, you can find the required accelerated compute resources for training, create optimal training plans, and run training workloads across different blocks of capacity based on compute availability. In a few steps, you can specify your training completion date, budget, and compute requirements; create an optimal training plan; and run fully managed training jobs without manual intervention.

SageMaker HyperPod training plans in action
To get started, go to the Amazon SageMaker AI console, choose Training plans in the left navigation pane, and choose Create training plan.

For example, choose your preferred training duration and start date (for example, 10 days) and the instance type and count (for example, 16 ml.p5.48xlarge) for your SageMaker HyperPod cluster, and then choose Find training plan.

SageMaker HyperPod suggests a training plan that is split into two five-day segments, along with the total upfront price for the plan.
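To make the upfront pricing concrete, here is a minimal sketch of how a segmented plan’s cost adds up. The per-instance-hour rate below is hypothetical, not the actual ml.p5.48xlarge training-plan rate; see the SageMaker AI pricing page for real prices.

```python
# Illustrative sketch only: the rate below is hypothetical, not actual
# SageMaker HyperPod training plan pricing.
HOURS_PER_DAY = 24

def plan_upfront_price(segment_days, instance_count, rate_per_instance_hour):
    """Total upfront price for a plan split into capacity segments.

    segment_days: list of segment lengths in days (e.g. [5, 5] for the
    two five-day segments suggested above).
    """
    total_hours = sum(days * HOURS_PER_DAY for days in segment_days)
    return total_hours * instance_count * rate_per_instance_hour

# A 10-day plan split into two 5-day segments on 16 instances,
# at a hypothetical $55 per instance-hour:
price = plan_upfront_price([5, 5], instance_count=16, rate_per_instance_hour=55.0)
print(price)
```

Because the price is computed on total instance-hours, a plan split into segments costs the same as one contiguous block of the same total duration.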

If you accept this training plan, add your training details in the next step and choose Create your plan.

After you create a training plan, it appears in the list of training plans, and you must pay for it upfront within 12 hours. In this example, one plan is in the Active state and has already started, with all of its instances in use. The second plan is Scheduled to start later, but you can already submit jobs that will start automatically when the plan begins.

When a plan is active, its compute resources are available in SageMaker HyperPod, resume automatically after pauses in availability, and terminate at the end of the plan. In this example, the first segment is currently running and the second segment is queued to run after it.
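The lifecycle described above — a plan that is Scheduled, accepts jobs early, runs them across consecutive capacity segments, and pauses between segments — can be modeled with a small sketch. This is an illustrative toy state machine, not the SageMaker API.

```python
# Toy model of a training plan's lifecycle; names and behavior are
# illustrative assumptions, not the real SageMaker HyperPod API.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Segment:
    start: datetime
    end: datetime

@dataclass
class TrainingPlan:
    segments: list                      # consecutive capacity segments
    queued_jobs: list = field(default_factory=list)

    def status(self, now):
        if now < self.segments[0].start:
            return "Scheduled"
        if now > self.segments[-1].end:
            return "Completed"
        return "Active"

    def submit(self, job):
        # Jobs can be submitted while the plan is still Scheduled;
        # they start automatically once the plan becomes Active.
        self.queued_jobs.append(job)

    def runnable_jobs(self, now):
        # Jobs run only inside a capacity segment; between segments
        # they pause and resume automatically in the next one.
        in_segment = any(s.start <= now <= s.end for s in self.segments)
        return self.queued_jobs if in_segment else []

start = datetime(2024, 12, 1)
plan = TrainingPlan(segments=[
    Segment(start, start + timedelta(days=5)),
    Segment(start + timedelta(days=6), start + timedelta(days=11)),
])
plan.submit("fm-pretraining-job")                       # before the plan starts
print(plan.status(start - timedelta(days=1)))           # Scheduled
print(plan.runnable_jobs(start + timedelta(days=2)))    # inside the first segment
print(plan.runnable_jobs(start + timedelta(days=5, hours=12)))  # paused between segments
```

The key behavior this models is that submitting a job and running it are decoupled: the plan, not the user, decides when queued work actually executes.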

This is similar to Managed Spot Training in SageMaker AI, where SageMaker AI handles instance interruptions and continues the training without manual intervention. To learn more, visit SageMaker HyperPod training plans in the Amazon SageMaker AI Developer Guide.

Now available
Amazon SageMaker HyperPod training plans are now available in the US East (N. Virginia), US East (Ohio), and US West (Oregon) AWS Regions and support ml.p4d.48xlarge, ml.p5.48xlarge, ml.p5e.48xlarge, ml.p5en.48xlarge, and ml.trn2.48xlarge instances. Trn2 and P5en instances are available only in the US East (Ohio) Region. To learn more, visit the SageMaker HyperPod product page and the SageMaker AI pricing page.

Give HyperPod training plans a try in the Amazon SageMaker AI console and send feedback to AWS re:Post for SageMaker AI or through your usual AWS Support contacts.

Channy

