Wednesday, December 3, 2025

Introducing checkpointless and elastic training on Amazon SageMaker HyperPod

Today, we’re announcing two new AI model training features within Amazon SageMaker HyperPod: checkpointless training, an approach that removes the need for traditional checkpoint-based recovery by enabling peer-to-peer state recovery, and elastic training, which enables AI workloads to automatically scale based on resource availability.

  • Checkpointless training – Checkpointless training eliminates disruptive checkpoint-restart cycles, maintaining forward training momentum despite failures, reducing recovery time from hours to minutes. Accelerate your AI model development, reclaim days from development timelines, and confidently scale training workflows to thousands of AI accelerators.
  • Elastic training – Elastic training maximizes cluster utilization: training workloads automatically expand to use idle capacity as it becomes available, and contract to yield resources when higher-priority workloads, such as inference, peak. Save hours of engineering time per week otherwise spent reconfiguring training jobs based on compute availability.

With these new training techniques, rather than spending time managing training infrastructure, your team can concentrate entirely on enhancing model performance, ultimately getting your AI models to market faster. By eliminating traditional checkpoint dependencies and fully utilizing available capacity, you can significantly reduce model training completion times.

Checkpointless training: How it works
Traditional checkpoint-based recovery has these sequential job stages: 1) job termination and restart, 2) process discovery and network setup, 3) checkpoint retrieval, 4) data loader initialization, and 5) training loop resumption. When failures occur, each stage can become a bottleneck and training recovery can take up to an hour on self-managed training clusters. The entire cluster must wait for every single stage to complete before training can resume. This can lead to the entire training cluster sitting idle during recovery operations, which increases costs and extends the time to market.

Checkpointless training removes this bottleneck entirely by maintaining continuous model state preservation across the training cluster. When failures occur, the system instantly recovers by using healthy peers, avoiding the need for a checkpoint-based recovery that requires restarting the entire job. As a result, checkpointless training enables fault recovery in minutes.

Checkpointless training is designed for incremental adoption and built on four core components that work together: 1) collective communications initialization optimizations, 2) memory-mapped data loading that enables caching, 3) in-process recovery, and 4) checkpointless peer-to-peer state replication. These components are orchestrated through the HyperPod training operator that is used to launch the job. Each component optimizes a specific step in the recovery process, and together they enable automatic detection and recovery of infrastructure faults in minutes with zero manual intervention, even with thousands of AI accelerators. You can progressively enable each of these features as your training scales.
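
To make the peer-to-peer recovery idea concrete, here is a minimal conceptual sketch in PyTorch. It is not the HyperPod implementation: it only illustrates how a replacement rank could pull live model and optimizer state from a healthy peer over the collective instead of restoring a checkpoint from storage. The source_rank argument and the recovery trigger are assumptions for illustration.

import torch
import torch.distributed as dist

def recover_from_peer(model, optimizer, source_rank=0):
    """Conceptual sketch: a replacement rank receives current training state
    from a healthy peer instead of loading a checkpoint from storage."""
    # Broadcast every model parameter from the healthy source rank.
    for param in model.parameters():
        dist.broadcast(param.data, src=source_rank)

    # Broadcast optimizer state tensors (for example, Adam moments) the same way.
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                dist.broadcast(value, src=source_rank)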

The latest Amazon Nova models were trained using this technology on tens of thousands of accelerators. Additionally, in internal studies on cluster sizes ranging from 16 GPUs to over 2,000 GPUs, checkpointless training showed significant improvements in recovery times, reducing downtime by over 80% compared to traditional checkpoint-based recovery.

To learn more, visit HyperPod Checkpointless Training in the Amazon SageMaker AI Developer Guide.

Elastic training: How it works
On clusters that run different types of modern AI workloads, accelerator availability can change continuously throughout the day as short-duration training runs complete, inference spikes occur and subside, or resources free up from completed experiments. Despite this dynamic availability of AI accelerators, traditional training workloads remain locked into their initial compute allocation, unable to take advantage of idle accelerators without manual intervention. This rigidity leaves valuable GPU capacity unused and prevents organizations from maximizing their infrastructure investment.

Elastic training transforms how training workloads interact with cluster resources. Training jobs can automatically scale up to utilize available accelerators and gracefully contract when resources are needed elsewhere, all while maintaining training quality.

Workload elasticity is enabled through the HyperPod training operator that orchestrates scaling decisions through integration with the Kubernetes control plane and resource scheduler. It continuously monitors cluster state through three primary channels: pod lifecycle events, node availability changes, and resource scheduler priority signals. This comprehensive monitoring enables near-instantaneous detection of scaling opportunities, whether from newly available resources or requests from higher-priority workloads.

The scaling mechanism relies on adding and removing data parallel replicas. When additional compute resources become available, new data parallel replicas join the training job, accelerating throughput. Conversely, during scale-down events (for example, when a higher-priority workload requests resources), the system scales down by removing replicas rather than terminating the entire job, allowing training to continue at reduced capacity.

Across different scales, the system preserves the global batch size and adapts learning rates, preventing model convergence from being adversely impacted. This enables workloads to dynamically scale up or down to utilize available AI accelerators without any manual intervention.

You can start elastic training through the HyperPod recipes for publicly available foundation models (FMs) including Llama and GPT-OSS. Additionally, you can modify your PyTorch training scripts to add elastic event handlers, which enable the job to dynamically scale.
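
As a rough illustration of what an elastic event handler needs to do, the sketch below recomputes each replica's share of a fixed global batch size and restarts a short learning rate warmup whenever the data parallel world size changes. The handler name, signature, and re-warmup policy are assumptions for illustration; refer to the HyperPod recipes and documentation for the actual interface.

GLOBAL_BATCH_SIZE = 2048  # held constant across scaling events

def on_world_size_change(new_world_size, base_lr, dataloader_builder, rewarmup_steps=100):
    """Hypothetical elastic event handler invoked when replicas join or leave."""
    # Preserve the global batch size by adjusting each replica's share of it.
    per_replica_batch = GLOBAL_BATCH_SIZE // new_world_size

    # One simple adaptation policy: briefly re-warm the learning rate so
    # optimization settles smoothly at the new replica count.
    def lr_at(step_since_rescale):
        return base_lr * min(1.0, (step_since_rescale + 1) / rewarmup_steps)

    # Rebuild the data loader so each replica reads its new shard and batch size.
    return dataloader_builder(batch_size=per_replica_batch), lr_at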

To learn more, visit HyperPod Elastic Training in the Amazon SageMaker AI Developer Guide. To get started, find the HyperPod recipes available in the AWS GitHub repository.

Now available
Both features are available in all the Regions in which Amazon SageMaker HyperPod is available. You can use these training techniques at no additional cost. To learn more, visit the SageMaker HyperPod product page and SageMaker AI pricing page.

Give it a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

Channy



from AWS News Blog https://ift.tt/BqLPXhQ
via IFTTT

Tuesday, December 2, 2025

Introducing Amazon Nova Forge: Build your own frontier models using Nova

Organizations are rapidly expanding their use of generative AI across all parts of the business. Applications requiring deep domain expertise or specific business context need models that truly understand their proprietary knowledge, workflows, and unique requirements.

While techniques like prompt engineering and Retrieval Augmented Generation (RAG) work well for many use cases, they have fundamental limitations when it comes to embedding specialized knowledge into a model’s core understanding. Supervised fine-tuning and reinforcement learning help customize the model, but they operate too late in the development lifecycle, layering modifications on top of models that are already fully trained and therefore difficult to steer toward specific domains of interest.

When organizations attempt deeper customization through Continued Pre-Training (CPT) using only their proprietary data, they often encounter catastrophic forgetting, where models lose their foundational capabilities as they learn new content. At the same time, the data, compute, and cost needed for training a model from scratch are still a prohibitive barrier for most organizations.

Today, we’re introducing Amazon Nova Forge, a new service to build your own frontier models using Nova. Nova Forge customers can start their development from early model checkpoints, blend their datasets with Amazon Nova-curated training data, and host their custom models securely on AWS. Nova Forge is the easiest and most cost-effective way to build your own frontier model.

Use cases and applications
Nova Forge is designed for organizations with access to proprietary or industry-specific data who want to build AI that truly understands their domain. This includes:

  • Manufacturing and automation – Building models that understand specialized processes, equipment data, and industry-specific workflows
  • Research and development – Creating models trained on proprietary research data and domain-specific knowledge
  • Content and media – Developing models that understand brand voice, content standards, and specific moderation requirements
  • Specialized industries – Training models on industry-specific terminology, regulations, and best practices

Depending on the specific use cases, Nova Forge can be used to add differentiated capabilities, enhance task-specific accuracy, reduce costs, and lower latency.

How Nova Forge works
Nova Forge addresses the limitations of current customization approaches by allowing you to start model development from early checkpoints across pre-training, mid-training, and post-training phases. You can blend your proprietary data with Amazon Nova-curated data throughout all training phases, running training using proven recipes on Amazon SageMaker AI fully managed infrastructure. This data mixing approach significantly reduces catastrophic forgetting compared to training with raw data alone, helping preserve foundational skills—including core intelligence, general instruction following capabilities, and safety benefits—while incorporating your specialized knowledge.

Nova Forge provides the ability to use reward functions in your own environment for reinforcement learning (RL). This allows the model to learn from feedback generated in environments that are representative of your use cases. Beyond single-step evaluations, you can also use your own orchestrator to manage multi-turn rollouts, enabling RL training for complex agent workflows and sequential decision-making tasks. Whether you’re using chemistry tools to score molecular designs, or robotics simulations that reward efficient task completion and penalize collisions, you can connect your proprietary environments directly.
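
As an illustration of the concept only, a reward function is essentially a callable that scores a model rollout using your own logic or tools and returns a scalar. The function name, signature, and scoring rules below are assumptions, not the Nova Forge interface.

import re

def reward_fn(prompt: str, completion: str) -> float:
    """Hypothetical reward function: score a rollout with domain-specific checks
    and return a scalar reward for reinforcement learning."""
    reward = 0.0
    # Example domain check: the answer must cite a part number in our format.
    if re.search(r"\bPN-\d{6}\b", completion):
        reward += 1.0
    # Penalize overly long answers to keep responses concise.
    if len(completion.split()) > 200:
        reward -= 0.5
    return reward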

You can also take advantage of the built-in responsible AI toolkit available in Nova Forge to configure the safety and content moderation settings of your model. You can adjust settings to meet your specific business needs in areas like safety, security, and handling of sensitive content.

Getting started with Nova Forge
Nova Forge integrates seamlessly with your existing AWS workflows. You can use the familiar tools and infrastructure in Amazon SageMaker AI to run your training, then import your custom Nova models as private models on Amazon Bedrock. This gives you the same security, consistent APIs, and broader AWS integrations as any model in Amazon Bedrock.

In Amazon SageMaker Studio, you can now build your frontier model with Amazon Nova.

Amazon Nova Forge in the SageMaker AI console

To start building the model, choose which checkpoint to use: pre-trained, mid-trained, or post-trained. You can also upload your dataset here or use existing datasets.

Amazon Nova Forge checkpoints

You can blend your training data by mixing in curated datasets provided by Nova. These datasets, categorized by domain, can help your model to preserve general performance and prevent overfitting or catastrophic forgetting.

Amazon Nova Forge data mixing
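
Conceptually, data mixing interleaves your proprietary examples with curated general-purpose examples at a chosen ratio, so the model keeps refreshing its general skills while it learns your domain. The sketch below is a generic illustration of that idea, not the Nova Forge implementation; the 80/20 split is an arbitrary example.

import random

def mix_datasets(proprietary, curated, proprietary_fraction=0.8, seed=42):
    """Yield training examples drawn from both sources at a fixed ratio."""
    rng = random.Random(seed)
    proprietary_iter, curated_iter = iter(proprietary), iter(curated)
    while True:
        pick_proprietary = rng.random() < proprietary_fraction
        source = proprietary_iter if pick_proprietary else curated_iter
        try:
            yield next(source)
        except StopIteration:
            # Stop once either source is exhausted.
            return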

Optionally, you can choose to use Reinforcement Fine-Tuning (RFT) to improve factual accuracy and reduce hallucinations in specific domains.

When training completes, import the model into Amazon Bedrock and start using it in your applications.

Things to know
Amazon Nova Forge is available in the US East (N. Virginia) AWS Region. The program includes access to multiple Nova model checkpoints, proven training recipes to mix proprietary data with Amazon Nova-curated training data, and integration with Amazon SageMaker AI and Amazon Bedrock.

Learn more in the Amazon Nova User Guide and explore Nova Forge from the Amazon SageMaker AI console.

Organizations interested in expert assistance can also reach out to our Generative AI Innovation Center for additional support with their model development initiatives.

Danilo



from AWS News Blog https://ift.tt/BClcjS4
via IFTTT

Monday, December 1, 2025

Introducing AWS Transform custom: Crush tech debt with AI-powered code modernization

Technical debt is one of the most persistent challenges facing enterprise development teams today. Studies show that organizations spend 20% of their IT budget on technical debt instead of advancing new capabilities. Whether it’s upgrading legacy frameworks, migrating to newer runtime versions, or refactoring outdated code patterns, these essential but repetitive tasks consume valuable developer time that could be spent on innovation.

Today, we’re excited to announce AWS Transform custom, a new agent that fundamentally changes how organizations approach modernization at scale. This intelligent agent combines pre-built transformations for Java, Node.js, and Python upgrades with the ability to define custom transformations. By learning specific transformation patterns and automating them across entire codebases, customers using AWS Transform custom have achieved up to 80% reduction in execution time in many cases, freeing developers to focus on innovation.

You can define transformations using your documentation, natural language descriptions, and code samples. The service then applies these specific patterns consistently across hundreds or thousands of repositories, improving its effectiveness through both explicit feedback and implicit signals like developers’ manual fixes within your transformation projects.

AWS Transform custom offers both CLI and web interfaces to suit different modernization needs. You can use the CLI to define transformations through natural language interactions and execute them on local codebases, either interactively or autonomously. You can also integrate it into code modernization pipelines or workflows, making it ideal for machine-driven automation. Meanwhile, the web interface provides comprehensive campaign management capabilities, helping teams track and coordinate transformation progress across multiple repositories at scale.

Language and framework modernization
AWS Transform supports runtime upgrades without the need to provide additional information, understanding not only the syntax changes required but also the subtle behavioral differences and optimization opportunities that come with newer versions. The same intelligent approach applies to Node.js, Python and Java runtime upgrades, and even extends to infrastructure-level transitions, such as migrating workloads from x86 processors to AWS Graviton.

It also navigates framework modernization with sophistication. When organizations need to update their Spring Boot applications to take advantage of newer features and security patches, AWS Transform custom doesn’t merely update version numbers but understands the cascading effects of dependency changes, configuration updates, and API modifications.

For teams facing more dramatic shifts, such as migrating from Angular to React, AWS Transform custom can learn the patterns of component translation, state management conversion, and routing logic transformation that make such migrations successful.

Infrastructure and enterprise-scale transformations
The challenge of keeping up with evolving APIs and SDKs becomes particularly acute in cloud-based environments where services are continuously improving. AWS Transform custom supports AWS SDK updates across a broad spectrum of programming languages that enterprises use including Java, Python, and JavaScript. The service understands not only the mechanical aspects of API changes, but also recognizes best practices and optimization opportunities available in newer SDK versions.

Infrastructure as Code transformations represent another critical capability, especially as organizations evaluate different tooling strategies. Whether you’re converting AWS Cloud Development Kit (AWS CDK) templates to Terraform for standardization purposes, or updating AWS CloudFormation configurations to access new service features, AWS Transform custom understands the declarative nature of these tools and can maintain the intent and structure of your infrastructure definitions.

Beyond these common scenarios, AWS Transform custom excels at addressing the unique, organization-specific code patterns that accumulate over years of development. Every enterprise has its own architectural conventions, utility libraries, and coding standards that need to evolve over time. It can learn these custom patterns and help refactor them systematically so that institutional knowledge and best practices are applied consistently across the entire application portfolio.

AWS Transform custom is designed with enterprise development workflows in mind, enabling center of excellence teams and system integrators to define and execute organization-wide transformations while application developers focus on reviewing and integrating the transformed code. DevOps engineers can then configure integrations with existing continuous integration and continuous delivery (CI/CD) pipelines and source control systems. It also includes pre-built transformations for Java, Node.js and Python runtime updates which can be particularly useful for AWS Lambda functions, along with transformations for AWS SDK modernization to help teams get started immediately.

Getting started
AWS Transform makes complex code transformations manageable through both pre-built and custom transformation capabilities. Let’s start by exploring how to use an existing transformation to address a common modernization challenge: upgrading AWS Lambda functions due to end-of-life (EOL) runtime support.

For this example, I’ll demonstrate migrating a Python 3.8 Lambda function to Python 3.13, as Python 3.8 reached EOL and is no longer receiving security updates. I’ll use the CLI for this demo, but I encourage you to also explore the web interface’s powerful campaign management capabilities.

First, I use the command atx custom def list to explore the available transformation definitions. You can also access this functionality through a conversational interface by typing only atx instead of issuing the command directly, if you prefer.

This command displays all available transformations, including both AWS-managed defaults and any existing custom transformations created by users in my organization. AWS-managed transformations are identified by the AWS/ prefix, indicating they’re maintained and updated by AWS. In the results, I can see several options such as AWS/java-version-upgrade for Java runtime modernization, AWS/python-boto2-to-boto3-migration for updating Python AWS SDK usage, and AWS/nodejs-version-upgrade for Node.js runtime updates.

For my Python 3.8 to 3.13 migration, I’ll use the AWS/python-version-upgrade transformation.

You run a migration by using the atx custom def exec command. Consult the documentation for more details about the command and all its options. Here, I run it against my project repository, specifying the transformation name. I also add pytest to run unit tests for validation. More importantly, I use the additionalPlanContext section in the --configuration input to specify which Python version I want to upgrade to. For reference, here’s the command I used for my demo (I’ve split it across multiple lines and indented it for clarity):

atx custom def exec \
  -p /mnt/c/Users/vasudeve/Documents/Work/Projects/ATX/lambda/todoapilambda \
  -n AWS/python-version-upgrade \
  -C "pytest" \
  --configuration \
    "additionalPlanContext= The target Python version to upgrade to is Python 3.13" \
  -x -t

AWS Transform then starts the migration process. It analyzes my Lambda function code, identifies Python 3.8-specific patterns, and automatically applies the necessary changes for Python 3.13 compatibility. This includes updating syntax for deprecated features, modifying import statements, and adjusting any version-specific behaviors.

After execution, it provides a comprehensive summary, including a report on dependencies updated in requirements.txt with Python 3.13-compatible package versions, instances of deprecated syntax replaced with current equivalents, updated runtime configuration notes for AWS Lambda deployment, suggested test cases to validate the migration, and more. It also provides a body of evidence that serves as proof of success.

The migrated code lives in a local branch so you can review and merge when satisfied. Alternatively, you can keep providing feedback and iterating until you’re happy that the migration is fully complete and meets your expectations.

This automated process turns what would typically require hours of manual work into a streamlined, consistent upgrade that maintains code quality while ensuring compatibility with the newer Python runtime.

Creating a new custom transformation
While AWS-managed transformations handle common scenarios effectively, you can also create custom transformations tailored to your organization’s specific needs. Let’s explore how to create a custom transformation to see how AWS Transform learns from your specific requirements.

I type atx to initialize the atx CLI and start the process.

The first thing it asks me is whether I want to use one of the existing transformations or create a new one. I choose to create a new one. Notice that from here on, the whole conversation takes place in natural language, not commands. I typed new one, but I could have typed I want to create a new one and it would have understood me just the same.

It then prompts me to provide more information about the kind of transformation I’d like to perform. For this demo, I’m going to migrate an Angular application, so I type angular 16 to 19 application migration, which prompts the CLI to search for all transformations available for this type of migration. In my case, my team has already created and made available a few Angular migrations, so it shows me those. However, it warns me that none of them is an exact match for my specific request to migrate from Angular 16 to 19. It then asks if I’d like to select one of the existing transformations listed or create a custom one.

I choose to create a custom one by continuing to use natural language and typing create a new one. Again, this could be any variation of that statement, provided you indicate your intention clearly. It follows up by asking me a few questions, including whether I have any useful documentation, example code, or migration guides that I can provide to help customize the transformation plan.

For this demo, I’m only going to rely on AWS Transform to provide me with good defaults. I type I don't have these details. Follow best practices. and the CLI responds by telling me that it will create a comprehensive transformation definition for migrating Angular 16 to Angular 19. In this case, I’m relying on the model’s pretrained knowledge to generate results based on best practices. As usual, the recommendation is to provide as much information and relevant data as possible at this stage of the process for better results. However, you don’t need to have all the data upfront. You can keep providing data at any time as you iterate through the process of creating the custom transformation definition.

The transformation definition is generated as a markup file containing a summary and a comprehensive sequence of implementation steps grouped logically into phases such as premigration preparation, processing and partitioning, static dependency analysis, searching and applying specific transformation rules, and step-by-step migration and iterative validation.

It’s interesting to see that AWS Transform opted for the best practice of incremental framework updates, creating steps to migrate the application first to 17, then 18, then 19, instead of going directly from 16 to 19, to minimize issues.

Note that the plan includes various stages of testing and verification to confirm that each phase can be concluded with confidence. At the very end, it also includes a final validation stage listing exit criteria, a comprehensive set of tests against all aspects of the application that will be used to accept the migration as successfully complete.

After the transformation definition is created, AWS Transform asks me what I would like to do next. I can choose to review or modify the transformation definition, and I can iterate through this process as much as I need until I arrive at one that I’m satisfied with. I could also apply this transformation definition to an Angular codebase right away. However, first I want to make this transformation available to my team members as well as myself so we can all use it again in the future. So, I choose option 4 to publish this transformation to the registry.

This custom transformation needs a name and a description of its objective, which are displayed when users browse the registry. AWS Transform automatically extracts those from context for me and asks if I would like to modify them before going ahead. I like the sensible default of “Angular-16-to-19-Migration”, and the objective is clearly stated, so I choose to accept the suggestions and publish it by answering yes, looks good.

Now that the transformation definition is created and published, I can use it and run it multiple times against any code repository. Let’s apply the transformation to a code repository with a project written in Angular 16. I now choose option 1 from the follow-up prompt and the CLI asks me for the path in my file system to the application that I want to migrate and, optionally, the build command that it should use.

After I provide that information, AWS Transform proceeds to analyze the code base and formulate a thorough step-by-step transformation plan based on the definition created earlier. After it’s done, it creates a JSON file containing the detailed migration plan specifically designed for applying our transformation definition to this code base. Similar to the process of creating the transformation definition, you can review and iterate through this plan as much as you need, providing it with feedback and adjusting it to any specific requirements you might have.

When I’m ready to accept the plan, I can use natural language to tell AWS Transform that we can start the migration process. I type looks good, proceed and watch the progress in my shell as it starts executing the plan and making the changes to my code base one step at a time.

The time it takes will vary depending on the complexity of the application. In my case, it took a few minutes to complete. After it has finished, it provides me with a transformation summary and the status of each of the exit criteria included in the final verification phase of the plan, alongside all the evidence to support the reported status. For example, the Application Build – Production criterion was listed as passed, and some of the evidence provided included the incremental Git commits, the time it took to complete the production build, the bundle size, the build output message, and details about all the output files created.

Conclusion
AWS Transform represents a fundamental shift in how organizations approach code modernization and technical debt. The service helps to transform what was at one time a fragmented, team-by-team effort into a unified, intelligent capability that eliminates knowledge silos, keeping your best practices and institutional knowledge available as scalable assets across the entire organization. This helps to accelerate modernization initiatives while freeing developers to spend more time on innovation and driving business value instead of focusing on repetitive maintenance and modernization tasks.

Things to know

AWS Transform custom is now generally available. Visit the get started guide to start your first transformation campaign or check out the documentation to learn more about setting up custom transformation definitions.



from AWS News Blog https://ift.tt/Ps47FOJ
via IFTTT

Sunday, November 30, 2025

Announcing Amazon EKS Capabilities for workload orchestration and cloud resource management

Today, we’re announcing Amazon Elastic Kubernetes Service (Amazon EKS) Capabilities, an extensible set of Kubernetes-native solutions that streamline workload orchestration, Amazon Web Services (AWS) cloud resource management, and Kubernetes resource composition and orchestration. These fully managed, integrated platform capabilities include open source Kubernetes solutions that many customers are using today, such as Argo CD, AWS Controllers for Kubernetes, and Kube Resource Orchestrator.

With EKS Capabilities, you can build and scale Kubernetes applications without managing complex solution infrastructure. Unlike typical in-cluster installations, these capabilities actually run in EKS service-owned accounts that are fully abstracted from customers.

With AWS managing infrastructure scaling, patching, and updates of these cluster capabilities, you get enterprise-grade reliability and security without needing to maintain and manage the underlying components.

Here are the capabilities available at launch:

  • Argo CD – This is a declarative GitOps tool that provides continuous deployment (CD) capabilities for Kubernetes. It’s broadly adopted, with more than 45% of Kubernetes end users reporting production or planned production use in the 2024 Cloud Native Computing Foundation (CNCF) Survey.
  • AWS Controllers for Kubernetes (ACK) – ACK is highly popular with enterprise platform teams in production environments. ACK provides custom resources for Kubernetes that enable the management of AWS Cloud resources directly from within your clusters.
  • Kube Resource Orchestrator (KRO) – KRO provides a streamlined way to create and manage custom resources in Kubernetes. With KRO, platform teams can create reusable resource bundles that abstract away complexity while remaining native to the Kubernetes ecosystem.

With these capabilities, you can accelerate and scale your Kubernetes adoption, using their opinionated but flexible features to build for scale right from the start. EKS Capabilities is designed to offer a set of foundational cluster capabilities that layer seamlessly with each other, providing integrated features for continuous deployment, resource orchestration, and composition. You can focus on managing and shipping software without needing to spend time and resources building and managing these foundational platform components.

How it works
Platform engineers and cluster administrators can set up EKS Capabilities to offload the building and management of custom solutions that provide common foundational services, meaning they can focus on the more differentiated features that matter to your business.

Your application developers primarily work with EKS Capabilities as they do with other Kubernetes features. They do this by applying declarative configuration to create Kubernetes resources using familiar tools, such as kubectl, or through automation from git commit to running code.

Get started with EKS Capabilities
To enable EKS Capabilities, you can use the EKS console, AWS Command Line Interface (AWS CLI), eksctl, or other preferred tools. In the EKS console, choose Create capabilities in the Capabilities tab on your existing EKS cluster. EKS Capabilities are AWS resources, and they can be tagged, managed, and deleted.

You can select one or more capabilities to work together. I checked all three capabilities: ArgoCD, ACK, and KRO. However, these capabilities are completely independent and you can pick and choose which capabilities you want enabled on your clusters.

Now you can configure the selected capabilities. You should create AWS Identity and Access Management (AWS IAM) roles to enable EKS to operate these capabilities within your cluster. Note that you cannot modify the capability name, namespace, authentication region, or AWS IAM Identity Center instance after creating the capability. Choose Next, review the settings, and enable the capabilities.

Now you can see and manage the created capabilities. Select ArgoCD to update the configuration of the capability.

You can see the details of the Argo CD capability. Choose Edit to change configuration settings or Monitor ArgoCD to view the health status of the capability for the current EKS cluster.

Choose Go to Argo UI to visualize and monitor deployment status and application health.

To learn more about how to set up and use each capability in detail, visit Getting started with EKS Capabilities in the Amazon EKS User Guide.

Things to know
Here are key considerations to know about this feature:

  • Permissions – EKS Capabilities are cluster-scoped administrator resources, and resource permissions are configured through AWS IAM. For some capabilities, there is additional configuration for single sign-on. For example, Argo CD single sign-on configuration is enabled directly in EKS with a direct integration with IAM Identity Center.
  • Upgrades – EKS automatically updates cluster capabilities you enable and their related dependencies. It automatically analyzes for breaking changes, patches and updates components as needed, and informs you of conflicts or issues through the EKS cluster insights.
  • Adoptions – ACK provides resource adoption features that enable migration of existing AWS resources into ACK management. ACK also provides read-only resources, which can help facilitate a step-wise migration of resources provisioned with Terraform or AWS CloudFormation into EKS Capabilities.

Now available
Amazon EKS Capabilities are now available in commercial AWS Regions. For Regional availability and future roadmap, visit the AWS Capabilities by Region. There are no upfront commitments or minimum fees, and you only pay for the EKS Capabilities and resources that you use. To learn more, visit the EKS pricing page.

Give it a try in the Amazon EKS console and send feedback to AWS re:Post for EKS or through your usual AWS Support contacts.

Channy



from AWS News Blog https://ift.tt/k8LjoPU
via IFTTT

Wednesday, November 26, 2025

Amazon Route 53 launches Accelerated recovery for managing public DNS records

Today, we’re announcing Amazon Route 53 Accelerated recovery for managing public DNS records, a new Domain Name System (DNS) business continuity feature that is designed to provide a 60-minute recovery time objective (RTO) during service disruptions in the US East (N. Virginia) AWS Region. This enhancement ensures that customers can continue making DNS changes and provisioning infrastructure even during regional outages, providing greater predictability and resilience for mission-critical applications.

Customers running applications that require business continuity have told us they need additional DNS resilience capabilities to meet their business continuity requirements and regulatory compliance obligations. While AWS maintains exceptional availability across our global infrastructure, organizations in regulated industries like banking, FinTech, and SaaS want the confidence that they will be able to make DNS changes even during unexpected regional disruptions, allowing them to quickly provision standby cloud resources or redirect traffic when needed.

Accelerated recovery for managing public DNS records addresses this need by targeting the ability for customers to make DNS changes within 60 minutes of a service disruption in the US East (N. Virginia) Region. The feature works seamlessly with your existing Route 53 setup, providing access to key Route 53 API operations during failover scenarios, including ChangeResourceRecordSets, GetChange, ListHostedZones, and ListResourceRecordSets. Customers can continue using their existing Route 53 API endpoint without modifying applications or scripts.
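
Because the endpoint and APIs stay the same, existing automation keeps working as-is. For example, a failover script that repoints a record with the standard ChangeResourceRecordSets and GetChange operations, sketched below with boto3 and placeholder zone and domain values, behaves the same way whether or not a disruption is in progress.

import boto3

route53 = boto3.client("route53")

# Placeholder values for illustration.
HOSTED_ZONE_ID = "Z0123456789EXAMPLE"

response = route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Comment": "Fail over application traffic to the standby Region",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "TTL": 60,
                "ResourceRecords": [{"Value": "standby.example.com"}],
            },
        }],
    },
)

# Poll the change status with GetChange, exactly as during normal operations.
status = route53.get_change(Id=response["ChangeInfo"]["Id"])["ChangeInfo"]["Status"]
print(status)  # PENDING, then INSYNC once the change has propagated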

Let’s try it out
Configuring a Route 53 hosted zone to use accelerated recovery is simple. Here I am creating a new hosted zone for a new website I’m building.

Once I have created my hosted zone, I see a new tab labeled Accelerated recovery. I can see here that accelerated recovery is disabled by default.

To enable it, I just need to click the Enable button and confirm my choice in the modal that appears as depicted in the dialog below.

Enabling accelerated recovery will take a couple of minutes to complete. Once it’s enabled, I see a green Enabled status as depicted in the screenshot below.

I can disable accelerated recovery at any time from this same area of the AWS Management Console. I can also enable accelerated recovery for any existing hosted zones I have already created.

Enhanced DNS business continuity
With accelerated recovery enabled, customers gain several key capabilities during service disruptions. The feature maintains access to essential Route 53 API operations, ensuring that DNS management remains available when it’s needed most. Organizations can continue to make critical DNS changes, provision new infrastructure, and redirect traffic flows without waiting for full service restoration.

The implementation is designed for simplicity and reliability. Customers don’t need to learn new APIs or modify existing automation scripts. The same Route 53 endpoints and API calls continue to work, providing a seamless experience during both normal operations and failover scenarios.

Now available
Accelerated recovery for Amazon Route 53 public hosted zones is available now. You can enable this feature through the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS SDKs, or infrastructure as code tools like AWS CloudFormation and AWS Cloud Development Kit (AWS CDK). There is no additional cost for using accelerated recovery.

To learn more about accelerated recovery and get started, visit the documentation. This new capability represents our continued commitment to providing customers with the DNS resilience they need to build and operate mission-critical applications in the cloud.



from AWS News Blog https://ift.tt/C2wbS9x
via IFTTT

Monday, November 24, 2025

AWS Weekly Roundup: How to join AWS re:Invent 2025, plus Kiro GA, and lots of launches (Nov 24, 2025)

Next week, don’t miss AWS re:Invent, Dec. 1-5, 2025, for the latest AWS news, expert insights, and global cloud community connections! Our News Blog team is finalizing posts to introduce the most exciting launches from our service teams. If you’re joining us in person in Las Vegas, review the agenda, session catalog, and attendee guides before arriving. Can’t attend in person? Watch our Keynotes and Innovation Talks via livestream.

Kiro is now generally available
Last week, Kiro, the first AI coding tool built around spec-driven development, became generally available. This tool, which we pioneered to bring more clarity and structure to agentic workflows, has already been embraced by over 250,000 developers since its preview release. The GA launch introduces four new capabilities: property-based testing for spec correctness (which measures whether your code matches what you specified); a new way to checkpoint your progress on Kiro; a new Kiro CLI bringing agents to your terminal; and enterprise team plans with centralized management.

Last week’s launches
We’ve announced numerous new feature and service launches as we approach re:Invent week. Key launches include:

Here are some AWS bundled feature launches:

See AWS What’s New for more launch news that I haven’t covered here, and we’ll see you next week at re:Invent!

Channy



from AWS News Blog https://ift.tt/in0GmtK
via IFTTT

Friday, November 21, 2025

New one-click onboarding and notebooks with a built-in AI agent in Amazon SageMaker Unified Studio

Today we’re announcing a faster way to get started with your existing AWS datasets in Amazon SageMaker Unified Studio. You can now start working with any data you have access to in a new serverless notebook with a built-in AI agent, using your existing AWS Identity and Access Management (IAM) roles and permissions.

New updates include:

  • One-click onboarding – Amazon SageMaker can now automatically create a project in Unified Studio with all your existing data permissions from AWS Glue Data Catalog, AWS Lake Formation, and Amazon Simple Storage Service (Amazon S3).
  • Direct integration – You can launch SageMaker Unified Studio directly from the Amazon SageMaker, Amazon Athena, Amazon Redshift, and Amazon S3 Tables console pages, giving you a fast path to analytics and AI workloads.
  • Notebooks with a built-in AI agent – You can use a new serverless notebook with a built-in AI agent, which supports SQL, Python, Spark, or natural language and gives data engineers, analysts, and data scientists one place to develop and run both SQL queries and code.

You also have access to other tools such as a Query Editor for SQL analysis, JupyterLab integrated developer environment (IDE), Visual ETL and workflows, and machine learning (ML) capabilities.

Try one-click onboarding and connect to Amazon SageMaker Unified Studio
To get started, go to the SageMaker console and choose the Get started button.

You will be prompted either to select an existing AWS Identity and Access Management (AWS IAM) role that has access to your data and compute, or to create a new role.

Choose Set up. It takes a few minutes to complete your environment. After this role is granted access, you’ll be taken to the SageMaker Unified Studio landing page where you will see the datasets that you have access to in AWS Glue Data Catalog as well as a variety of analytics and AI tools to work with.

This environment automatically creates the following serverless compute: Amazon Athena Spark, Amazon Athena SQL, AWS Glue Spark, and Amazon Managed Workflows for Apache Airflow (MWAA) serverless. This means you completely skip provisioning and can start working immediately with just-in-time compute resources, and it automatically scales back down when you finish, helping to save on costs.

You can also get started working on specific tables in Amazon Athena, Amazon Redshift, and Amazon S3 Tables. For example, you can select Query your data in Amazon SageMaker Unified Studio and then choose Get started in Amazon Athena console.

If you start from these consoles, you’ll connect directly to the Query Editor with the data that you were looking at already accessible, and your previous query context preserved. With this context-aware routing, you can run queries immediately inside SageMaker Unified Studio without unnecessary navigation.

Getting started with notebooks with a built-in AI agent
Amazon SageMaker is introducing a new notebook experience that provides data and AI teams with a high-performance, serverless programming environment for analytics and ML jobs. The new notebook experience includes Amazon SageMaker Data Agent, a built-in AI agent that accelerates development by generating code and SQL statements from natural language prompts while guiding users through their tasks.

To start a new notebook, choose the Notebooks menu in the left navigation pane to run SQL queries, Python code, and natural language, and to discover, transform, analyze, visualize, and share insights on data. You can get started with sample data such as customer analytics and retail sales forecasting.

When you choose a sample project for customer usage analysis, you can open a sample notebook to explore customer usage patterns and behaviors in a telecom dataset.

As I noted, the notebook includes a built-in AI agent that helps you interact with your data through natural language prompts. For example, you can start with data discovery using prompts like:

Show me some insights and visualizations on the customer churn dataset.

After you identify relevant tables, you can request specific analysis to generate Spark SQL. The AI agent creates step-by-step plans with initial code for data transformations and Python code for visualizations. If you see an error message while running the generated code, choose Fix with AI to get help resolving it. Here is a sample result:

For ML workflows, use specific prompts like:

Build an XGBoost classification model for churn prediction using the churn table, with purchase frequency, average transaction value, and days since last purchase as features.

This prompt receives a structured response that includes a step-by-step plan; data loading, feature engineering, and model training code using SageMaker AI capabilities; and evaluation metrics. SageMaker Data Agent works best with specific prompts and is optimized for AWS data processing services, including Athena for Apache Spark and SageMaker AI.
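
For reference, the generated code for a prompt like this typically resembles a standard XGBoost workflow. The sketch below is illustrative only and is not the agent’s exact output; it assumes the churn table has already been loaded into a pandas DataFrame named churn_df with the listed feature columns and a churned label.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Assumed: churn_df is a pandas DataFrame loaded from the churn table,
# for example via a Spark SQL query in the notebook.
features = ["purchase_frequency", "avg_transaction_value", "days_since_last_purchase"]
X = churn_df[features]
y = churn_df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))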

To learn more about the new notebook experience, visit the Amazon SageMaker Unified Studio User Guide.

Now available
One-click onboarding and the new notebook experience in Amazon SageMaker Unified Studio are now available in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions. To learn more, visit the SageMaker Unified Studio product page.

Give it a try in the SageMaker console and send feedback to AWS re:Post for SageMaker Unified Studio or through your usual AWS Support contacts.

Channy



from AWS News Blog https://ift.tt/HdZgpiy
via IFTTT

Build production-ready applications without infrastructure complexity using Amazon ECS Express Mode

Deploying containerized applications to production requires navigating hundreds of configuration parameters across load balancers, auto scaling policies, networking, and security groups. This overhead delays time to market and diverts focus from core application development.

Today, I’m excited to announce Amazon ECS Express Mode, a new capability of Amazon Elastic Container Service (Amazon ECS) that helps you launch highly available, scalable containerized applications with a single command. ECS Express Mode automates infrastructure setup including domains, networking, load balancing, and auto scaling through simplified APIs. This means you can focus on building applications while deploying with confidence using Amazon Web Services (AWS) best practices. Furthermore, when your applications evolve and require advanced features, you can seamlessly configure and access the full capabilities of the underlying resources, including Amazon ECS.

You can get started with Amazon ECS Express Mode by navigating to the Amazon ECS console.

Amazon ECS Express Mode provides a simplified interface to the Amazon ECS service resource with new integrations for creating commonly used resources across AWS. ECS Express Mode automatically provisions and configures ECS clusters, task definitions, Application Load Balancers, auto scaling policies, and Amazon Route 53 domains from a single entry point.

Getting started with ECS Express Mode
Let me walk you through how to use Amazon ECS Express Mode. I’ll focus on the console experience, which provides the quickest way to deploy your containerized application.

For this example, I’m using a simple container image application running on Python with the Flask framework. Here’s the Dockerfile of my demo, which I have pushed to an Amazon Elastic Container Registry (Amazon ECR) repository:


# Build stage
FROM python:3.6-slim as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt gunicorn

# Runtime stage
FROM python:3.6-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
COPY app.py .
ENV PATH=/root/.local/bin:$PATH
EXPOSE 80
CMD ["gunicorn", "--bind", "0.0.0.0:80", "app:app"]

On the Express Mode page, I choose Create. The interface is streamlined — I specify my container image URI from Amazon ECR, then select my task execution role and infrastructure role. If you don’t already have these roles, choose Create new role in the drop down to have one created for you from the AWS Identity and Access Management (IAM) managed policy.

If I want to customize the deployment, I can expand the Additional configurations section to define my cluster, container port, health check path, or environment variables.

In this section, I can also adjust CPU, memory, or scaling policies.

Setting up logs in Amazon CloudWatch Logs is something I always configure so I can troubleshoot my applications if needed. When I’m happy with the configurations, I choose Create.

After I choose Create, Express Mode automatically provisions a complete application stack, including an Amazon ECS service with AWS Fargate tasks, Application Load Balancer with health checks, auto scaling policies based on CPU utilization, security groups and networking configuration, and a custom domain with an AWS provided URL. I can also follow the progress in Timeline view on the Resources tab.

If I need to do a programmatic deployment, the same result can be achieved with a single AWS Command Line Interface (AWS CLI) command:

aws ecs create-express-gateway-service \
--image [ACCOUNT_ID].dkr.ecr.us-west-2.amazonaws.com/myapp:latest \
--execution-role-arn arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE] \
--infrastructure-role-arn arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]

After it’s complete, I can see my application URL in the console and access my running application immediately.

After the application is created, I can see the details by visiting the specified cluster, or the default cluster if I didn’t specify one, in the ECS service to monitor performance, view logs, and manage the deployment.

When I need to update my application with a new container version, I can return to the console, select my Express service, and choose Update. I can use the interface to specify a new image URI or adjust resource allocations.

Alternatively, I can use the AWS CLI for updates:

aws ecs update-express-gateway-service \
  --service-arn arn:aws:ecs:us-west-2:[ACCOUNT_ID]:service/[CLUSTER_NAME]/[APP_NAME] \
  --primary-container '{
    "image": "[IMAGE_URI]"
  }'

I find the entire experience reduces setup complexity while still giving me access to all the underlying resources when I need more advanced configurations.

Additional things to know
Here are additional things about ECS Express Mode:

  • Availability – ECS Express Mode is available in all AWS Regions at launch.
  • Infrastructure as Code support – You can use IaC tools such as AWS CloudFormation, AWS Cloud Development Kit (CDK), or Terraform to deploy your applications using Amazon ECS Express Mode.
  • Pricing – There is no additional charge to use Amazon ECS Express Mode. You pay for AWS resources created to launch and run your application.
  • Application Load Balancer sharing – The ALB created is automatically shared across up to 25 ECS services using host-header based listener rules. This helps distribute the cost of the ALB significantly.

Get started with Amazon ECS Express Mode through the Amazon ECS console. Learn more on the Amazon ECS documentation page.

Happy building!
Donnie



from AWS News Blog https://ift.tt/nKZMaWO
via IFTTT