The Great Escape Online Community

[Slashdot] - UC Berkeley Launches SkyPilot To Help Navigate Soaring Cloud Costs



Researchers at U.C. Berkeley's Sky Computing Lab have launched SkyPilot, an open source framework for running ML and data science batch jobs on any cloud, or multiple clouds, with a single cloud-agnostic interface. Datanami reports:
SkyPilot uses an algorithm to determine which cloud zone or service provider is the most cost-effective for a given project. The program considers a workload's resource requirements (whether it needs CPUs, GPUs, or TPUs), automatically determines which locations (zone/region/cloud) have compute resources available to complete the job, and then sends the job to the least expensive option to execute.
The solution automates some of the more challenging aspects of running workloads on the cloud. SkyPilot's makers say the program can reliably provision a cluster, with automatic failover to other locations if capacity or quota errors occur; sync user code and files from local or cloud buckets to the cluster; and manage job queueing and execution. The researchers claim this comes with substantially reduced costs, sometimes by more than 3x.
SkyPilot developer and postdoctoral researcher Zongheng Yang said in a blog post that the growing trend of multi-cloud and multi-region strategies led the team to build SkyPilot, calling it an "intercloud broker." He notes that organizations are strategically choosing a multi-cloud approach for reasons that include higher reliability, avoiding cloud vendor lock-in, and stronger negotiation leverage.
To save costs, SkyPilot leverages the large price differences between cloud providers for similar hardware resources. Yang gives the example of Nvidia A100 GPUs: Azure currently offers the cheapest A100 instances, while Google Cloud and AWS charge premiums of 8% and 20%, respectively, for the same computing power. For CPUs, some price differences can be over 50%. [...]
The project has been under active development for over a year in Berkeley's Sky Computing Lab, according to Yang, and is being used by more than 10 organizations for use cases including GPU/TPU model training, distributed hyperparameter tuning, and batch jobs on CPU spot instances. Yang says users are reporting benefits including reliable provisioning of GPU instances, queueing multiple jobs on a cluster, and concurrently running hundreds of hyperparameter trials.
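
The placement behavior described above (filter by required hardware, try the cheapest location first, fail over on capacity or quota errors) can be illustrated with a rough Python sketch. This is not SkyPilot's API or its actual algorithm; `Offer`, `CapacityError`, `rank_offers`, `provision_with_failover`, and `fake_provision` are hypothetical names, the zone labels are made up, and the prices are relative units normalized so that the cheapest A100 offer is 1.00, with the 8% and 20% premiums quoted in the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    cloud: str
    zone: str            # hypothetical zone label, not a real zone name
    accelerator: str
    hourly_price: float  # relative units, not real prices


class CapacityError(Exception):
    """Hypothetical error for a location with no capacity or quota."""


def rank_offers(offers, accelerator):
    """Keep offers with the requested accelerator, cheapest first."""
    matching = [o for o in offers if o.accelerator == accelerator]
    return sorted(matching, key=lambda o: o.hourly_price)


def provision_with_failover(offers, accelerator, provision):
    """Try candidates in price order, moving on when one lacks capacity."""
    for offer in rank_offers(offers, accelerator):
        try:
            provision(offer)
            return offer
        except CapacityError:
            continue  # fail over to the next-cheapest location
    raise RuntimeError(f"no location could provide {accelerator}")


# Assumed catalog: Azure as the 1.00 baseline, Google Cloud +8%, AWS +20%.
OFFERS = [
    Offer("aws", "zone-a", "A100", 1.20),
    Offer("gcp", "zone-b", "A100", 1.08),
    Offer("azure", "zone-c", "A100", 1.00),
]


def fake_provision(offer):
    """Stand-in for a real provisioning call; pretends Azure is full."""
    if offer.cloud == "azure":
        raise CapacityError(offer.zone)


chosen = provision_with_failover(OFFERS, "A100", fake_provision)
print(chosen.cloud, chosen.hourly_price)
```

In this toy run the cheapest offer is simulated as out of capacity, so the job lands on the next-cheapest one. A real broker would also need live price and availability data, which this sketch does not model.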


Read more of this story at Slashdot.

