Combining data science workstations and the cloud
Why do I need a data science workstation when I can just use the cloud?
How to truly optimise your workflows
In the rush to give data scientists a much-needed step up from the limitations of a standard PC, it’s tempting to jump straight from traditional laptops and desktops to cloud solutions. While a cloud solution can offer tremendous capacity and computing power when needed, it’s not always the fastest, easiest, most secure or most cost-effective option for developing data science models.
That’s why it’s best not to use the cloud as a standalone solution but instead in combination with local workstations specifically designed for data scientists.
If it were a question of using the cloud vs. using a typical office laptop or even desktop, cloud solutions would be the answer, but that’s not the case. A standard office computer is not the right local hardware. What’s needed is a workstation built for the unique needs of data scientists. While a data science workstation might look like a typical laptop or desktop at a glance, what sits under the hood is quite different and changes everything.
“Why do I need a data science workstation when I can just use the cloud?”
The local-vs.-cloud question doesn’t have a simple either-or answer for data scientists, machine learning engineers and IT decision-makers, because cloud solutions alone don’t provide everything that’s needed.
While cloud solutions offer tremendous power and capacity, a local data science workstation runs many data science applications more efficiently, more securely and without risk of unexpected costs—particularly when developing and testing models.
A data science workstation and the cloud are both vital tools, and each plays a valuable role with distinct functions. Picture the data science workstation as a truck and the cloud as a train. Both can haul large loads. The truck can do so with great flexibility, going almost anywhere. The train can move even more massive loads with excellent efficiency, but only when used at full capacity. With a workstation, as with a truck, you own it outright at a fixed, predictable cost and can use it however and wherever needed. With the cloud, as with rail, you pay as you go for capacity and need to schedule and budget carefully. And as with trucks and trains, workstations and the cloud need not be an either/or choice. Each should be used where its strengths deliver the greatest benefit.
When the data science workstation is the right tool
Consider the range of work you’re doing, from modelling and analytics on the low end to machine learning on the high end. If you’re accustomed to sending most of your data science workloads to the cloud, you may not think your choice of local computer matters. But suppose you could start shifting more of those workloads to a data science workstation. That would allow you to retain control, keep your dataset close by and experiment more.
The right workstation also offers latency advantages over the cloud when immediacy is important. One interesting example is the use of data science workstations to win auto races. While racing teams may use the cloud outside of races to mine huge datasets for a competitive advantage, during a race, connectivity issues and cloud latency delays of around 40 seconds can result in lost races. That’s when today’s savvy racing teams turn to onsite workstations. Right on the spot, they use real-time data and modelling to decide when to make the next pit stop, change tyres, refuel and more. When a split second can make the difference between victory and defeat, workstations are helping to drive championships.
When the cloud is the right tool
Depending on the capacities of your local workstation and the scale of your projects, you may need more resources than you can get locally. Your datasets may be too large, or you may be training with streaming data not suited to local computing. Occasionally, reaching the desired threshold of model accuracy may require days or weeks of training, far more than you could devote locally. These are the times to turn to cloud solutions.
Which tool checks the boxes best?
Speed and capacity aren’t the only boxes to check to make data science workflows their best. Here’s a look at the respective strengths of data science workstations and how they relate to cloud solutions.
Memory
Cloud solutions hypothetically offer unlimited memory, but it’s not always the single-node memory needed. “The only question about whether you can run the workload locally is, ‘Do I have enough single-node memory?’” says David Liu, Lead Data Science and AI Solutions Engineer at Intel. “When you understand what makes a data science workload tick and how to get performance out of it, you see that the system needs to be single node. That’s because certain tools used for things like data frame manipulation and even some basic machine learning algorithms used for statistics may not be supported across multiple nodes. If you’re going to prove that you made the model correctly, single-node memory is important.”
Data science workstations must handle the weight of the data and the speed of the workload.
With local computing, memory is single node by definition; the only question is whether the workstation has sufficient memory capacity. With cloud solutions, it is essential to make sure the memory is drawn from a single node. Either way, data science workstations must handle the weight of the data and the speed of the workload, and the demands of data science projects are growing. Where 8 to 15 GB of RAM used to be sufficient, the average data science project now requires 32 to 128 GB. In cases where cloud solutions aren’t an option or the speed demands are higher, such as workloads that need to process petabytes of data quickly, the demand for RAM grows even more. And as data science continues to advance, that’s destined to remain the case.
A workstation with sufficient memory frees you to compile and complete data analysis in almost any way you need without so much as a hiccup.
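If you want a quick sense of whether a dataset will fit in single-node memory before committing to a full load, a rough check along the following lines can help. This is a minimal sketch assuming a pandas workflow; the file name is a hypothetical placeholder, and psutil is used only to read available RAM.

```python
# Minimal sketch: estimate whether a dataset fits in single-node memory.
# The CSV path is a hypothetical placeholder; pandas and psutil are assumed installed.
import pandas as pd
import psutil

CSV_PATH = "transactions.csv"  # hypothetical dataset

# Load a small sample to estimate the per-row memory footprint.
sample = pd.read_csv(CSV_PATH, nrows=100_000)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Count rows without loading the whole file into memory.
with open(CSV_PATH) as f:
    total_rows = sum(1 for _ in f) - 1  # subtract the header line

estimated_gb = bytes_per_row * total_rows / 1e9
available_gb = psutil.virtual_memory().available / 1e9

print(f"Estimated in-memory size: {estimated_gb:.1f} GB")
print(f"Available RAM:            {available_gb:.1f} GB")
if estimated_gb > 0.7 * available_gb:  # leave headroom for the work itself
    print("This dataset may not fit comfortably in single-node memory.")
```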
Storage
Storage wouldn’t appear to be an issue with the unlimited capacities of cloud computing, but accessing the stored data from the cloud is not always efficient.
The greater your local storage capacity, the easier it is to scale out your experimentation and modeling across hundreds of millions or billions of data points.
Maybe you can work with a 1 TB or 2 TB dataset over an internet connection, but the work will go much faster if the data is loaded locally. Having multiple terabytes of local storage also eliminates risk variables in transporting data. You incur risk every time you move sensitive data between a secure source and your off-site computer, but you mitigate that risk when you have enough local storage.
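To put the local-versus-remote difference in rough numbers, here is a back-of-envelope sketch. The link speed and drive throughput below are illustrative assumptions, not measurements of any particular connection or workstation.

```python
# Back-of-envelope: time to pull a 1 TB dataset over a network vs. reading it from local NVMe.
# All throughput figures below are illustrative assumptions.
dataset_bytes = 1.0 * 1e12              # 1 TB dataset

network_gbps = 1.0                      # assumed 1 Gbit/s connection
network_bytes_per_s = network_gbps * 1e9 / 8

nvme_gb_per_s = 3.0                     # assumed sequential read speed of a local NVMe SSD
nvme_bytes_per_s = nvme_gb_per_s * 1e9

print(f"Pull over network: {dataset_bytes / network_bytes_per_s / 3600:.1f} hours")
print(f"Read from NVMe:    {dataset_bytes / nvme_bytes_per_s / 60:.0f} minutes")
# Roughly 2.2 hours over the wire vs. about 6 minutes locally, before any retries or congestion.
```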
GPU
Graphics processing units (GPUs) are increasingly essential in data science, artificial intelligence (AI) and machine learning. GPUs are most helpful in processing large chunks of data in parallel and applying the same calculations to them repeatedly. In a typical data science workflow, that type of processing occurs during model training. GPUs are also helpful at the end of the data science workflow when deploying the trained model to production for inference.
“A full, GPU-accelerated AI stack is needed, and large GPU memory is a prerequisite to crunch large datasets required by AI algorithms as performance is often related to whether the data set fits in GPU memory.”
Andre Franklin
Senior Product Marketing Manager at NVIDIA, an HP Alliance Partner
Just as a CPU depends on task-specific software, a GPU depends on software for the extreme demands of data science and AI workloads. “A full, GPU-accelerated AI stack is needed, and large GPU memory is a prerequisite to crunch large datasets required by AI algorithms,” writes Andre Franklin, Senior Product Marketing Manager at NVIDIA, an HP Alliance Partner. “Software breadth for GPU-accelerated platforms can allow developers to use a prebuilt software pipeline for specific tasks like computer vision, natural language processing and recommender systems.”
Whether running data science via the cloud or a local workstation, you’ll want to make sure you’re selecting an option equipped with GPUs. The options range from cloud resources set up for deep learning all the way to a GPU-equipped laptop like the HP ZBook Studio, which lets you take the GPU power of NVIDIA RTX™ graphics wherever you need to go.
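As a rough illustration of what that GPU acceleration looks like in code, the sketch below uses PyTorch, one of the frameworks discussed in the next section, to run a single training step and an inference pass on whatever GPU is available. The model, data and hyperparameters are placeholders, not a recommended configuration.

```python
# Minimal sketch of GPU-accelerated training with PyTorch.
# The model, data shapes and hyperparameters here are placeholders.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# A synthetic batch stands in for real training data.
inputs = torch.randn(4096, 128, device=device)
targets = torch.randn(4096, 1, device=device)

# One training step: the forward and backward passes run in parallel on the GPU.
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()

# Inference reuses the same device, which is why GPUs also help at deployment time.
with torch.no_grad():
    predictions = model(inputs[:8])
print(loss.item(), predictions.shape)
```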
The software stack
Those data science applications that lean heavily on the GPU also require a suitable complement of software tools, including PyTorch, TensorFlow, Keras and RAPIDS.
Certain cloud solutions and select data science workstations, such as those designed specifically for data science by Z by HP, come with this sort of software preloaded. While it may not seem like a significant issue, this can seriously lighten the load for data scientists, letting them stick to their area of expertise, gathering data and turning it into valuable insights, rather than dealing with confusion over software versions and updates.
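One small, practical benefit of a preloaded stack is that you can verify in seconds what you have to work with. The sketch below is one way to make that check; it simply reports whichever of these packages happen to be installed and whether PyTorch can see a GPU.

```python
# Quick sanity check of an installed data science stack and GPU visibility.
# Any package that is not installed is simply reported as missing.
import importlib

for name in ["torch", "tensorflow", "keras", "cudf"]:  # cudf is part of RAPIDS
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'unknown version')}")
    except ImportError:
        print(f"{name}: not installed")

try:
    import torch
    print("CUDA visible to PyTorch:", torch.cuda.is_available())
except ImportError:
    pass
```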
Serious data science needs a Linux environment, while everyday functions like email and web conferencing are better in Windows. With WSL 2, both Windows and Linux can be used on the same workstation, complementing the workflow advantages of a preloaded software stack as well as saving time and desk space.
Security and mobility
Security is vital in every area of computing today, but especially so in data science. Big data is a high-value target often containing sensitive information and valuable intellectual property. Cloud computing requires sending sensitive information across the internet, so the system’s security is only as strong as its weakest link. That could be the endpoint device itself, the cloud connection, a Wi-Fi router or any number of other potential points of exposure. A workstation that keeps data native is less exposed than a public or even private cloud option.
A portable data science workstation, like the HP ZBook Studio, enables data scientists to work securely wherever they are without exposure to weak security links. And what’s more, with remote desktop systems like those built into HP workstations, only pixels are sent beyond the workstation, meaning that sensitive intellectual property always remains in a secure environment, even when the data scientist is viewing and manipulating it remotely.
“For me,” says Intel’s Liu, “from the data science perspective, the best approach is to have an exact copy of my workstation’s hardware in some virtualised place. Then I can access the system directly from anywhere.”
It’s also important that the virtual image runs a full desktop OS—be that Linux or Windows—and not just a command line. Engineers, developers and high-performance computing scientists may work from the command line, but most data scientists come from different backgrounds and need a complete user interface for their tools. That’s one way to bring ubiquity to data science: by taking your local configuration and having versions of it in the cloud that you access with virtualised workstation software.
Cost control
Data science offers a tremendous potential payoff, but that doesn’t mean organisations are simply handing data scientists and IT departments a blank check.
Workstations represent a fixed cost, while public cloud options follow pay-per-use models. IT departments can estimate the cost of the hardware lifecycle with workstations, while cloud usage can vary significantly throughout any given project and create budgeting confusion.
The cost of experimenting
Data science experiments offer a case in point for the budgeting issue. Like all development, data science centres on the trial-and-error loop. That means you need to be able to try and fail as much as you like without fearing the cost of each error. The less dependent you are on external resources, the less time and money your mistakes cost you.
That is particularly true in the context of the cloud. Leasing cloud resources may have a reputation for being less expensive than purchasing computing hardware, but it is not free. And it does add up.
How bold will your approach be when you’re being charged for mistakes? Not as bold as it would be if you were working locally.
“Suppose you’re using the cloud for all your data science work,” says Lenny Isler, Business Development Manager for data science and AI at HP. “And suppose that it costs you $12 per hour. How bold will your approach be when you’re being charged for mistakes? Not as bold as it would be if you were working locally. Your local computer represents a fixed cost, meaning that a local computing mistake costs only time, not money. It doesn’t cost you money every time you try something new.”
When you know that mistakes cost time but not more money, you’ll be more willing to hazard new ideas and innovations. In data science, that’s a significant advantage at every point where experimentation is essential.
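To make that trade-off concrete, here is a simple break-even sketch built on the $12-per-hour figure quoted above. The workstation price and weekly usage are illustrative assumptions, not pricing guidance.

```python
# Break-even sketch: fixed-cost workstation vs. pay-per-use cloud.
# The workstation price and hours of use are assumptions for illustration.
cloud_rate_per_hour = 12.00   # figure quoted in the article
workstation_cost = 8_000.00   # assumed purchase price
hours_per_week = 30           # assumed hands-on experimentation time

breakeven_hours = workstation_cost / cloud_rate_per_hour
breakeven_weeks = breakeven_hours / hours_per_week

print(f"Break-even after ~{breakeven_hours:.0f} hours "
      f"(about {breakeven_weeks:.0f} weeks of experimentation)")
# ~667 hours, roughly 22 weeks here; beyond that, extra experiments cost only time locally.
```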
Finding the right balance
Are you making the most of how you divide your work between local computing resources—like a company-issued laptop computer or a high-performance data science workstation—and a public or private cloud? What are your guidelines for the tasks you perform locally and those you send to the cloud?
Using the right tool for the job is key to any task, and data science is no exception.
By understanding what makes an effective data science workstation or cloud tool, you can maximise results and avoid challenges like security exposure, cost overruns and long waits.
And by partitioning tasks to play to the strengths of workstations and the cloud, instead of trying to use one or the other as a one-size-fits-all solution, you’ll raise efficiency across the board.
The true data science workstation
Not all workstation-level computers are designed for data science. Z by HP, the advanced computing division of HP, has gone the extra mile in creating workstations from nimble laptops to high-performance desktops shaped by data scientists for their needs.
Find out how the Z by HP lineup, from the light and thin ZBook Studio to the mobile and powerful ZBook Fury, the rack-mounted Z4 R and the monster of the lineup, the Z8 desktop, can help you move fluidly between local and cloud computing. Z by HP data science workstations are configured out of the box to maximise performance, memory, storage and GPU parallel processing with NVIDIA RTX graphics. They combine data science hardware capabilities with the preloaded Z by HP Data Science Software Stack, available on Ubuntu and Windows with WSL 2. The preconfigured software stack includes TensorFlow, Keras, PyTorch, Git, Visual Studio Code, PyCharm and RAPIDS. Shaped by input from data scientists, these specially configured Z by HP workstations assist in developing and testing models, offer the tools needed to achieve enterprise agility and come with the performance to deliver competitive advantage.
Z by HP for Data Scientists & Analysts
Get rapid results from your most demanding datasets, train models and create visualisations with Z by HP data science laptop and desktop workstations.
Exceptional performance with Intel® Xeon® and Intel® Core™ i9 processors.