Migrating from an on-premises Hadoop solution to Google Cloud requires a shift in approach. A typical on-premises Hadoop system consists of a monolithic cluster that supports many workloads, often across multiple business areas. As a result, the system becomes more complex over time and can require administrators to make compromises to get everything working in the monolithic cluster. When you move your Hadoop system to Google Cloud, you can reduce the administrative complexity. However, to achieve that simplification, and to get the most efficient processing in Google Cloud at the lowest cost, you need to rethink how to structure your data and jobs.

Because Dataproc runs Hadoop on Google Cloud, using a persistent Dataproc cluster to replicate your on-premises setup might seem like the easiest solution. However, there are some limitations to that approach:

  • Keeping your data in a persistent HDFS cluster with Dataproc is more expensive than storing it in Cloud Storage, which is the approach we recommend, as explained later. Keeping data in an HDFS cluster also limits your ability to use your data with other Google Cloud products, as illustrated in the sketch that follows this list.
  • Augmenting or replacing some of your open-source-based tools with other Google Cloud services can be more efficient or economical for particular use cases.
  • Running your jobs on a single, persistent Dataproc cluster is more difficult to manage than shifting to targeted clusters that serve individual jobs or groups of closely related jobs.
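For example, a job that you move to Dataproc can read its data directly from Cloud Storage instead of from cluster-local HDFS, which keeps the data available to other Google Cloud products even when no cluster is running. The following PySpark sketch illustrates the idea, assuming the data is stored in Parquet format; the bucket names and paths are hypothetical.

```python
# A minimal PySpark sketch: the same job logic reads from Cloud Storage
# instead of HDFS by changing only the input and output paths.
# The bucket names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sales-report").getOrCreate()

# On-premises, the job might have read from cluster-local HDFS:
#   sales = spark.read.parquet("hdfs:///data/sales/2023/")
# On Dataproc, the preinstalled Cloud Storage connector lets the same job
# read directly from a bucket:
sales = spark.read.parquet("gs://example-sales-data/2023/")

(sales.groupBy("region")
      .sum("amount")
      .write.mode("overwrite")
      .parquet("gs://example-sales-reports/2023-by-region/"))
```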

The most cost-effective and flexible way to migrate your Hadoop system to Google Cloud is to shift away from thinking in terms of large, multi-purpose, persistent clusters and instead think about small, short-lived clusters that are designed to run specific jobs. You store your data in Cloud Storage to support multiple, temporary processing clusters. This model is often called the ephemeral model, because the clusters you use for processing jobs are allocated as needed and are released as jobs finish.
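As a concrete illustration of the ephemeral model, the following sketch uses the Dataproc client library for Python to create a cluster, run one job, and delete the cluster as soon as the job finishes. The project, region, cluster size, and job file are placeholders, and a production workflow would typically add error handling or use an orchestration tool such as Dataproc workflow templates or Cloud Composer rather than a hand-rolled script.

```python
# A sketch of the ephemeral model: create a cluster, run one job,
# delete the cluster. All names and sizes are placeholders.
from google.cloud import dataproc_v1

project_id = "example-project"
region = "us-central1"
cluster_name = "ephemeral-sales-report"

endpoint = {"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
cluster_client = dataproc_v1.ClusterControllerClient(client_options=endpoint)
job_client = dataproc_v1.JobControllerClient(client_options=endpoint)

# 1. Create a cluster sized for this job only.
cluster = {
    "project_id": project_id,
    "cluster_name": cluster_name,
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
cluster_client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
).result()

# 2. Submit the job. It reads and writes Cloud Storage, not cluster HDFS,
#    so nothing is lost when the cluster goes away.
job = {
    "placement": {"cluster_name": cluster_name},
    "pyspark_job": {"main_python_file_uri": "gs://example-jobs/sales_report.py"},
}
job_client.submit_job_as_operation(
    request={"project_id": project_id, "region": region, "job": job}
).result()

# 3. Delete the cluster as soon as the job is done.
cluster_client.delete_cluster(
    request={"project_id": project_id, "region": region, "cluster_name": cluster_name}
).result()
```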

The following diagram shows a hypothetical migration from an on-premises system to an ephemeral model on Google Cloud.

Diagram illustrating how on-premises clusters can be rearranged when migrating to Google Cloud.

The example moves four jobs that run on two on-premises clusters to Dataproc. The ephemeral clusters that are used to run the jobs in Google Cloud are defined to maximize efficiency for individual jobs. The first two jobs use the same cluster, while the third and fourth jobs each run on their own cluster. When you migrate your own jobs, you can customize and optimize clusters for individual jobs or for groups of jobs as makes sense for your specific work. Dataproc helps you quickly define multiple clusters, bring them online, and scale them to suit your needs.
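For instance, the shared cluster for the first two jobs might be small and general purpose, while the cluster for the third job might use larger, memory-heavy machines. The cluster definitions below are purely illustrative; the names, worker counts, and machine types are hypothetical and would be tuned to your actual workloads.

```python
# Illustrative cluster shapes for the example migration. All names,
# worker counts, and machine types are hypothetical.
shared_cluster_for_jobs_1_and_2 = {
    "cluster_name": "nightly-etl",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

cluster_for_job_3 = {
    "cluster_name": "quarterly-analytics",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-highmem-8"},
        "worker_config": {"num_instances": 8, "machine_type_uri": "n1-highmem-8"},
    },
}
```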

The data in the example is moved from two on-premises HDFS clusters to Cloud Storage buckets. The data in the first cluster is divided among multiple buckets, and the data in the second cluster is moved to a single bucket. You can customize the structure of your data in Cloud Storage to suit the needs of your applications and your business.
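The following sketch uses the Cloud Storage client library for Python to show one possible layout: the first cluster's data split across purpose-specific buckets, and the second cluster's data placed in a single bucket organized by prefix. The bucket names and file paths are hypothetical, and a large migration would more likely use a bulk transfer tool, such as hadoop distcp with the Cloud Storage connector or Storage Transfer Service, than per-file uploads.

```python
# A sketch of copying exported HDFS files into Cloud Storage buckets.
# Bucket names, object prefixes, and local paths are hypothetical.
from google.cloud import storage

client = storage.Client()

# First on-premises cluster: data divided among purpose-specific buckets.
sales_bucket = client.bucket("example-sales-data")
logs_bucket = client.bucket("example-clickstream-logs")

sales_bucket.blob("2023/part-00000.parquet").upload_from_filename(
    "/exports/sales/2023/part-00000.parquet"
)
logs_bucket.blob("2023/06/part-00000.json.gz").upload_from_filename(
    "/exports/logs/2023/06/part-00000.json.gz"
)

# Second on-premises cluster: everything in one bucket, organized by prefix.
archive_bucket = client.bucket("example-finance-archive")
archive_bucket.blob("ledger/2023/06/part-00000.parquet").upload_from_filename(
    "/exports/finance/ledger/2023/06/part-00000.parquet"
)
```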

The example migration captures the beginning and ending states of a complete migration to Google Cloud. Presented this way, it might look like a single step, but you'll get the best results if you don't think of moving to Google Cloud as a one-time, complete migration. Instead, think of it as refactoring your solutions to use a new set of tools in ways that weren't possible on-premises. To make such a refactoring work, we recommend migrating incrementally.

Here are the recommended steps for migrating your workflows to Google Cloud:

  1. Move your data first.
    • Move your data into Cloud Storage buckets.
    • Start small. Use backup or archived data to minimize the impact on your existing Hadoop system.
  2. Experiment.
    • Use a subset of data to test and experiment. Make a small-scale proof of concept for each of your jobs, as shown in the sketch after this list.
    • Try new approaches to working with your data.
    • Adjust to Google Cloud and cloud-computing paradigms.
  3. Think in terms of specialized, ephemeral clusters.
    • Use the smallest clusters you can—scope them to single jobs or small groups of closely related jobs.
    • Create clusters each time you need them for a job and delete them when you're done.
  4. Use Google Cloud tools wherever appropriate.
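For the experimentation step, a proof of concept can run an existing job's core logic against a small slice of the archived data that you've already copied to Cloud Storage, using a correspondingly small ephemeral cluster. The following PySpark sketch shows the idea; the bucket, paths, and column names are hypothetical.

```python
# A small proof-of-concept sketch: run one job's core transformations
# against a single month of archived data in Cloud Storage.
# The bucket, paths, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("poc-ledger-summary").getOrCreate()

# Read only one month of archived data rather than the full history.
sample = spark.read.parquet("gs://example-finance-archive/ledger/2023/06/")

# Exercise the same aggregation the production job performs.
summary = sample.groupBy("account").sum("amount")

summary.write.mode("overwrite").parquet(
    "gs://example-experiments/poc-ledger-summary/"
)
```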