6 Best Practices for Building a Sustainable Cloud Data Lake

Impetus Technologies
4 min read · Apr 24, 2020

By Sulagna Ganguly

Enterprises hold huge amounts of unstructured data that, if mined, can provide valuable insights. To tap that potential, enterprises are switching from traditional warehousing to adaptive data lakes. The shift changes the needs and practices for how that storage should be built. From our experience, we have identified six core practices for building a reliable, secure, sustainable, and flexible cloud data lake.

1. Start with an enterprise foundation

A cloud data lake is designed to support multiple data types and to empower analytics and business intelligence units across the organization. Use cases differ across organizations and change over time. To support the growth and potential of a data lake, companies must start with a robust foundation. Before collecting and managing data, review your security needs and challenges, develop appropriate reusable templates, and put access and financial controls in place.

While there are standard data warehousing best practices, enterprises need to have a holistic view to leverage the data lake. For example, budgeting and ROI need to consider all units that might benefit from the data, goal alignment should be cross-organization, and overall design should be built to add more users and data over time.

2. Understand compliance ahead of implementation

While much of the tech world focuses on failing fast in early efforts, a robust enterprise data lake needs a stable and secure foundation before it can reach the fail-fast stage. Understanding the compliance requirements you face ahead of data collection and use can significantly speed up your ability to offer products and support using the data lake. Newer regulations, such as the California Consumer Privacy Act (CCPA), are changing how data is stored, used, and managed, including when consumers can tell you to remove it all.

Looking at such regulations and determining how to comply not only with their specific requirements but also with the spirit of the law can help you construct a Hadoop data lake that is malleable enough to remove data or change access without broader data loss or harm.
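To make the idea of removing one consumer's data "without broader data loss" concrete, here is a minimal sketch. It assumes a toy in-memory lake and a hypothetical `consumer_id` field on every record; neither comes from a real product or from any specific regulation, and a production lake would do this against its actual storage and catalog.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Lake:
    """Toy in-memory stand-in for a data lake: dataset name -> list of records."""
    datasets: dict[str, list[dict]] = field(default_factory=dict)

    def erase_consumer(self, consumer_id: str) -> dict[str, int]:
        """Remove every record for one consumer; return per-dataset removal counts."""
        removed = {}
        for name, records in self.datasets.items():
            # Keep everything that does not belong to the requesting consumer.
            kept = [r for r in records if r.get("consumer_id") != consumer_id]
            removed[name] = len(records) - len(kept)
            self.datasets[name] = kept
        return removed


# Hypothetical sample data for illustration only.
lake = Lake({
    "orders": [{"consumer_id": "c1", "total": 40}, {"consumer_id": "c2", "total": 15}],
    "clicks": [{"consumer_id": "c1", "page": "/home"}],
})
report = lake.erase_consumer("c1")
print(report)
```

The returned counts double as a simple audit trail: they show which datasets were touched by the request and that other consumers' records were left in place.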

Sustainability is mostly a question of flexibility. Architecture that is adaptive and created to maintain control through change — especially regulatory change — will help you build an enterprise data lake that is sustainable and fruitful.

3. Integrate DevOps clearly

DevOps is a core component of keeping your data lake healthy. Protect your investment by putting together clear guidelines for data collection, management, and access. Then, put practices into place to check and ensure your guidance is always followed.

Preventative measures, such as establishing trustworthy sources or limiting the collection of low-value information that comes with higher regulatory burdens, will protect you across applications.
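The preventative measures above could be enforced by a small ingestion gate. The following is a sketch under stated assumptions: the source names in `TRUSTED_SOURCES` and the field names in `BLOCKED_FIELDS` are hypothetical placeholders, and which sources and fields belong on those lists is a decision for your own governance team.

```python
from typing import Optional

# Hypothetical allow-list of sources your team has vetted.
TRUSTED_SOURCES = {"crm", "billing"}

# Hypothetical low-value fields that carry a high regulatory burden.
BLOCKED_FIELDS = {"ssn", "date_of_birth"}


def admit(record: dict, source: str) -> Optional[dict]:
    """Return a cleaned copy of the record, or None if the source is not trusted."""
    if source not in TRUSTED_SOURCES:
        return None
    # Drop blocked fields before the record ever lands in the lake.
    return {k: v for k, v in record.items() if k not in BLOCKED_FIELDS}


accepted = admit({"name": "Ada", "ssn": "000-00-0000", "plan": "pro"}, source="crm")
rejected = admit({"name": "Bob"}, source="scraped-forum")
print(accepted, rejected)
```

Running a check like this in the pipeline, rather than relying on written guidance alone, is one way to confirm the guidelines are followed on every load.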

DevOps is a cultural focus that typically helps improve the delivery pipeline of products and services, making it easier to compete. For your data lake, it can be more inward-facing and have the feedback loop focus on security and access, speeding up your time to adopt new applications of data while protecting your lake.

4. Plan for expansion

Ultimately, a cloud data lake should grow together with your business, without limiting your capabilities or running into data it can’t access and integrate. To achieve scalability, your data lake must be engineered for expansion.

Beyond standard integration capabilities, look for workable APIs built on industry standards, and create a framework for ensuring compliance. The goal is to need only minor changes to make your enterprise data lake compliant as soon as you’re ready to expand to a new country or industry.
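One way to picture such a compliance framework is as per-market configuration rather than per-market code. The sketch below is illustrative only: the region names, the `RegionPolicy` fields, and every value are hypothetical and are not drawn from any actual regulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionPolicy:
    """Compliance settings for one market. All values here are illustrative."""
    retention_days: int        # how long records may be kept
    erasure_on_request: bool   # must records be deleted when a consumer asks?
    storage_location: str      # where the data has to reside


# Hypothetical regions and values, not legal guidance.
POLICIES = {
    "region-a": RegionPolicy(retention_days=365, erasure_on_request=True, storage_location="region-a"),
    "region-b": RegionPolicy(retention_days=730, erasure_on_request=False, storage_location="any"),
}


def must_purge(region: str, record_age_days: int) -> bool:
    """Return True when a record has outlived its region's retention window."""
    return record_age_days > POLICIES[region].retention_days


# Expanding to a new market becomes one new entry rather than a redesign.
POLICIES["region-c"] = RegionPolicy(retention_days=90, erasure_on_request=True, storage_location="region-c")
print(must_purge("region-c", 120), must_purge("region-b", 120))
```

Because the rules live in data, the pipeline code that calls `must_purge` does not change when a new country or industry is added.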

5. Keep an eye on cost

The enterprise focus of your foundation and early development must translate into a cost-conscious approach for the rest of the life of a Hadoop data lake. Go beyond tracking the hard and soft costs of your lake and look for ways to achieve a higher ROI in its development or application.

For instance, you can use existing systems like the Impetus Data Lake to generate better time-to-value by putting all capabilities — metadata management, governance, ETL, quality assurance checks, audits, and business intelligence generation — together. Sometimes, it’s as simple as ensuring you have plug-and-play support for new data feeds.

Working in a single platform and solution can help you control costs and generate a positive return more quickly, while also maintaining the security and tooling you need.
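As a simple illustration of tracking hard and soft costs against return, the sketch below uses the standard ROI formula, (benefit - total cost) / total cost. All figures are made up for the example; what counts as a hard or soft cost in your organization is an assumption you would need to define yourself.

```python
def roi(benefit: float, hard_costs: float, soft_costs: float) -> float:
    """Return ROI as a fraction: (benefit - total cost) / total cost."""
    total_cost = hard_costs + soft_costs
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    return (benefit - total_cost) / total_cost


# Hypothetical annual figures: infrastructure and licences as hard costs,
# training and staff time as soft costs.
example = roi(benefit=210_000, hard_costs=120_000, soft_costs=30_000)
print(f"{example:.0%}")
```

Recomputing this per business unit, rather than once for the whole lake, can show where the lake is paying for itself and where it is not.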

6. Choose the right tools

Tool selection is one of the most critical choices you’ll make. Many customers have an affinity for traditional tools, which may not work in the cloud. We recommend an in-depth review of capabilities and gaps to protect your efforts. This will also help you avoid the trap of becoming too excited about native cloud services, many of which have gaps of their own or might not be a strong fit for a transition.

Take time to assess your requirements and bring in the right tools. Consider your available off-the-shelf options as well as what can be built custom.

Review each option against your criteria for functionality as well as its long-term business impact. Avoid lock-in when possible, especially if there is no clear roadmap to support you. Ask partners how data is shared and accessed, and where you might have leverage over its use. Push for access and interoperability, and be wary of any proprietary data formats.
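A review like this can be made repeatable with a weighted scoring matrix. The following is only a sketch: the criteria names, the weights, the 0-5 ratings, and the two tools are all hypothetical, and your own assessment would supply the real ones.

```python
def weighted_score(ratings: dict, weights: dict) -> float:
    """Weighted average of 0-5 ratings; criteria with larger weights count more."""
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Hypothetical weights reflecting the priorities in this section.
weights = {"functionality": 3, "interoperability": 2, "open_formats": 2, "roadmap": 1}

# Hypothetical ratings for two candidate tools.
candidates = {
    "tool_a": {"functionality": 5, "interoperability": 2, "open_formats": 1, "roadmap": 3},
    "tool_b": {"functionality": 3, "interoperability": 4, "open_formats": 5, "roadmap": 3},
}

scores = {name: weighted_score(r, weights) for name, r in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

In this made-up example the tool with stronger interoperability and open formats outranks the one with more features, which is the kind of trade-off the weights are meant to surface.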

Build these best practices into your selection efforts and change management regimen. It’s a complicated situation, but taking your time and focusing on reliability and access can help protect you. If you want to know more about best practices, download our e-book or watch our recent cloud data lake webinar.


Written by Impetus Technologies

Impetus is focused on creating powerful enterprises through deep data awareness, data integration, and advanced data analytics. https://bit.ly/38pelOr
