Tech Decisions
May 23, 2024

Episode 4 | Gopinath Balakrishnan on Important Architectural Decisions in Cloud Migrations

Summary

In this episode, Gopinath Balakrishnan, a cloud architect at Google, discusses the intricacies of hybrid and multi-cloud connectivity, emphasizing the migration challenges that companies face, including the loss of control over physical layers and the need for robust data center management. He explores the trade-offs and motivations behind enterprises transitioning to public cloud environments, highlighting the advantages for born-in-the-cloud companies and the hurdles for traditional enterprises in modernizing their infrastructure. The conversation delves into the importance of networking and regional considerations in cloud computing, stressing the need for agility while maintaining control over data centers. Gopi underscores the significance of setting realistic service level objectives (SLOs) and the need for architecture reviews every two to three years to surface tech debt and improve operational efficiency. He points out the critical role of end-to-end observability in managing service level agreements (SLAs) across various cloud providers. Finally, Gopi advocates for a focus on application-level metrics and maintaining a transparent partnership with cloud service providers to address business challenges effectively.

Key Takeaways

  1. Cloud Infrastructure and Multi-cloud Environments: Enterprises are adopting hybrid and multi-cloud environments to leverage scalability and better economics, but they face challenges like loss of control over physical layers and complexity in application migration. Constant updates and regional optimizations are essential for maintaining efficiency and reducing latency.
  2. Networking and Security Expertise: Gopinath Balakrishnan's extensive experience in network security, data center technologies, and cloud connectivity at Google highlights the importance of robust networking solutions in ensuring smooth transitions to cloud environments, crucial for managing data flow and maintaining consistent user experiences across different platforms.
  3. Business-Centric Decision-Making: Understanding core industry-specific problems is vital in making informed decisions about cloud adoption. Regular architecture reviews that focus on the business and application layers can help organizations align their technology stack with their strategic goals, avoiding technology lock-in and enhancing operational efficiency.
  4. Collaboration and Co-Creation: Partnering with technology providers and collaborating with digital-native companies can help address complex needs through co-created solutions. These partnerships are key to modernizing infrastructure, optimizing applications, and maintaining service level objectives (SLOs) across diverse cloud environments.
  5. Continuous Review and Modernization: The pace of technological innovation necessitates frequent reviews and updates to the architectural and infrastructure stack. Companies must balance the need for modernization with the flexibility to adapt to new technologies, such as AI, to stay competitive and meet evolving business demands.

Transcript

Tech Decisions Episode 4

===

Louis-Victor Jadavji: [00:00:00] Welcome to the Tech Decisions podcast, where we explore the world of enterprise technology and break down complex choices and strategic insights. Each episode dives deep into tech solutions to help you make informed decisions for your organization. Whether you're a seasoned IT professional or a curious newcomer, Tech Decisions is your go-to resource.

Tech Decisions is sponsored and hosted by the team at Taloflow, the leading technology selection platform to evaluate vendors for your exact use case. I'm joined by my co-host Abhishek Singh, manager of enterprise systems integration at Toast and former principal analyst at Gartner, and also by my co-founder, Todd Kesselman, who has some insights he'd like to share on the call today as well.

Our guest today is Gopinath Balakrishnan, also known as Gopi. Gopi is a cloud architect at Google, where he works with [00:01:00] strategic accounts to architect scalable, performant, and secure hybrid multi cloud application infrastructure. With deep expertise in hybrid and multi cloud connectivity, network security, data center technologies, and more, Gopi brings a wealth of knowledge to our discussion today.

Thank you all for being here.

Abhishek Singh: Can you tell us about your background and can you tell us about your experiences with cloud infrastructure?

Gopinath Balakrishnan: Yeah. First of all, thank you so much for having me; it's my pleasure to share some of my experience working with companies in the Valley. I started my career working on the computerization of banks in India. We were using all HP or Unix systems with dumb terminals. I moved on to building OEM networking solutions for mobility, service provider networks, and data center networking. And then here at Google, I work with Series B to [00:02:00] D growth-stage startups and, at the same time, with large enterprises. I help them navigate different cloud technologies and solutions, make informed decisions for their business needs, and build, operate, and manage their applications at global scale.

So I'm really thankful to have so many leaders from different companies to work closely with, to understand their business problems, and to be able to work together and co-create a solution that addresses them. Of course, there's no one size fits all: different industries have different use cases and need different solutions and approaches.

So it's very interesting to work with digital-native companies, which are born in the cloud, and to see a complete contrast in problems, culture, and mindset when you work with [00:03:00] large enterprises, because of the nature of their business, the volume and scale at which they operate, and the added complexity that comes with it, right?

There are only two caveats in this whole thing. One, this is my experience working with companies like these; every industry and use case could be different, so staying focused on your business problems and the core industry that you are in is important. Second, I am not representing Google here, so the experiences that I share are purely my own opinion, not Google's.

Abhishek Singh: So you recently published a blog post on Packet Pushers where you discuss some of the key considerations for enterprises moving towards public cloud, and in your own experience you might have seen, or might have helped, companies move towards a public cloud environment.

So what motivated you to write about this topic? Because I think [00:04:00] most companies are following the lift-and-shift model or some other model to move towards cloud. So what motivated you to write about this topic and educate the industry about it?

Gopinath Balakrishnan: Yeah, it's actually interesting. It started as just writing some notes on my phone during some winter time off in 2023, last year. The main motivation, if you look back, and I think it's no surprise, is that more and more enterprises are adopting this hybrid multi-cloud infrastructure to build and deploy their applications and deliver them to a global audience of users.

I've noticed similar patterns of challenges and approaches regardless of their business and core applications, right? So it just started as two bullet points, and I expanded it, writing up much more detail. The core reason for that is [00:05:00] that, although there are a lot of podcasts and a lot of documentation, and a lot of experts have shared what the best strategies for hybrid multi-cloud environments are, the practical challenges of adopting cloud are often overlooked, especially if you're an enterprise that has a large on-premises environment and is trying to expand it to a hybrid multi-cloud environment. The challenge is that IT leaders have to provide the same consistent user experience regardless of where their backends and applications are running, and provide the same reliability and scalability for the platform. That adds more complexity at the scale at which they need to operate.

So this was the starting point, and then, obviously, I went on to detail some of my [00:06:00] experience, not necessarily covering everything, a 101 on different approaches that can solve some of these challenges. I'm sure there's more to cover, but I only covered a few topics that I found to be closest to what companies should look at, so that they don't ignore the importance of those practical challenges as they make their decision towards adopting a hybrid multi-cloud environment for their company and business. At the end of the day, these challenges are crucial to their operational overhead and to making sure that the infrastructure they actually run gives them the best unit economics.

Louis-Victor Jadavji: And speaking of which, I think one of the points you raised in that article was the actual cost of lifting and shifting applications to the cloud without refactoring. Can you elaborate [00:07:00] on this challenge and what enterprises should keep in mind if they're doing so without a refactor?

Gopinath Balakrishnan: Yeah. To be honest, it's really amazing. I can see these huge enterprise applications serving millions of users globally, running on this hybrid multi-cloud environment. On one hand, enterprises have full control of their data center environment, right from the physical layer all the way up to the application layer.

On the other hand, when they move their applications to public cloud infrastructure, they not only lose control of the physical layer, but also, to some extent, let go of their visibility. Right, but it still powers massive applications that run very successfully, even though the same application stack is running in a distributed fashion across these hybrid [00:08:00] multi-cloud networks, right?

It is amazing. On the other hand, there are companies who are born in the cloud or very cloud native. They're able to take good advantage of what the cloud can offer and get better unit economics at scale. So on one hand, companies have to deal with, "I have to get to the cloud faster to get access to these newer technologies and solutions and also provide much better business value to my investors and users."

On the other hand, they also have to think about modernizing or even optimizing their infrastructure to get to the cloud. So it's more like a catch-22 problem at times. I'm seeing a lot more adoption in the right direction, where you take a lift-and-shift approach, get to the cloud, get access to these tools and technologies, and [00:09:00] ship features and products to the customers.

And also come up with a strategy to optimize and right-size, and then move to modernization later, right? So the challenge becomes that the cloud and other public service providers have to meet enterprise apps where they are today. They're not necessarily modernized, and not everyone is a cloud titan that is completely distributed with a modern application stack, and it's not easy at all.

So the cloud needs to meet customer and enterprise applications where they are today and provide that level of end-to-end visibility as enterprises go through lift and shift and split their applications across different environments.

Louis-Victor Jadavji: And, but I think in the article, you made some specific references to the networking challenges.

Gopinath Balakrishnan: Yeah. Like I said, I think there are multiple layers to the problem, [00:10:00] but networking is one of the key aspects. When you come from data center networks, you have full visibility, and you have a lot of overprovisioned network capacity that can absorb even the peak demand you're overprovisioning for.

When it comes to cloud, this is, again, a large multi-tenant environment hosting multiple customers who share that infrastructure. So that's fundamental; there's no rocket science here, right? Understanding the users you are serving and their different geolocations, and coming up with the geo deployment and regional selection, becomes critically important to provide a better user experience and a faster round trip for your users and applications.

At the same time, reducing the end-to-end application latency is critical, right? And in addition to that, when everything runs [00:11:00] perfectly fine, you have chosen a few deployment regions and you're running fine. Oftentimes, what gets ignored is the ability to run these regional applications self-sufficiently within that region.

Because you can't take an application with a three-tier architecture and split it across three different regions, allowing the traffic from one user journey to go over multiple regions. That increases your network traffic, round-trip time, and latency, and a failure of one region will pretty much impact every other customer use case.

Coming up with a more optimal regional selection and being self-sufficient, including in disaster recovery situations, is critical from a networking point of view.
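
As a rough illustration of the point about keeping a three-tier application self-sufficient within one region, the Python sketch below compares two placements of the web, app, and database tiers. The region names, the round-trip times, and the helper names (`hop_rtt`, `request_path_cost`) are invented for this example and do not come from the episode or from any cloud provider's published figures.

```python
# Hypothetical inter-region round-trip times in milliseconds (made-up values).
RTT_MS = {
    frozenset(["us-east", "us-west"]): 60,
    frozenset(["us-east", "eu-west"]): 80,
    frozenset(["us-west", "eu-west"]): 140,
}
# Assumed round-trip time between two tiers inside the same region.
INTRA_REGION_RTT_MS = 1


def hop_rtt(region_a, region_b):
    """Round-trip time for one tier-to-tier hop."""
    if region_a == region_b:
        return INTRA_REGION_RTT_MS
    return RTT_MS[frozenset([region_a, region_b])]


def request_path_cost(placement):
    """placement lists the regions hosting the web, app, and db tiers, in order.

    Returns the network round-trip time added by the tier-to-tier hops and the
    set of regions whose failure would break this request path.
    """
    added_latency = sum(hop_rtt(a, b) for a, b in zip(placement, placement[1:]))
    failure_domains = set(placement)
    return added_latency, failure_domains


self_sufficient = ["us-east", "us-east", "us-east"]
split_across_regions = ["us-east", "us-west", "eu-west"]

for name, placement in [("self-sufficient", self_sufficient),
                        ("split", split_across_regions)]:
    latency, regions = request_path_cost(placement)
    print(f"{name}: +{latency} ms per request, depends on {len(regions)} region(s)")
```

With these assumed numbers, the split placement adds far more round-trip time per request and depends on three regions staying healthy instead of one, which is the trade-off described above.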

Abhishek Singh: So there are many enterprises that want the agility of the cloud, [00:12:00] but they don't want to give away control of their own data centers. So is this something that's realistic to expect? And if so, what are the trade-offs involved in such situations?

Gopinath Balakrishnan: Yeah, there is no simple answer to this question.

On one hand, IT leaders want to be agile and build these applications across multiple environments. But when it comes to cloud, it becomes a black box, to be honest, to a certain extent, especially at the lower layers. And it's no surprise: clouds are massive infrastructure built and managed by large operations and SRE teams who constantly update and upgrade the lower systems and keep that infrastructure available all the time, for very good reasons. You want that benefit [00:13:00] from the cloud, but at the same time you have to know that you must let go of such low-layer controls and focus more on higher-layer application and business metrics.

So that's a trade-off you have to come to understand, because if you keep holding on to both, it becomes a very tight situation. On the other hand, with public cloud offerings and different hybrid environments, enterprise IT leaders are trying to provide a consistent and better user experience with better performance at scale.

They need to have end-to-end visibility because their infrastructure extends beyond their data centers now. There has to be the right balance: enterprises give up control at the lower layers and focus on building SLOs and [00:14:00] metrics at the application and business layers. At the same time, they need to be provided with end-to-end visibility when there are impacts to their application traffic too, if that makes sense.

Todd Kesselman: I'm curious when I hear refactor and I hear the infra versus cloud I also hear turf war and politics given our experience. So I'm just curious what the organizations that have done it how do they refactor their organization to do this? Or do they refactor their organization? What changes do you see to implement what you're suggesting?

Gopinath Balakrishnan: Yeah, so with refactoring especially, I've seen both ends, right? I've worked with an API platform company which completely modernized its app stack in a matter of three to four months, and they saw the value in doing that. It was still a Series D growth-stage startup, [00:15:00] and they were able to completely modernize their legacy infrastructure and application stack using Kubernetes and a microservices architecture. On the other hand, I am seeing large enterprises that have to get to cloud and have a time frame to modernize, but they need to come to cloud faster so that their engineering effort and resources can be better utilized on building new features and shipping features and solutions to the customers.

So there is no one right answer, and depending on the company size that you're looking at and the industries they are working in, there has to be a balance. The best way I can think of to do that is to frequently come back to an open, transparent partnership and co-creating [00:16:00] with the cloud service provider: what business value can you get, and what business problem are you trying to solve? That becomes much more crucial in this space. So rather than taking a technology-savvy or technical angle, try to focus on what business problems you want to solve and how to partner with the cloud service provider to co-create solutions that can solve them. Does that answer your question, Todd?

Todd Kesselman: In some respects. So you're saying the partner cuts through some of the internal politics? Because I was asking about within the organization: you're making dramatic change, and you're suggesting changes where it's not just attaching the cloud stuff on the side. You're suggesting you have to go back and look through what everybody's doing, and maybe modernize that. So I'm just saying that, organizationally, that can be difficult.

Gopinath Balakrishnan: I completely agree. Again, it's the cultural aspect. I think that is [00:17:00] something that's much harder, and it's interesting you bring this up as well.

The companies that I work with, especially in Silicon Valley, have more of a build-first mindset and culture, as opposed to a buy or partner strategy. So the leaders that I work closely with understand the problem, and they're the ones driving the decision of how to better position their engineering resources, because there's so much to build. As leaders, they have to prioritize where to invest their resources and people to focus on building newer solutions, and let go of some of the legacy. I completely agree that it's a complicated situation, but I often find that the leaders I work with are very good at understanding the pain points and weighing what they can leverage [00:18:00] from partnering with a technology partner against what they should focus on building themselves.

Louis-Victor Jadavji: So I think in my conversations with you and I think you've emphasized a lot the importance of regularly reviewing these architectural choices in the process and in the cloud. And, what is the right cadence for enterprises when conducting these reviews of these architectural choices?

What do you see work? What do you see as too frequent or impractical? Some color on that would help the audience understand how they should build these practices in.

Gopinath Balakrishnan: Yeah, I think before answering the question of how frequently someone should do an architectural review, I want to quickly point out why you should do those architectural reviews, like what's the business benefit, right? So the fundamental questions that you need to figure out are: does your [00:19:00] current application or infrastructure stack allow you to ship and build features faster? Do you have access to the latest and greatest solutions and tools? Are you able to achieve economies of scale with your current infrastructure, with a better margin?

And will your current technology stack or architecture support both the current and the next two to three years of growth projections that your company has? If you can answer these critical questions quickly, without necessarily getting into an architecture review with a partner, customer, or cloud, then prioritize: go after the big areas where you think this will help accelerate your company's growth and also solve your problems for the next two to three years.

Usually, what I've seen is anywhere from two [00:20:00] to three years. That's the time frame in which I ask customers these critical questions, and that allows them to understand, okay, maybe there's an area where we need to focus and prioritize. And that's when going into an architecture review helps: first, it gives them visibility into what tech debt they have, what operational efficiencies they can improve, and what new tools and solutions they can adopt to simplify the problem.

The answer is somewhere between two and three years, but more importantly, ask why you need to do this, as I covered before.

Louis-Victor Jadavji: Of course. And but why the 2 to 3 years? Why not every 6 months? Why not every 5 years? What's special about that 2 to 3 year window? Is it that the pace of innovation on the cloud kind of has that sort of cycle where, there's new technologies you have to evaluate at that time?

What factors into [00:21:00] the 2 to 3 year window?

Gopinath Balakrishnan: Absolutely. I think it's the pace of innovation: every two to three years is where I see a huge difference in the solutions being offered at every infrastructure provider layer, right? And with the innovations now, especially with AI, that's much more important.

Of course, new solutions are being put out. Over two years, the cloud is offering new services and changing existing ones. Take that into consideration over two years, and then allow another six months, or the third year, to do the architecture review, so that you are actually taking the new set of data and solutions being offered to you into consideration. Simply going back to the same set of tools you considered two years before is not going to be helpful.

Louis-Victor Jadavji: Got it. So that window is, it's really externally driven, right? What are the changes in the [00:22:00] offerings in the market? And you mentioned gen AI. So the pace of innovation there and new services popping up is arguably much faster than what we've seen in other years before this wave.

Could you argue that now architectural reviews have to be done more frequently or on an urgency basis due to the rapid pace of innovation in gen AI and how companies have to reposition themselves due to it? Are we at one year now, for example?

Gopinath Balakrishnan: I would think the fundamental architecture reviews that I am talking about focus on the core infrastructure and the application infrastructure hosted on it, but AI infrastructure, and what is specific to that, is a completely different angle.

I think it could be 6 months. It could be an even smaller window, because I see that space is a rapidly growing and changing landscape every 6 months. You're right.

I think my answer and recommendation was more on the core fundamental infrastructure and [00:23:00] application angle. Does that answer your question?

Louis-Victor Jadavji: Got it, makes sense. So in this respect, we would look at the architectural review cadence in more of a decoupled way. So I understand. I would think so. Okay.

Gopinath Balakrishnan: Todd, would you agree with that?

Todd Kesselman: Sure, I think there's the core structure, right? And then there's the bells and whistles, if you will, all the stuff on the side. I think generative AI right now is the stuff on the side. We'll see what happens if it completely changes the stack, which it might, depending on the data requirements and the compute requirements, but I'm not even sure we're clear where that's going to end up yet.

I'm not sure I would completely change my stack till I knew what my application was and what the demands were on it. I think I would, of course.

Louis-Victor Jadavji: Okay.

Abhishek Singh: And I think that there's a

Gopinath Balakrishnan: lot of happening.

Todd Kesselman: But I was going to say, what that made me think of is, [00:24:00] it's interesting because when you talk to people who are in the middle of making the decision, they have a hard time making the trade-off between extensibility and technical lock-in versus flexibility.

And so if you're going to review every 3 years, but you've got a completely locked-in stack, I'm not sure, other than being very frustrated, what you're really going to do about it. So it's just interesting how people make the decision on the front end when they're deciding on the stack in the first place.

They probably don't weigh those things as much as they should.

Abhishek Singh: So does that mean we should be going ahead with continuous evaluation and re evaluation of the existing stack every six months, every year, just to prevent technology lock in?

Gopinath Balakrishnan: No, like I said, six months, or even two years, to be honest with you, is too much of an ask when you're trying to do a full-blown architectural review and make choices.

Yeah, you never know, you may not even act on it. Have some meaningful [00:25:00] approach that best meets your business needs; two to three years is reasonable enough to get an idea and identify the one or two critical items. You may end up with ten improvements, but what are the top one or two items that you can go ahead and implement in the next year or so, knowing that they are going to solve some of the business challenges that you are facing today and will possibly run into in the next 6 months or 1 year, considering your growth projections, right?

So definitely weigh in based on that. 6 months is definitely too much, in my opinion, if you decouple AI from the other areas of infrastructure. With AI, we are actually at the very beginning of the space, and that might require drastic and rapid decisions and discussions around how you approach it. It's a slightly different area that we need to take into consideration. Yeah, [00:26:00] sorry.

Abhishek Singh: Go ahead, you go ahead. I insist. So I had a different question, and you touched upon it earlier. I think most enterprises focus on application latency, SLOs, and threshold criticality. So what is the importance of these metrics, and how should they collaborate with the cloud providers to set up realistic SLOs and build a successful partnership with cloud vendors or cloud providers in general?

Gopinath Balakrishnan: Yeah, this is something that I've often dealt with.

I work with enterprises that have a lot of enterprise applications running in on-prem data centers. They pretty much have stringent SLOs and SLAs, both for latency and for different metrics at the lower layers. But as you expand [00:27:00] that application beyond your network, you rely on connectivity to the cloud, and then, within the cloud, there's a massive network infrastructure that connects various components across multiple regions. And then you also have connectivity to other clouds. So you have infrastructure all over the place.

Going with an approach of, okay, this is what I observe in my data center, and trying to get the same thing, does not work, because at the end of the day, your data center fabric is not the same thing that runs across the public cloud infrastructure, which is at a different scale. It is huge, right? You're now thinking about 100x scale in terms of the infrastructure footprint and connectivity; I don't think any enterprise today can build that level of global connectivity and infrastructure.

It's important to focus more on revisiting your business metrics [00:28:00] and understanding, okay, these are the SLOs that I have to meet; how do I work with a cloud provider to co-create the best regional selections, and what are the disaster recovery and availability requirements that I must meet? And the cloud offers some SLOs to meet that.

Having a joint review, making sure that you can meet your internal SLOs and application needs, and building that SLA with the cloud is a much more fruitful conversation. Saying, this is what we have and I need to ensure the same level of SLOs and SLAs across the hybrid multi-cloud environment, becomes a tougher conversation.

And that's where the challenge I mentioned comes in: the cloud also needs to offer that end-to-end visibility, not just on connecting to the cloud. Sometimes customers find they [00:29:00] don't have much visibility within the cloud, right? And now customers are also relying heavily on the Internet.

So the enterprise needs to have end-to-end visibility from on-prem connecting to the cloud, and then from cloud applications to the user on the Internet. All of this has to be taken into consideration in building mutual, joint SLOs that absolutely need to meet their business needs.

Todd Kesselman: I'm just curious. So you're actually talking about an SLO that's going to cut across multiple providers. So you almost need an allocation within that before you're done anyways. You're talking about basically end to end from the organization's perspective, not from the provider's perspective.

The provider is always gonna talk about it from the provider's perspective.

So you need, in order to actually manage it, you're gonna need some allocation between the [00:30:00] providers.

Gopinath Balakrishnan: I completely agree. I have seen patterns of different observability strategies that customers have.

They work on building end-to-end observability, or use third-party agents that can provide a level of metrics and telemetry that not every cloud may offer, right? For example, I've seen some massive enterprises build end-to-end observability on top of every provider environment, which is agnostic to the underlying infrastructure itself, right?

But there are other customers who rely on taking what the cloud offers from a telemetry perspective and then applying it to the whole observability stack that they've built. Yeah, so it's a combination of both. I've seen customers leveraging what the cloud offers and, at the same time, adding different observability and telemetry tools where that makes sense.
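
To put numbers on the idea of an end-to-end SLO that has to be allocated across providers, here is a minimal Python sketch. It assumes the segments sit in series on the request path and fail independently, so the end-to-end availability is the product of the per-segment availabilities. The segment names, the availability figures, and the 99.9% target are hypothetical and are not quoted from any provider's SLA.

```python
import math


def composite_availability(segments):
    """End-to-end availability of segments in series, assuming independent failures."""
    return math.prod(segments.values())


def downtime_minutes_per_30_days(availability):
    """Downtime allowed by an availability figure over a 30-day window."""
    return (1 - availability) * 30 * 24 * 60


def equal_allocation(target, segment_count):
    """Per-segment availability needed if the target is split evenly across segments."""
    return target ** (1 / segment_count)


# Hypothetical availabilities for each segment of the request path.
segments = {
    "on_prem_to_cloud_link": 0.9999,
    "cloud_provider_a": 0.9995,
    "cloud_provider_b": 0.9995,
    "internet_edge_to_user": 0.999,
}

target = 0.999  # assumed end-to-end SLO from the organization's perspective
end_to_end = composite_availability(segments)
meets_target = end_to_end >= target

print(f"end-to-end availability: {end_to_end:.5f}, meets target: {meets_target}")
print(f"each segment would need about {equal_allocation(target, len(segments)):.5f}")
print(f"budget at target: {downtime_minutes_per_30_days(target):.1f} min per 30 days")
```

Under these assumptions, each segment looks fine on its own, but the combined path falls short of the end-to-end target. That gap is why the per-provider allocation and the organization-level view both need to be part of the joint review.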

Louis-Victor Jadavji: I think this would be a good time to [00:31:00] summarize some of the key takeaways from our discussion with Gopi today. I've written down a few and will mention them, but anyone, please feel free to add more. First, you mentioned that networking is a major concern and comes with a new set of problems in the cloud that the data center crowd has to pay a lot of attention to. Second, in these migrations, enterprises moving from data centers to the cloud should put more focus on metrics and SLOs at the application level rather than at the lower layers.

And thirdly, we spent a bit of time talking about cadence: 2 to 3 years seems to be the right cadence for architectural reviews. Despite the changes in the GenAI world being really fast, you would stick with it, and the right way to approach it is to decouple the AI and core architectural decisions, where the former might need more rapid discussions, maybe every 6 months as a sweet spot, but for the core stuff, 2 or 3 years is pretty set.

And even that might be a [00:32:00] bit aggressive at times. What else should we add to the mix, folks?

Gopinath Balakrishnan: I just want to add one thing, right? Building scalable, secure multi-cloud and hybrid infrastructure is complex and not easy. Staying focused on solving the business problem and coming up with a very open, transparent partnership with the cloud service providers will lead to much more successful adoption and make sure that your applications are delivered successfully.

Louis-Victor Jadavji: Thank you so much, Gopi. We really are grateful for you sharing your insights and expertise. Thank you, Abhishek and Todd, for helping me host this discussion. And for our audience, if you're looking for help with technology decisions, be sure to check out Taloflow to streamline your vendor research process.

Thank you everyone.