#98 - Shared Storage for Scale-out Databases with Walt Hinton from Pavilion Data

In this third recording from Dell Technologies World 2019, Chris talks with Walt Hinton, Head of Corporate and Product Marketing at Pavilion Data Systems. The topic of conversation is the alignment of shared storage with scale-out NoSQL databases that typically place storage components on separate local disks.

As rack-scale computing becomes more widely adopted, local storage will migrate back into bottom-of-rack shared storage infrastructure. In this instance, does it make sense to put shards of databases back on the same hardware? Walt thinks it is the next logical step in optimising scale-out applications and introducing what could be classed as SAN 2.0. Shared rack-scale storage provides the capability to reduce east-west rebuild traffic and the associated risk exposure that failures introduce.

Walt also discusses performance improvements that have been seen with the introduction of release 2.2 of the software that drives the Pavilion storage platform – 90GB/s write throughput at 40µs latency. You can find out more about Pavilion Data and their NVMe-oF Storage Platform at https://paviliondata.com/.

Elapsed Time: 00:17:58

Timeline

00:00:00 – Intros
00:00:50 – What is the Pavilion Data Platform?
00:01:30 – How are requirements for shared storage evolving?
00:04:00 – Rebuild failures become a problem with scale-out architectures
00:06:30 – Disaggregation with NVMe enables rack-scale architectures
00:08:30 – Should we be reconsolidating sharded database applications?
00:11:30 – What are the options in rack-scale designs?
00:13:30 – Release 2.2 of Pavilion software delivering greater performance
00:15:00 – Disaggregated solutions aim to get out of the data path
00:17:00 – Wrap up

Transcript

[bg_collapse view=”button-orange” color=”#4a4949″ expand_text=”Show More” collapse_text=”Show Less” ]

Chris Evans: This is Chris Evans recording another Storage Unpacked podcast at Dell Technologies World. This time ’round, I’m joined by Walt Hinton. Walt-

Walter Hinton: Hello, Chris.

Chris Evans: Good to see you again.

Walter Hinton: Good to see you as well.

Chris Evans: Are you enjoying the event so far?

Walter Hinton: It’s been really busy. It’s terrific, really nice to see Dell taking such a solutions approach to their overall portfolio.

Chris Evans: Yeah. Great. So, you’re from Pavilion Data. You’ve been on the podcast before. We had a discussion before when we were [crosstalk 00:00:31]-

Walter Hinton: San Jose.

Chris Evans: San Jose, Santa Clara.

Walter Hinton: Yes.

Chris Evans: I can never remember exactly which bit your office is in. Obviously, Pavilion’s got an architecture based around… Well, it’s shared storage, but obviously shared storage in a certain way.

Walter Hinton: Yes.

Chris Evans: Do you want to just give people a quick reminder about the technology and so on?

Walter Hinton: Sure. So, we’ve taken the concept of all-flash, but done it in a very specific NVMe, from the ground up, architecture. With NVMe being such a high performance type of storage medium, you really need to have a number of controllers that match that NVMe drive capacity. So, we have an array that has up to 20 storage controllers, 40 100-gigabit InfiniBand or Ethernet connects, and then 72 NVMe drives on the back end that we provide storage and data management services on.

Chris Evans: Okay, great. One of the things that’s really interesting looking at an event like this, where we’re talking about cloud, is that in cloud, we’re seeing new architectures come along, or at least new ways of deploying technology like databases. Previously, traditionally, we might have put them into a virtual machine, and then that virtual machine becomes the object that gets moved around. But obviously, technology has changed a lot in terms of how we deploy databases with things like NoSQL.

Walter Hinton: Right.

Chris Evans: We see now, people are sharding data, building servers where they put local storage in. But from a scaling perspective, that’s not necessarily going to work very well for people. Things like NVMe are going to help solve that, aren’t they?

Walter Hinton: Yes. In fact, so I’ve been working in performance storage for a very long time, brought forward some of the first PCIe SSDs inside servers. Brought forward the first couple of generations of NVMe, and it was revolutionary. We moved from a traditional SAN architecture to these new scale-out models where we basically put storage inside the server and add nodes in order to achieve performance or capacity. It brought forward a whole lot of new database architectures, new companies, and new business models. But, yeah. We’re starting to run into some issues there.

Chris Evans: Yeah. I think it’s a really interesting scenario, that. We had SAN, which was [crosstalk 00:02:50]-

Walter Hinton: Exactly.

Chris Evans: … my thing from, like, 20 years ago. We resolved a lot of issues with SAN, because we had issues of sprawl and servers that had drives in them originally, and there was a lot of maintenance, and overhead, and waste because of that. As an example of that, obviously drives got to a huge size.

Walter Hinton: Right.

Chris Evans: So you could be putting drives into a server and having to put RAID on there, and use it in a distributed fashion in order to get performance, but wasting a massive amount of capacity.

Walter Hinton: Exactly.

Chris Evans: So, we resolved things like that with SAN, but we’ve gone through, I guess, a pendulum swing, and we’ve gone away from that. We’ve ended up with servers that have got data and storage within them individually, like you just said. That was a great thing, because that improved the performance. But now we’re swinging back to that need to somehow centralize in order to gain those efficiencies of use again, and optimize the actual capacity that we’ve got available. I mean, that must be one of the biggest issues currently today.

Walter Hinton: It really is. I’ve had a chance now, being with Pavilion for about five months, to meet with some of the largest hyperscale companies on the planet. They’ve told me, almost every single one, “We’ve set our architecture where we won’t deploy an SSD that’s larger than two terabytes.” I thought, “Now, why on earth would you do that?” Because you become really inefficient. You’re deploying more servers, you’re deploying more SSDs. NVMe is the most expensive storage you could buy.

Chris Evans: Yeah.

Walter Hinton: And the answer was rebuild times when a node fails. So, depending on the size of the cluster, the numbers of shards, you get to a point where across that cluster, rebuilding that node, it can take as much as 25 minutes for a single terabyte to be rebuilt. So a two terabyte drive, you’re almost at an hour. All that time, I’m pulling data from each of the sharded nodes, so I’m impacting application performance while I’m trying to rebuild the drives.

Chris Evans: That’s right, ’cause it’s east/west traffic that’s going back and forth between those nodes.

Walter Hinton: Exactly. Yes.

Chris Evans: That was one of the issues, if anything, that HCI had, that you were introducing that additional traffic that could be there. I guess when you get into an environment where you’ve not just got one or two drives, where you’ve got tens of thousands of drives, you’re always in a rebuild scenario. There’s always something going to be happening somewhere.

Walter Hinton: Well, the other thing that I’ve seen now is… And of course, coming from Western Digital and offering NVMe drives, we see people putting multiple drives in a server, doing some form of RAID at server level. Well, same problem. Now, all of a sudden, I’ve deployed 50% of my capacity just for data protection purposes. So it leads to this whole idea that… Disaggregation has to happen, and drive makers are… In order to give you more IOPS, or more throughput, they’re going to have to pack more NAND into that two-and-a-half-inch form factor. So all of a sudden, capacities are going to continue to grow. I mean, we’ve got Toshiba already shipping 15 terabyte NVMe drives. We know 30 terabyte are right around the corner. If it takes 25 minutes to rebuild a single terabyte, and I’m at 15, this node recovery issue is a serious problem.

Chris Evans: Yeah, I agree. And you are significantly at risk. Like, you used to be… The same problems that we’ve seen time and time again come back up again. You’re exposing that risk of a second failure, or another failure, or some other type of failure while you’re trying to do that rebuild.

Walter Hinton: Yeah.

Chris Evans: And clearly one of the things that isn’t necessarily obvious is: if you have a failure in one server and you’re relying on data from other servers to do that rebuild, and they have some other issue, your data rebuild process is lost at that point.

Walter Hinton: Everything just compounds.

Chris Evans: Yeah.

Walter Hinton: So, we’re seeing various solutions coming out in this category called disaggregation, where I now put a number of NVMe drives into a chassis. I put that at the bottom of a rack. It may serve one rack, multiple racks. But you still have the issue of, if I’ve got an architecture that’s a traditional sort of all-flash approach, and I’ve got two controllers, yes, I can rebuild the drive that fails, and I can do that in a way that doesn’t impact east/west or north/south traffic. But I’m still kind of tied to how fast can that controller rebuild that drive across its volume? The thing that’s really different about Pavilion, and I didn’t get this until I met with some of these customers, is that because we have this sort of parallelized controller architecture, where I can go up to 20 controllers in a single system, I can gang together controllers to rebuild the drive.

Walter Hinton: So what we like to say is, “Figure about five minutes per terabyte with a Pavilion solution.” Because it’s behind some number of controllers, I’m not taking any impact from the servers that have their sharded data, because that sharded data actually lives in a volume striped across NVMe drives, and with some number of controllers that can be swarmed together to do a rebuild.
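
The rebuild figures quoted in this part of the conversation (about 25 minutes per terabyte across a sharded cluster, about five minutes per terabyte on the shared array) can be worked through with simple arithmetic. The sketch below is not from the conversation. It assumes rebuild time grows linearly with drive capacity, which real clusters will only approximate, and the constant and function names are illustrative.

```python
# Minutes needed to rebuild one terabyte, as quoted in the conversation.
SCALE_OUT_MIN_PER_TB = 25  # rebuild across the nodes of a sharded cluster
SHARED_MIN_PER_TB = 5      # "about five minutes per terabyte" on the shared array


def rebuild_minutes(capacity_tb, minutes_per_tb):
    """Estimated rebuild time in minutes, assuming linear scaling (an assumption)."""
    return capacity_tb * minutes_per_tb


# Drive sizes mentioned in the conversation: 2 TB, 15 TB and 30 TB.
for capacity_tb in (2, 15, 30):
    scale_out = rebuild_minutes(capacity_tb, SCALE_OUT_MIN_PER_TB)
    shared = rebuild_minutes(capacity_tb, SHARED_MIN_PER_TB)
    print(f"{capacity_tb} TB drive: {scale_out} min scale-out, {shared} min shared")
```

Under these assumptions a 15 terabyte drive takes 375 minutes (over six hours) to rebuild at the scale-out rate and 75 minutes at the shared-array rate, which is the exposure window to a second failure that the speakers are describing.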

Chris Evans: So do you think people will look at it and think, “Well, hold on. We were sharding it at application level because there was a good reason for that. It gave us performance. It gave us resiliency. Then you’re sort of suggesting that we now reconsolidate that into a box that’s now got the shards in one central location.”

Walter Hinton: Right.

Chris Evans: Do you think people might look at that and think, “Well, that seems to go against the whole point of sharding in the first place.”

Walter Hinton: There is a school of thought that says, “Hey, I design. I’m a database architect, so I design my database with all these concepts in mind for resiliency, for blast radius, for et cetera.” And there’s no need to really change that. The only difference is the storage doesn’t sit inside the server. It actually sits bottom of rack. Again, if you’ve got an architecture like Pavilion, and there are others, and there will be others. But where you can have controller scale at the same time as you’re scaling the number of servers, with this disaggregated approach, if the application needs to be capacity-centric, design for capacity. If it needs to be performance-centric, design for performance. We happen to have a system where you can do both in the same thing. I can set aside some number of controllers and volumes specific for really high read performance. I can use others for endurance and heavy read/write workloads. I can even deploy some capacity NVMe as well for things like snapshots, clones, and backup.

Chris Evans: I think we’re moving to a situation where it seems, again, to make sense to centralize our storage. It’s all about what we centralize in terms of the features, and not necessarily centralizing everything that we used to do previously. I’ll give an example of that, because I think that might sound like a bit of a vague comment. But there are certain things it makes sense to centralize, like data protection.

Walter Hinton: Right.

Chris Evans: Implementing things like RAID. But other pieces of the technology that can be done better in the application, would be better left to the application to do. So you give out a LUN or a volume and let the application decide how it carves that up, for example.

Walter Hinton: Right, right. Yeah, and another great example, I get asked by industry pundits, “Well, where’s your replication?” Well, frankly, the applications are really good at that. Let the application manage your replication. One of the features that is important, and we’re seeing this in financial services with very large data lakes, traditional network backup doesn’t work well in a sharded cluster environment. Again, east/west, north/south traffic, and I’ve got a specific backup window. Once I’m at a petabyte or more in that data lake, it’s a real problem. So having consolidated storage where I can do snapshots, I can clone that snapshot and hand it off to a backup application, I don’t impact network performance. Effectively, it looks like zero downtime. It’s changing the backup window paradigm.

Chris Evans: Yeah, so there are certain features that make sense to keep in the storage. Let the application deal with the rest itself. Maybe that’s where we’re headed in the future. I’d be interested to understand what you think about the idea of rack scale in that scenario, because with the way you’re designing your product, you’re effectively saying… For example, you might have one storage unit per rack or two storage units per rack.

Walter Hinton: Right.

Chris Evans: But effectively you’re designing it so that any of the servers that are sitting in that rack can access any of that storage, and therefore you’re building what could be also called ‘composable infrastructure.’

Walter Hinton: Yes. Yeah, in fact, I would say that Pavilion fits in that category of composable disaggregated infrastructure as defined by IDC. Now, there are some things, though, that you also want to think about in terms of storage that services these applications in different ways. So, we’re very much about performance and latency. This week, we are talking about write performance. Nobody really wants to talk about write performance, ’cause the numbers don’t look so good. We’re doing 90 gigabytes a second parallel writes, 128 K-byte blocks, at 40 microsecond latency. Now, of course, this is on RDMA-based Ethernet, so go to TCP, your mileage will vary a bit.

Chris Evans: Yeah.

Walter Hinton: But still, NVMe over Fabrics, with those kinds of latencies, wow. That’s really interesting. At the same time, this is not a tier two or tier three storage solution. So we see things like VAST, where they’re implementing QLC NAND technology, which is much more about lower cost and longer term retention, sort of WORM-like technology, right [crosstalk 00:13:00]-

Chris Evans: Yeah, yeah.

Walter Hinton: Very different than where we’re really targeted, which is, this is about a better way than traditional what we call ‘direct attached storage,’ where it’s NVMe drives inside of servers, whether it’s one and then the sharding problem, or two and the poor utilization problem as you apply different RAID techniques.

Chris Evans: Yep. Okay. Very quickly then, as we come towards the end of our time, talk to us about performance announcements then, because we’re at the event. You’re presenting at the event and your technology’s here. You have, I think, just intimated about your performance improvements, but tell us what you’ve just announced.

Walter Hinton: Yeah. We’ve introduced release 2.2, which is really software. By the way, the whole platform is commodity technology. There’s no special ASIC or anything. This is Intel Broadwell, SSCs, standard gigabit, 100 gigabit Ethernet, or InfiniBand. But what we’ve done is really tune our software technology for very low latency. So 40 microsecond latency from server, across fabric, to the front end of the array, through the controllers, and all the way back into the RAID 6 volume. Write performance at 90 gigabytes a second, which is kind of unheard of. At the same time, we’ve added some other enhancements around our snapshotting. So we had a basic snapshot, which was good in sort of a file system environment. But when it comes to sharded scale-out databases, you have to maintain consistency.

Chris Evans: Yeah.

Walter Hinton: So now, all of our snapshots have this consistency group capability so that when I do take a snap, and I create a clone, and I hand it off to a backup application, you know that when you bring it back, you can bring that database back to its proper state.
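
To put the headline throughput figure in context, the short sketch below converts the 90 gigabytes a second in 128 K-byte blocks quoted in the conversation into operations per second, and compares it with the raw line rate of the 40 100-gigabit ports mentioned earlier. It is an illustration only: it assumes decimal gigabytes and 128 KiB blocks, ignores protocol overhead, and the names are hypothetical.

```python
THROUGHPUT_BYTES_PER_S = 90 * 10**9  # 90 GB/s, assuming decimal gigabytes
BLOCK_BYTES = 128 * 1024             # "128 K-byte blocks", assuming 128 KiB
PORTS = 40                           # front-end ports quoted for the array
PORT_GBITS_PER_S = 100               # each port is 100 gigabit


def writes_per_second(throughput_bytes_per_s, block_bytes):
    """Block writes per second needed to sustain the given throughput."""
    return throughput_bytes_per_s / block_bytes


def line_rate_gbytes_per_s(ports, gbits_per_port):
    """Aggregate raw line rate in GB/s (8 bits per byte, no protocol overhead)."""
    return ports * gbits_per_port / 8


ops = writes_per_second(THROUGHPUT_BYTES_PER_S, BLOCK_BYTES)
line_rate = line_rate_gbytes_per_s(PORTS, PORT_GBITS_PER_S)
print(f"{ops:,.0f} block writes per second")
print(f"{line_rate:.0f} GB/s aggregate front-end line rate")
```

Under these assumptions the quoted figure corresponds to roughly 690,000 block writes per second, against an aggregate front-end line rate of 500 GB/s.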

Chris Evans: Yeah. Okay. So it sounds like a lot of the evolutions you’re doing are coming from the software. One of the things, I think, that makes that really interesting is that if you look at the performance of the drives and then you add in any overhead, I think a lot of the modern disaggregated architectures are trying to get out of the way of adding more latency into that scenario. So optimizing the storage and keeping out of the way of the data is a real focus, I think, for these platforms.

Walter Hinton: And certainly a focus for us. We have some patents, in fact, around how we do reads and writes so that we’re not super heavy on DRAM. Keeps the platform inexpensive, but we establish connections, get out of the way.

Walter Hinton: One other thing that I think is important to talk about here, Chris, is there are different approaches to this disaggregation. One size doesn’t fit all. In some cases, a software-defined approach is appropriate for… Let’s call it a small cluster. But as you reach a scale, this whole idea of doing sort of bottom of rack, or we’ve got a customer that’s got 256 servers defined to a single one of our arrays-

Chris Evans: Right.

Walter Hinton: You need to have that sort of robustness that comes with a beefy hardware and software solution.

Chris Evans: Yeah. I look back at it and think that one of the benefits of any of the technologies… There’s an example, a technology that would’ve come from EMC that is now being sold by Dell, is that resilience and that quality in the actual product.

Walter Hinton: Yes.

Chris Evans: And I think that’s going to come back again because, as you said, 256 servers pointing to one storage solution… If that goes down, there’s a huge impact.

Walter Hinton: Exactly.

Chris Evans: So we still need to consider that, and make sure that our platforms are solid and reliable, I think.

Walter Hinton: You have to have fault tolerance built throughout. What I really respect about the team that created Pavilion’s system is these are industry veterans. They’re not trying to reinvent what was done with EMC Symmetrix years ago, but they’re using those same concepts that are fundamental to really good storage design, and doing it with NVMe from the ground up.

Chris Evans: Yep. Brilliant. Okay. Remind everybody where they can go and find information at.

Walter Hinton: Yes. We are Pavilion Data Systems, www.paviliondata.com.

Chris Evans: Okay, that’s great. Well, thanks for spending the time and catching up with us. Catch up soon.

Walter Hinton: Thank you very much, Chris.

Chris Evans: Thanks.

Walter Hinton: Cheers.

[/bg_collapse]
