#97 – Building Storage Using NVMe/TCP with Kam Eshghi from Lightbits Labs

Chris Evans | Guest Speakers

This episode was recorded live at Dell Technologies World 2019.  Chris talks to Kam Eshghi, VP of Strategy & Business Development at Lightbits Labs.  Lightbits has pioneered the development of NVMe/TCP, which allows standard Ethernet NICs to be used for NVMe-over-Fabrics storage traffic.

The ability to implement NVMe-oF without custom NICs (like RDMA) offers the potential to deploy NVMe fabrics in hyper-scaler and enterprise data centres with lower cost and complexity compared to existing solutions, but what are the drawbacks?  Kam explains how NVMe/TCP can be implemented on high-speed networks that even include multiple switch hops or operate between data centres.  The solution offers significant benefits with disaggregated and composable infrastructures where compute and storage can be mapped dynamically across the network.

Lightbits implements NVMe/TCP as software or as a storage array (SuperSSD), which includes an optional PCIe card to offload some more CPU-intensive tasks.  You can find more information at https://www.lightbitslabs.com/.

Elapsed Time: 00:20:24


  • 00:00:00 – Intros
  • 00:01:00 – What is NVMe/TCP?
  • 00:04:00 – Why use NVMe/TCP instead of RDMA?
  • 00:07:00 – NVMe/TCP will work across multiple network hops and switches
  • 00:08:00 – How is network performance managed?
  • 00:09:30 – How would NVMe/TCP be deployed?
  • 00:12:20 – Reference code and an SPDK is available for target-mode NVMe/TCP
  • 00:14:00 – How are enterprise customers using NVMe/TCP?
  • 00:16:30 – Teaser – vSAN could be enabled with NVMe/TCP
  • 00:17:40 – Wrap Up and a view of the future



Chris Evans:                  Hi, this is Chris Evans recording another Storage Unpacked podcast this week at Dell Technologies World. I keep wanting to call it Dell EMC World, but obviously I need to get out of that. I’m joined by Kam Eshghi. Kam, could you introduce yourself and the company and then we’ll start the conversation.

Kam Eshghi:                  Thanks, Chris, for having me. I am part of Lightbits Labs. Lightbits is an Israeli-based startup, about three years old, with offices in Silicon Valley and New York City. What we build is a software-defined solution for disaggregation, separating storage from compute, so you can scale them independently and get operational efficiencies and better utilization of your infrastructure. We do all that over standard TCP/IP. We pioneered a new approach called NVMe over TCP that allows any compute node to connect to any storage node within a large data center and provides storage at performance equivalent to direct attached storage.

Chris Evans:                  Okay, great. That’s a wonderful bit of background there to start us off. And obviously, for anybody who’s listening, this is about NVMe, and specifically NVMe over Fabrics, of which NVMe over TCP is a component. Now, just to set some background for some people who may not understand the technology in enough detail yet, because it’s really, I would say, quite an emerging, quite a new technology?

Kam Eshghi:                  That’s right.

Chris Evans:                  Obviously, NVMe initially was a device connected protocol for drives that were put into a server-

Kam Eshghi:                  Yes.

Chris Evans:                  … For example, and we’re at the Dell show, and Dell have had that, they said earlier today, for about three years in their server platform brand, obviously in laptops and so on. But, what we’re talking about here is being able to take that NVMe protocol and push that across a network.

Kam Eshghi:                  Exactly.

Chris Evans:                  And typically, we already see existing network protocols like RDMA and obviously-

Kam Eshghi:                  Fibre Channel.

Chris Evans:                  Fibre Channel and so on. They exist today, but obviously they do have a certain degree of issue with them in terms of the way we use them, don’t we?

Kam Eshghi:                  Exactly. Yeah, Chris, as you said, NVMe has actually been around for many years. It was originally defined back in 2010 as a direct attached host controller interface for PCIS [inaudible 00:02:14]. And then, a few years later, we started looking at, “Well, what if we were to extend NVMe to a remote pool of SSDs, and that way be able to separate storage and compute and do that over different Fabrics?” So, that’s how NVMe over Fabrics was born. Initially it was defined for RDMA and Fibre Channel Fabrics, which worked really well for smaller scale deployments. Fibre Channel, of course, has been around in enterprise for many years and for any sort of brownfield enterprise deployment, where you want to continue to use your existing Fibre Channel infrastructure, NVMe over Fibre Channel makes a lot of sense.

Chris Evans:                  Yeah.

Kam Eshghi:                  But then, we started, three years ago, and we worked with customers to help them with disaggregation of storage and compute, and we realized that NVMe over RDMA is really not an option when you’re deploying at scale. It’s not intended for data center scale separation of storage and compute. We came up with the NVMe over TCP approach. Our spec got the attention of Facebook. We did a demo with Facebook back in summer of 2017, where we showed how we can get latencies that are equivalent to direct attached and run it over standard vanilla TCP/IP. So, the model is, run the software on the storage server, don’t touch the client, because NVMe over TCP is now a standard. Don’t need to touch the networks, and you get volumes attached to compute nodes that behave as if … You know, from an application point of view, the performance is indistinguishable between direct attached and this disaggregated model.
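Editor’s note: the "don’t touch the client" model described above can be sketched from the client side. This is a minimal illustration, assuming a Linux host with the upstream NVMe/TCP driver and the nvme-cli tool installed; the helper name, the IP address and the NQN are hypothetical, and the function only builds the command line rather than running it. Port 4420 is the registered NVMe over Fabrics port.

```python
from typing import List


def build_nvme_tcp_connect(traddr: str, nqn: str, trsvcid: int = 4420) -> List[str]:
    """Return an nvme-cli argv that would attach a remote NVMe/TCP subsystem.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if not nqn.startswith("nqn."):
        raise ValueError("subsystem NQN should start with 'nqn.'")
    if not 0 < trsvcid < 65536:
        raise ValueError("trsvcid must be a valid TCP port")
    return [
        "nvme", "connect",
        "--transport=tcp",        # plain TCP, so no RDMA-capable NIC is required
        f"--traddr={traddr}",     # IP address of the storage server (target)
        f"--trsvcid={trsvcid}",   # TCP port the target listens on
        f"--nqn={nqn}",           # name of the subsystem (volume) to attach
    ]


if __name__ == "__main__":
    # Address and NQN below are invented placeholders, not real endpoints.
    argv = build_nvme_tcp_connect("192.0.2.10", "nqn.2019-05.com.example:volume1")
    print(" ".join(argv))
```

Once such a command succeeds, the remote volume appears to the host as an ordinary NVMe block device, which is why applications need no changes.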

Chris Evans:                  Okay, right. So, let’s dig down into why using NVMe over TCP, from a hardware perspective and a device perspective, is going to be more practical than RDMA. Now, in my simple mind, I think the obvious thing is that RDMA NICs are a more expensive piece of technology, as are the switches.

Kam Eshghi:                  Right.

Chris Evans:                  And, you would be going back to that day of Fibre Channel where when you built the Fibre Channel network it was very much separate hardware, separate switches, separate devices, and a whole set of management processes that went around knowing and understanding Fibre Channel.

Kam Eshghi:                  Right.

Chris Evans:                  It seems to me that if you go down the RDMA route you’re, to a certain degree, going down something similar.

Kam Eshghi:                  That’s right. Well, first of all, with RDMA you need an RDMA NIC in every single client, which is an issue for any deployments where they don’t have that already deployed. So, if you go into a large data center, let’s say a cloud service provider or webscaler, and you’ve got to change all 50,000 or 100,000 nodes to support RDMA NICs, that’s a big undertaking. Just right off the bat, that becomes a big hurdle for RDMA.

Kam Eshghi:                  Then, you have to make sure your network can support close to lossless connectivity between the compute and storage nodes which, again, is very difficult to do at scale. Whereas, with TCP/IP, you can have … We take our solution and set up POCs within a few hours, and you can connect compute nodes that are five or six networking hops away, and you don’t have any issues there. So, with RDMA, there is a networking infrastructure set up that’s required, which is extremely complex. And then, as you mentioned, there are also interoperability issues between different vendors. Some of that is going to be addressed over time, but still RDMA seems to be much more suitable for rack-scale storage as opposed to really connecting any storage node to any compute node in the data center.

Chris Evans:                  Right, okay. So, that’s an interesting distinction. Let’s just talk about that for a second. Would we see RDMA being a solution where people, perhaps, put in a top-of-rack RDMA switch, lots of nodes that have got RDMA NICs in them, and some storage somewhere to connect as a model? That’s one way of doing it from a rack-scale perspective, but you weren’t looking at that. You were looking at the idea that within my data center I might have storage anywhere, potentially-

Kam Eshghi:                  Exactly.

Chris Evans:                  … In different boxes, and servers sitting in rack one might want to connect with something that’s sitting in rack 26.

Kam Eshghi:                  That’s right.

Chris Evans:                  You know, a couple of hops away, maybe even a couple of racks away.

Kam Eshghi:                  Maybe even a data center away.

Chris Evans:                  Okay.

Kam Eshghi:                  Yeah, so we have … I’ll give an example. We have a Fortune 500 cloud service provider that is using us with live production traffic today, and their model is to put a storage server eventually in every rack, and that storage server is not servicing just that rack, but it’s actually got east-west traffic to compute nodes that are anywhere in the data center, and that way they can create a cluster with whatever ratio of storage to compute that they want. And, depending on the application and the workload, the resources could be spread out throughout the data center. They run the workload and then they can reconfigure the resources and set it up for something else. And, by the way, they did even try connections between data centers. Of course, the latency is higher because it’s going across data centers, but it still works. You can still connect. It’s TCP/IP. That’s the beauty of the simplicity of TCP/IP, that you can enable really any to any connection.

Chris Evans:                  Now, the one thing that would strike me as being the most obvious issue here is, within TCP/IP and my own limited networking knowledge … I’m not in any way a networking person. That’s somebody else’s job in the world. But, from my limited knowledge, I know that there is an issue with using TCP/IP in terms of latency, in terms of the way that the network responds and handles retries and various other features. How are you getting over all of those potential risks, because when you’re looking at storage data you can’t afford to drop packets, you can’t afford to lose data. You’ve sort of, somehow, got to marry the two.

Kam Eshghi:                  Absolutely. If you look at the entire latency that an application sees, and what’s contributing to it, most of it actually comes from the SSD itself. So, you’re right in that if your network is already congested, then you have to upgrade the network to be able to support storage over the network. Most of the customers, almost all the customers that we’re talking to, and we’re mostly focused on cloud service providers and also private cloud in enterprise, they already have plenty of headroom in their networking infrastructure. They may be utilizing 20%, so they don’t get into a situation where they’re congested and losing packets.

Kam Eshghi:                  One of the things that we’ve done is on the target side. So, on the client side, we’re relying on the standard NVMe over TCP drivers, which we contributed to the community and it’s upstream and it’s being enhanced by the whole community. On the target side, we have our own optimized stack. And, that optimized stack includes end to end flow control and optimizations that get us much better latency and performance than what you can get with the standard code on the target side. There are things you can do to reduce the packet drops. And then, also, in our Lightbits solution, we have a global FTL, a layer of software that is managing the SSDs that will reduce latency of the SSDs themselves. So, for example, when you avoid conflicts between [reads and writes 00:09:26] going to the SSDs, you can dramatically improve your latency, especially your tail latency, the consistency of the latency. So, we put a lot of emphasis on how do we get better latency out of the SSDs, because that’s the biggest contributor to the overall latency.
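Editor’s note: to make "tail latency" concrete, here is a small sketch using the nearest-rank percentile. The latency figures are invented for illustration and are not Lightbits measurements; the assumption is simply that a small fraction of reads stall behind a write on the same SSD.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical read latencies in microseconds (invented numbers).
quiet = [100.0] * 100                     # no read ever waits behind a write
contended = [100.0] * 97 + [2000.0] * 3   # 3% of reads collide with a write

if __name__ == "__main__":
    for name, data in (("quiet", quiet), ("contended", contended)):
        print(name, "p50:", percentile(data, 50), "p99:", percentile(data, 99))
```

In this toy data the median is identical in both cases while the 99th percentile differs by a factor of twenty, which is why avoiding read/write conflicts shows up mainly in the tail rather than in the average.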

Chris Evans:                  Okay. Let’s think about, then, deployment of the technology because the first and one of the earliest comments you made was the fact that if you’re looking at a large data center and you’re a cloud service provider, or you mentioned Facebook as one of your example customers using effectively generic technology or open compute type technology.

Kam Eshghi:                  Right. They’ve been a partner. That’s right.

Chris Evans:                  Yeah. So, I’m not saying they’re necessarily using it, I was just using that as an example.

Kam Eshghi:                  Yes.

Chris Evans:                  It seems pretty obvious that if you’ve got the ability to put in something where you just have one NIC or two NICs or a small number of NICs in the standard server, this basically gives you flexibility to become much more composable with your infrastructure.

Kam Eshghi:                  Exactly.

Chris Evans:                  So, they could be one thing one day, another thing the next, and I guess even servers could be storage servers as well as host servers. There’s no reason why a storage device would need to be specific to that.

Kam Eshghi:                  Exactly. It’s the same server, right? So, you could … The customer is buying these servers and deploying them and deciding later on what software stack to run, to either turn it into a Lightbits storage server or something else. It can even be an application server. The deployment is very, very simple. We do support different size hardware platforms, right? It could be … We have customers that are using these micro storage servers that only have four SSDs on them, all the way to OCP with 32 SSDs. With Dell, we have a partnership where we are selling through them, through the Dell OEM team, where they take our acceleration card and integrate it into their server, it’s a Dell 740xd server, and they are shipping it to a customer.

Kam Eshghi:                  So, we’re going to see more of that kind of model where if our acceleration card is used, we have a server partner that is actually doing the integration. And, if it’s a software only approach, then the customer can just install it on their own.

Chris Evans:                  Okay, so let’s divide the discussion there slightly, a little. If I was a customer, I could come along and use that technology, and when I say that technology I mean NVMe over TCP because this is now ratified and part of standards.

Kam Eshghi:                  Correct.

Chris Evans:                  And you’ve obviously worked to put that in place and that’s been adopted.

Kam Eshghi:                  Yes.

Chris Evans:                  [crosstalk 00:11:53] Today. So, theoretically, if I wanted to write my own target, I guess I could write my own storage platform if I really wanted to.

Kam Eshghi:                  Yes.

Chris Evans:                  Although, that might be a bit of work. Some cloud providers might choose that as a route. But, effectively, the other part of your company, or the main part of what you’re doing, is selling the target side as a software solution or even as a hardware solution.

Kam Eshghi:                  That’s right. There is a reference code available for the target implementation of NVMe over TCP, which is actually based on the code that we provided to the community, and what we … So, that is available. If anybody wants to pick that up and build a solution around it, that’s one option. Of course, we would love to get more and more adoption of NVMe over TCP. I think that the larger and healthier the ecosystem the better for the whole industry.

Kam Eshghi:                  In terms of our implementation, we optimized the NVMe over TCP stack, which gives you roughly two to three X better performance than the standard code, just from that proprietary target implementation that we have. In addition, we have the management of the SSDs through the back end software that we call Global FTL, which gives you additional benefits in terms of better endurance of the SSDs and lower tail latency. And, you get data services, like compression and erasure coding, that again, if you were to compare to direct attached storage, you don’t get there either.

Kam Eshghi:                  We make better use of the SSDs and provide a more optimized connection with our NVMe over TCP connection.

Chris Evans:                  Okay. We can probably hear the music cranking up a bit outside somewhere. I don’t know where that’s coming from. That seems to be a standard scenario in doing these recordings, which is quite funny. But, I think everybody should still be able to hear us okay. Enjoy the music in the background as well.

Chris Evans:                  So, let’s talk about enterprise customers then.

Kam Eshghi:                  Sure.

Chris Evans:                  I think, from a cloud service provider, I can see that the logic to that is pretty straightforward. They’re going to rack generic hardware, really straightforward [inaudible 00:13:55] and just configure it on demand. What about enterprise customers who maybe want to look for things like ensuring that they’ve got the security policies in place that need to, I guess, build and design like they would have taken, say, some of the [MC 00:14:10] technology that’s on the floor today as a sort of platform. How are they taking the technology and using it? Is there any difference to the way, say, an [ASP 00:14:18] would use it?

Kam Eshghi:                  Yes, there is. In fact, for enterprise we have a product called SuperSSD, which is basically a storage appliance that uses our technology inside of the box, and today that storage appliance is a Dell 740xd that is running our LightOS disaggregation software and has our LightField acceleration card built in. And, it’s a complete solution, so it comes with-

Chris Evans:                  Sorry to interrupt but just remind us about the acceleration card. Did we go into detail about it? I believe we did, didn’t we?

Kam Eshghi:                  Actually, we did not. Let me-

Chris Evans:                  So, let’s just say-

Kam Eshghi:                  Briefly?

Chris Evans:                  Yeah, explain what the card actually does.

Kam Eshghi:                  Sure. The card is a PCI add-on card. It’s really the only piece of hardware that we build, and it’s a storage acceleration card that’s sitting in the back and working with our software to accelerate data reduction, data protection. It does accelerate NVMe over TCP and it also accelerates some of the global FTL functions. So, if you have a storage server platform that has a mid-range CPU inside of it or lower core count, with a card you can still get wire speed performance and all those data services. It’s optional.

Kam Eshghi:                  So, for example, for many AMD EPYC based storage servers where there’s plenty of cores, you don’t need the card. Or, if you have a more premium, like a platinum version of the Intel Xeon, you don’t need the card. But, if you want … There’s a trade off between CPU cores versus the card. So, SuperSSD already has the card built in. It has our software running, it’s in a Dell platform, and we sell it as a complete solution. It’s really plug and play. It comes with management orchestration. It has 200 gig ethernet ports and capacity that goes up to hundreds of terabytes. Very easy to consume. And for enterprise, that seems to be where we’re getting a lot of traction, because they want to buy a complete solution. One thing to add to that is that we’re also working with VMware because vSAN is seeing interest from their customers where they say, “I want to just be able to scale my storage without having to add another HCI node and buy another license.”

Chris Evans:                  Good point.

Kam Eshghi:                  They’re seeing their storage grow faster than their CPU performance. We’re in a partnership now with VMware and we’re building a complete solution with them, allowing vSAN customers to be able to scale storage independently from compute.

Chris Evans:                  Okay, so just out of interest, how does that work? Does that mean your technology will sit alongside and there’ll be some sort of … I was going to say, “What’s the opposite of the target?” The-

Kam Eshghi:                  Initiator.

Chris Evans:                  Initiator device in ESX that will talk to your platform?

Kam Eshghi:                  Yes. So, I won’t get into too much detail, because I could get into trouble for that-

Chris Evans:                  That’s fine but you can make up [crosstalk 00:16:58] for everybody.

Kam Eshghi:                  But, high level, let me just say that if you have any sort of vSAN infrastructure where you have compute nodes that have internal SSDs, but want to be able to scale storage separately, they can attach to a Lightbits storage server. It could be a SuperSSD, and then basically get any size volume that they want, connect it, and then they have access to that volume as if it’s direct attached storage.

Chris Evans:                  Okay, so it’s like the experience of vSAN but not necessarily the hardware installation and process that goes around deploying vSAN.

Kam Eshghi:                  It’s not another vSAN HCI, no. But, it’s extending the storage of existing nodes.

Chris Evans:                  Okay.

Kam Eshghi:                  Yeah.

Chris Evans:                  All right. Great. Okay, so if people want to follow up and find out a bit more about the technology, where would they go on your website? What’s your website and-

Kam Eshghi:                  Lightbitslabs.com.

Chris Evans:                  Okay, brilliant. I’m going to ask you one question to finish off and that’s around the whole idea of where we think this is going to go and where it’s going to end up. I’d imagine that you might say, “Oh, NVMe over TCP will be the platform, the protocol that will take over.” But, do you really think that’s where we’re going to end up? I mean, if you look at the way that customers are consuming NVMe over Fabrics, do you think the TCP implementation is the most logical?

Kam Eshghi:                  I think it’s absolutely going to take off because the whole idea of NVMe over Fabrics makes a lot of sense, but it’s been limited by the difficulty in adopting some of the existing Fabrics. With NVMe over TCP, you completely unleash the potential of NVMe over Fabrics. How is this going to be … What is this replacing? Well, there’s really two areas. One is, as we said, direct attached storage where you already have high performance but the infrastructure is underutilized. You have stranded capacity. There’s no reason to do that anymore because you can do this disaggregation in a very simple way and get equivalent performance.

Kam Eshghi:                  And the second is traditional SAN, like iSCSI, where you have separation and centralized storage, and you have these data services, but you’re getting low performance. So, we can get orders of magnitude faster than iSCSI out of NVMe over TCP. So, essentially, we’re converging the DAS and the traditional SAN into a new approach, which is based on NVMe over Fabrics and is easy to deploy.
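Editor’s note: the "stranded capacity" argument for replacing direct attached storage can be illustrated with simple arithmetic. This is a sketch with invented numbers; the function names, the 4 TB drive size, the per-server usage figures and the 25% headroom are all assumptions made for the example.

```python
from typing import Sequence


def das_utilization(used_tb: Sequence[float], drive_tb: float) -> float:
    """Fraction of installed flash actually in use when every server
    carries its own local drive of size drive_tb."""
    if any(u > drive_tb for u in used_tb):
        raise ValueError("a server uses more than its local drive holds")
    return sum(used_tb) / (drive_tb * len(used_tb))


def pooled_capacity_needed(used_tb: Sequence[float], headroom: float = 0.25) -> float:
    """Capacity a shared pool needs to hold the same data plus headroom."""
    return sum(used_tb) * (1 + headroom)


if __name__ == "__main__":
    # Hypothetical per-server usage in TB; one busy server forces 4 TB drives everywhere.
    used = [1.0, 0.5, 3.5, 1.0]
    installed = 4.0 * len(used)
    print("DAS installed TB:", installed)                          # 16.0
    print("DAS utilization:", das_utilization(used, 4.0))          # 0.375
    print("Pooled TB needed:", pooled_capacity_needed(used))       # 7.5
```

With these made-up figures, local drives leave more than half of the installed flash idle, whereas a disaggregated pool sized for the total demand needs less than half the capacity.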

Chris Evans:                  Okay, all right. Well, we’ll look forward to seeing how it develops over the coming months and years. I think this is still an evolving area and it’s still an interesting area where customers who are probably used to things like iSCSI and Fibre Channel are still trying to work out where they should go with this. So, I think it’ll be great to come back in six months and see where we stand with it. But, for now, Kam, thanks for joining me, I appreciate it, and I’ll catch up with you soon.

Kam Eshghi:                  Thank you, very much. Appreciate the opportunity.

Chris Evans:                  Thanks. Bye, bye.



Kam’s Bio

Kam Eshghi, VP of Strategy & Business Development, Lightbits Labs  Kam joined Lightbits Labs from DellEMC and has over 20yrs of experience in strategic marketing and business development with startups and public companies. Most recently as VP of strategic alliances at startup DSSD, Kam led business development with technology partners and developed DSSD’s partnership with EMC, leading to EMC’s acquisition of DSSD. Previously as Sr. Director of Marketing & Business Development at IDT, Kam built their NVMe Controller business from scratch. Previous to that Kam worked in data center storage, compute and networking markets at HP, Intel, and Crosslayer Networks. Kam is a U.C. Berkeley and MIT graduate with a BS and MS in Electrical Engineering and Computer Science and an MBA.

Copyright (c) 2016-2019 Storage Unpacked.  No reproduction or re-use without permission. Podcast Episode 527A