[IT168 Commentary] Hortonworks, born out of Yahoo, is home to many outstanding Hadoop architects and source-code contributors, who together have contributed more than 80% of the source code of the Apache Hadoop project. With so many Hadoop distributions emerging, how does Hortonworks stand out while holding to its 100% open-source path? For this installment of the IT Hall of Fame, we invited Hortonworks CTO Jeff for an exclusive video interview at the 2015 China Hadoop Summit.
Jeff: I reviewed the course of 2014 and the major industry events of the year: the whole Hadoop ecosystem has grown more mature and more important. On the technical side, I also touched on architecture and SQL-on-Hadoop solutions. In addition, from the perspective of the pure open-source projects, I forecast the trends of the Hadoop ecosystem in 2015.
PiPi: Well said. When we talk about big data today, we often mention the three Vs (volume, variety, velocity). How does Hadoop meet these needs?
Jeff: We at Hortonworks acquired a company called XA Secure and contributed a new project, Apache Ranger, to the Apache Software Foundation. Combined with security features being introduced into the core Hadoop projects, Ranger provides a comprehensive security suite for the Hadoop distribution. Wherever your data is stored in the Hadoop cluster, whether in a Hive table or in HDFS, Apache Ranger can be used to secure it.
Jeff: You said that few people use the Apache distribution directly, and that's true. In fact, when you use the Hortonworks Data Platform, you are using the open-source Apache Software Foundation releases. We firmly believe that open source delivers the best value, enables the best innovation, and brings the best technology into the data center. Everything we do is therefore done within the Apache Software Foundation.
Of course, I also have great respect for the other distributions; products such as Cloudera Manager and Cloudera Navigator do a lot to advance Hadoop. But we have always insisted on keeping our work open source, preserving the purely open-source nature of the whole Hadoop ecosystem. Apart from Hortonworks, no other company still ships 100% open source.
PiPi: Hello, Jeff, nice to meet you! You are well known overseas as Hortonworks' CTO, but Chinese readers may not be so familiar with you. Could you introduce yourself?
Jeff Markham: Sure. My name is Jeff Markham. I am the Technical Director for Asia Pacific at Hortonworks, the provider of the only 100% open-source Hadoop distribution.
PiPi: So, at the China Hadoop Summit, what is your presentation about? Could you share your keynote?
Jeff Markham: Sure. Today I talked about what happened in 2014 and why it was important for the Hadoop ecosystem. We talked a little about architecture, a little about the SQL-on-Hadoop solutions, and then we looked at what is coming in 2015 in the Hadoop ecosystem in terms of what is available in the pure open-source projects.
PiPi: When we talk about big data, we think of Hadoop, so people may ask: does big data equal Hadoop? What do you think of their relationship?
Jeff Markham: That's a good question. Some people say big data is Hadoop; some people say it is not. We have, of course, seen the popularity of big data rise very much in parallel with the popularity of Hadoop. And the reason for Hadoop's popularity, for its huge rate of adoption, is that Hadoop is built on a couple of key things. One, it is pure open source; two, it runs on commodity hardware. That means anybody can start downloading it, experimenting, and finding new ways to process and analyze their data, which before they were never able to do. So in my opinion, yes, big data and Hadoop are so closely related that they can be considered the same thing.
PiPi: Hadoop has been so popular in recent years that it seems everybody is talking about it. How do you see the future of the Hadoop ecosystem?
Jeff Markham: Well, I see the rate of adoption only increasing, for the same reasons I mentioned before. The fact that it is open source and the fact that it runs on commodity hardware enable a company of any size to start ingesting and analyzing data as it never could before. So I see adoption simply increasing during this year. What I think will be important this year is that we move away from how fast one distribution versus another processes a certain query, and start discussing broader-level use cases.
How can industries such as the financial sector, manufacturing, and telcos take the data they have today and use it as a competitive advantage? I don't think we should spend this year discussing who runs a query five seconds faster. I think we need a bigger discussion about the overall value of Hadoop to each individual organization: how can they use it not only to monetize their data, but to gain a better, deeper understanding of their customers, their products, or their services?
PiPi: Yes, good. When we talk about big data, we talk about the three Vs: volume, variety, and velocity. How does Hadoop meet these needs?
Jeff Markham: Those are very common terms associated with Hadoop and big data, the V words, right? Instead of trying to reduce what Hadoop is all about to those particular words, I think we need to see a shift toward simplifying the distributions, so that we can take data into the Hadoop ecosystem in a number of different ways, with technologies like Storm and technologies like Spark.
When I was here this time last year, none of these technologies was a key part of anybody's presentation. Now these technologies are not only key, they are virtually requirements in all the modern Hadoop architectures we see today. So one of the things I do see is that the components, the individual features of each distribution, will only grow in number, and each individual component will grow in functionality and importance within its distribution. I think you will again see the conversation shift away from the details of these components and toward ease of use for the operations team, ease of use for the users and the analysts: how do I address my specific use cases? That is the conversation people should be having, going forward in 2015, when it comes to Hadoop.
PiPi: In my opinion, big data, or Hadoop, is used to turn raw data into US dollars or RMB, but data is valuable and sensitive, so how can we keep it safe while doing data mining?
Jeff Markham: That's a great question. Again, I think it relates to the point I made before: we need to get away from who runs a query ten seconds or five seconds faster, and look at the entire distribution holistically, particularly in the area of security. Contrary to popular belief, security has always been a huge area of focus in the Hadoop community. What we have done at Hortonworks is acquire a company called XA Secure and put its technology into open source at the Apache Software Foundation as a project called Apache Ranger. What Apache Ranger does, in combination with the many security features that are starting to appear in the core Hadoop projects themselves, is provide a comprehensive security suite for the Hadoop distribution. So instead of having siloed security for each component, instead of having fragmented security across each individual distribution, we have for the first time made available, in pure open source, a comprehensive security suite. No matter where your data is stored in your Hadoop cluster, whether in a Hive table, an HBase table, or HDFS itself, that data can be secured in a fully comprehensive manner using the Apache Ranger project. The Hortonworks Data Platform is the only distribution to feature this, and again, it is pure open source.
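[Editor's note] To make the Ranger model concrete, here is a minimal sketch of what a centralized access policy looks like. The field names follow Ranger's public REST API as commonly documented; the service name "hivedev", the database/table names, and the user are illustrative assumptions, and exact field names can vary by Ranger version.

```python
import json

# Hypothetical Ranger policy: let user "analyst1" run SELECT on one Hive table.
# Structure modeled on Ranger's public REST API v2 policy object (assumption).
policy = {
    "service": "hivedev",  # assumed name of the registered Hive service
    "name": "analysts_read_sales",
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["transactions"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "users": ["analyst1"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

# In practice the policy would be created with an authenticated POST, e.g.:
#   requests.post("http://ranger-host:6080/service/public/v2/api/policy",
#                 json=policy, auth=("admin", "password"))
print(json.dumps(policy, indent=2))
```

The point of the design is the one Jeff makes: the same policy store covers Hive, HBase, and HDFS plugins, instead of each component keeping its own ACLs.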
PiPi: Although Hadoop is so popular, few people use the plain Apache distribution. We notice several Hadoop distributions emerging right now, including Cloudera, IBM, Microsoft, Hortonworks, and Amazon. Why are so many distributions emerging now? How do you see the distribution market shaping up?
Jeff Markham: Well, first of all, let me answer the question about the different distributions and different vendors. You say people are not using the plain, standard Apache distribution. In fact, if you are using the Hortonworks Data Platform, you are using the pure Apache Software Foundation distributions.
That's pure open source. Hortonworks doesn't believe in providing any proprietary software, or in imposing any lock-in on a customer that wants to use Hadoop. We believe open source gives you the best value, the best innovation, and the best technology for your data center. So we do all of our work inside the Apache Software Foundation; we have zero proprietary code. If somebody is using the Hortonworks Data Platform, they are in fact using pure Apache Software Foundation projects. Secondly, about the other distributions: I have a lot of respect for them, and they do a lot to advance the cause of Hadoop. But what a lot of distributions besides Hortonworks have done is take some of the core open-source projects and then add proprietary products around them. For example, Cloudera ships products like Cloudera Manager and Cloudera Navigator, closed-source proprietary products whose use cases are addressed in the open-source world by the Apache Ambari project, the Apache Falcon project, and more, all pure open source. That's our philosophy: whatever we need to do to advance the Hadoop ecosystem, we do in pure open source. Otherwise the distributions become fragmented; otherwise we end up in a situation like the one we had with Unix, with many flavors and no single standard, because no one company enforced the purely open nature of the project. With Hadoop, the one company that enforces the purely open nature of the entire ecosystem is Hortonworks, and Hortonworks only. No other company ships 100% pure open source; only Hortonworks does that.
PiPi: What would you say to Chinese CTOs who work on Hadoop and big data?
Jeff Markham: My advice is this. As we work in the Apache Software Foundation, putting all our code out there, what we do every day as an engineering team is make sure that as we build core Hadoop, it leverages your existing skill set and your existing investments in the products you have in your data center. So if you have Oracle, Microsoft SQL Server, Teradata, SAS, Tableau, Spotfire, whatever it is, we want you to be able to use Hadoop, integrate it with the technologies you already have, and leverage your existing skill set. My advice to CTOs is to put Hadoop into your data center, integrated with the products you already have, because we are open source and it is likely we have a partnership with whatever technology you are using today. We want the analytics packages you are using to continue to be used. The only thing your analysts should notice is that they are analyzing more data, and more different kinds of data. That is the perfect state for Hadoop: the end users may not even know they are using it. All they know is that they are working with even more data, and more different types of data.
PiPi: Could you give some advice to people who want to start using Hadoop?
Jeff Markham: For individuals, I'd say the best way to get started is to go to Hortonworks.com and download the Sandbox. The Sandbox is a single virtual machine that people can use for free on their desktop right away. They can run it with VMware, with VirtualBox, or with Hyper-V, on Windows or on Mac, and get started immediately. Download the Sandbox, then follow along with the many tutorials: how to use Hive; how to use Pig; what MapReduce is; what Ranger is; how to configure Ranger to secure your entire ecosystem; how to use Ambari to manage and monitor your cluster. All of this you can do with the Hortonworks Sandbox, free of charge: download it today and start with the free tutorials we provide to become familiar with it right away. Many of our partners also have tutorials available on our website. For example, if you are an application developer, we have a partnership with Cascading, so you can use the Cascading framework to start building your Hadoop-based applications.
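[Editor's note] The MapReduce pattern Jeff mentions in the tutorials can be tried without any cluster at all. This is a minimal, illustrative word-count sketch of the mapper/reducer contract that Hadoop Streaming runs as separate scripts (in real Streaming jobs each function would read stdin and write stdout; the function names and sample lines here are our own):

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map step: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce step: sum the counts per word.
    Hadoop sorts mapper output by key between the two steps,
    which the sorted() call simulates here."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper([
    "Hadoop runs on commodity hardware",
    "Hadoop is open source",
])))
print(counts)
```

The same shape scales from this toy input to a cluster, because the map and reduce steps never share state beyond the sorted key/value stream.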
PiPi: That's all! Thank you very much for the interview!
Jeff Markham: Thank you.