
Interview: As Vendors Battle over Big Data, Can Hadoop Stay Open Source?

  [IT168 Commentary] Hortonworks, spun out of Yahoo!, employs many of Hadoop's leading architects and committers, who together have contributed more than 80% of the source code to the Apache Hadoop project. With Hadoop distributions proliferating, how can Hortonworks stand out while sticking to a 100% open-source path? For this installment of the IT Hall of Fame, we sat down with Jeff Markham of Hortonworks at the 2015 China Hadoop Summit for an exclusive video interview.

  PiPi: Jeff, great to meet you! As Hortonworks' CTO you are well known overseas, but Chinese readers may not know you yet. Could you introduce yourself?

  Jeff: Of course. My name is Jeff, and I am Hortonworks' Technical Director for Asia Pacific. We are the vendor of a pure open-source Hadoop distribution.

  PiPi: What was the topic of your talk at the 2015 China Hadoop Summit? Could you share the main points with us?

  Jeff: I looked back at 2014 and the year's major industry events; the Hadoop ecosystem as a whole has become steadily more mature and more important. On the technical side I also covered architecture and SQL-on-Hadoop solutions. Finally, from the perspective of the pure open-source projects, I offered a forecast of how the Hadoop ecosystem will develop in 2015.

  PiPi: When people talk about big data, they think of Hadoop, so some may wonder: is big data the same thing as Hadoop? What is the relationship between the two?

  Jeff: That's a good question. Some people say big data is Hadoop; others say it is not. There is no doubt that big data is unstoppable and growing ever more popular, and there are reasons behind that. One is that Hadoop is pure open source with a huge user community; another is that mature commodity hardware is now available to support it, and many hands make the work light.

  That means anyone can download it, experiment with it, and discover new ways to process and analyze data, something we simply could not do before. So in my view, big data and Hadoop are so closely related that they can be treated as one and the same.

  PiPi: Hadoop has become widely known in recent years; almost everyone is talking about it. How do you see the future of the Hadoop ecosystem?

  Jeff: I am very optimistic about the future of the Hadoop ecosystem. Because it is open source and has solid hardware support, companies of any size can collect and analyze data in ways that were never possible before. For our part, we will shift our focus away from the technical details of the distributions and toward the much broader space of user scenarios.

  How can different industries, such as the financial sector, manufacturing, or telecoms, use the data they have today to maintain a competitive edge? What we really need to discuss is the overall value Hadoop can bring to each organization. For an enterprise, the point is not only to mine data for profit, but to use it to gain a better, deeper understanding of its customers, products, and services.

  PiPi: Well said. When we discuss big data today, we often mention the three Vs (volume, variety, velocity). How does Hadoop meet those needs?

  Jeff: Right, those are common terms closely associated with Hadoop and big data. What I think about more is how to simplify the distributions themselves, so that we can bring data into the Hadoop ecosystem in new ways, for example with recently emerging technologies such as Storm and Spark.

  PiPi: I often think of big data and Hadoop as a way to turn raw data into dollars or RMB. But data is valuable, and some of it is highly sensitive. How do we keep data secure during data mining?

  Jeff: These technologies are very important; in the Hadoop architectures we see today they are not just present but indispensable. Whether it is Hadoop, Storm, or Spark, their functionality keeps growing and they will only become more important. Going forward, we think people will increasingly want to discuss questions like Hadoop use cases.

  When it comes to Hadoop, we may no longer focus on query performance tuning; security is going to become the Hadoop community's new focus.

  At Hortonworks we acquired a company called XA Secure and contributed a new project, Apache Ranger, to the Apache Software Foundation. Combined with the security features being introduced into the core Hadoop projects themselves, it provides a comprehensive security suite for the Hadoop distribution. Within that suite, wherever your data is stored in the Hadoop cluster, whether in a Hive table or in HDFS, the Apache Ranger project can be used to keep it secure.

  PiPi: Although Hadoop is booming, few people use the plain Apache distribution directly. At the same time, we see more and more Hadoop distributions emerging, from Cloudera, IBM, Microsoft, Hortonworks, Amazon, and others. Could you talk about how these distributions are positioned in the big-data market?

  Jeff: You are right that few people use the plain Apache distribution directly. In fact, when you use the Hortonworks Data Platform, you are using open-source Apache Software Foundation releases. We firmly believe that open source delivers the best value, the best innovation, and the best technology for the data center, so everything we do is done inside the Apache Software Foundation.

  Of course, I have great respect for the other distributions, with products such as Cloudera Manager and Cloudera Navigator, and they play an important role in this world. But we have always insisted on keeping everything open source, preserving the pure open-source nature of the whole Hadoop ecosystem. Apart from Hortonworks, no other company ships a 100% open-source distribution.

  PiPi: What advice do you have for CTOs working in big data in China?

  Jeff: When we work inside the Apache Software Foundation developing the core Hadoop code, we make sure it can leverage the technology and investments you already have in your data center. Whether you run Oracle, SQL Server, Teradata, or another database, what we want is to integrate Hadoop with your existing technology and maximize its value. So my advice to CTOs is: put Hadoop into your data center and integrate it with the products you already have, because it is open source.

  PiPi: And what advice would you give to individuals working in big data in China?

  Jeff: For individuals, my advice is to go to the official site, Hortonworks.com, and download the Sandbox. It is a virtual machine anyone can use, runs free of charge on the desktop, supports both Windows and Mac, and can be run in either VMware or VirtualBox.

  As enterprise data grows in volume and variety, Hadoop is coming into its own. Many end users may not even realize it, yet they are using Hadoop every day; all they notice is that the data they work with keeps getting larger and more complex.

    For full details, see the original English interview below:

  PiPi: Hello, Jeff, nice to meet you! You are well known overseas as Hortonworks' CTO, but Chinese readers may not be so familiar with you. Could you introduce yourself?

  Jeff Markham: Sure. My name is Jeff Markham. I am the Technical Director for Asia Pacific at Hortonworks, the provider of the only pure open-source Hadoop distribution.

  PiPi: What was your presentation at the China Hadoop Summit? Could you share your keynote?

  Jeff Markham: Sure. Today I talked about what happened in 2014 and its importance to the Hadoop ecosystem. We talked a little about architecture, we talked a little about SQL-on-Hadoop solutions, and then we looked at what is coming in 2015 in the Hadoop ecosystem in terms of what is available in the pure open-source projects.

  PiPi: When we talk about big data, we think of Hadoop, so people may ask: does big data equal Hadoop? How do you see their relationship?

  Jeff Markham: That's a good question. Some people say big data is Hadoop; some people say it is not. We have, of course, seen the rise in the popularity of big data very much in parallel with the rise in the popularity of Hadoop. The reason for that popularity, for the huge rate of adoption, is that Hadoop is built on a couple of key things. One is that it is pure open source; two is that it runs on commodity hardware. That means anybody can start downloading, experimenting, and finding new ways to process and analyze their data. Before, they were never able to do that. So in my opinion, yes, I think big data and Hadoop are so closely related that they can be considered the same thing.

  PiPi: Hadoop has been so popular in recent years that everybody seems to be talking about it. How do you see the future of the Hadoop ecosystem?

  Jeff Markham: Well, I see the rate of adoption only increasing, for the same reasons I mentioned before. The fact that it is open source and that it runs on commodity hardware enables a company of any size to start ingesting and analyzing data as it never could before. So I only see the rate of adoption increasing this year. What I think is going to be important this year is that we move away from how fast one distribution versus another processes a certain query, and start discussing broader use cases.

  How can industries such as the financial sector, manufacturing, or the telcos take the data they have today and use it as a competitive advantage? I don't think we need to spend this year discussing who runs which query five seconds faster. I think we need a bigger discussion about the overall value of Hadoop to each individual organization. How can they use it not only to monetize their data, but to gain a better, deeper understanding of their customers, their products, or their services?

  PiPi: Good. When we talk about big data, we talk about the three Vs: volume, variety, and velocity. How does Hadoop meet these needs?

  Jeff Markham: Those are very common terms associated with Hadoop and big data, the V words, right? Instead of trying to simplify what Hadoop is all about through those particular words, I think we are once again going to see a shift toward simplifying the distributions, so that we can take data into the Hadoop ecosystem in a number of different ways, with technologies like Storm and Spark.

  When I was here this time last year, none of these technologies were key in anybody's presentation. Now they are not only key, they are virtually requirements in all the modern Hadoop architectures we see today. So one thing I do see is that the components, the individual features of each distribution, are only going to grow in number, and each individual component will grow in its own functionality and importance to that particular distribution. I think you are going to see the conversation shift away from the details of these components and toward ease of use for the operations team and ease of use for the users, the analysts: how do I address my specific use cases? That is the conversation people are going to have in 2015 and beyond when it comes to Hadoop.

  PiPi: In my opinion, big data and Hadoop are used to turn raw data into US dollars or RMB. But data is valuable and sensitive, so how can we keep it safe during data mining?

  Jeff Markham: That's a great question. Again, I think it really relates to the issue I mentioned before: we have got to get away from who runs which query ten or five seconds faster and look at the entire distribution holistically, particularly in the area of security. Security has always been, contrary to popular belief, a huge area of focus in the Hadoop community. What we have done at Hortonworks is acquire a company called XA Secure and put it into open source at the Apache Software Foundation as a project called Apache Ranger. What Apache Ranger does, in combination with the many security features that are starting to appear in the core Hadoop projects themselves, is provide a comprehensive security suite for the Hadoop distribution. So instead of having siloed security for each component, instead of having fragmented security across each individual distribution, we have for the first time made available, in pure open source, a comprehensive security suite. No matter where your data is stored in your Hadoop cluster, whether in a Hive table, an HBase table, or HDFS itself, that data can be secured in a fully comprehensive manner using the Apache Ranger project. The Hortonworks Data Platform is the only distribution to feature this, and again, it is pure open source.

  PiPi: Although Hadoop is so popular, few people use the plain Apache distribution. Meanwhile, we see several Hadoop distributions emerging, including Cloudera, IBM, Microsoft, Hortonworks, and Amazon. Why are so many distributions emerging now? How do you see the distribution market shaping up?

  Jeff Markham: Well, first of all, let me answer the question about the different distributions and vendors. You say people are not using the plain, standard Apache distribution. In fact, if you are using the Hortonworks Data Platform, you are using the pure Apache Software Foundation distributions.

  That's pure open source. Hortonworks does not believe in providing any proprietary software or any lock-in for a customer that might want to use Hadoop. We believe open source gives you the best value, the best innovation, and the best technology for your data center. So we do all our work inside the Apache Software Foundation; we have zero proprietary code. If somebody is using the Hortonworks Data Platform, they are in fact using pure Apache Software Foundation projects. Secondly, on the other distributions: I have a lot of respect for them, and they do a lot to advance the cause of Hadoop. But what a lot of distributions besides Hortonworks have done is take some of the core open-source projects and then add proprietary products around them. At Cloudera, for example, we see products like Cloudera Manager and Cloudera Navigator, closed-source proprietary products, whose use cases are addressed in the open-source world by the Apache Ambari project and the Apache Falcon project. These projects address the use cases of Cloudera Manager, Cloudera Navigator, and more, yet are pure open source. That is our philosophy: whatever we need to do to advance the Hadoop ecosystem, we need to do in pure open source. Otherwise the distributions become fragmented; otherwise we have a situation like we had with Unix, with many flavors and no one standard, because there was no one company to enforce the pure open nature of the project. With Hadoop, the one company that enforces the pure open nature of the entire Hadoop ecosystem is Hortonworks, and Hortonworks only. There is no other company that ships 100% pure open source; only Hortonworks does that.

  PiPi: What would you say to Chinese CTOs who work on Hadoop and big data?

  Jeff Markham: My advice is this. When we work in the Apache Software Foundation, putting all our code out there, what we do every day as an engineering team is make sure that as we build core Hadoop, it leverages your existing skill set and your existing investments in the products you have in your data center. So if you have Oracle, Microsoft SQL Server, Teradata, SAS, Tableau, Spotfire, whatever it is, we want you to be able to use Hadoop, integrate it with the technologies and investments you already have, and leverage your existing skill set. So my advice to the CTO is to put Hadoop into your data center, integrated with the products you already have, because we are open source: it is likely we have a partnership with whatever technology you are using today. We want the analytic software package you are using to continue to be used. The only thing your analysts will know is that they are analyzing more data and more different kinds of data. That is the perfect state for us with Hadoop: the end users may not even know they are using Hadoop; all they know is that they are working with even more data and more different types of data.

  PiPi: Could you give some advice for people who want to start using Hadoop?

  Jeff Markham: For individuals, I'd say the best way to get started is to go to Hortonworks.com and download the Sandbox. The Sandbox is a single virtual machine that people can use for free on their desktop right away. They can use it with VMware, with VirtualBox, or with Hyper-V, on Windows or on Mac, and get started right away. Download the Sandbox, then follow the many tutorials: how to use Hive; how to use Pig; what MapReduce is; what Ranger is; how to configure Ranger to secure the entire ecosystem; how to use Ambari to manage and monitor a cluster. All of this you can do with the Hortonworks Sandbox, free of charge. Download it today and start with the free tutorials to become familiar with it right away. Many of our partners also have tutorials available on our website. For example, if you are an application developer, we have a partnership with Cascading, so you can start using the Cascading framework to build your Hadoop-based applications.
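  [Editor's note] The tutorials mentioned above cover MapReduce, Hadoop's core processing model. As a purely illustrative sketch (plain Python, not Hadoop code), the map, shuffle, and reduce phases of the classic word-count example look like this:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is Hadoop", "Hadoop is open source"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["hadoop"])  # 2
print(counts["is"])      # 2
```

  In a real Hadoop job the same three phases run distributed across the cluster, with the framework handling the shuffle between nodes.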

  PiPi: That's all! Thank you very much for the interview.

  Jeff Markham: Thank you.

    For more guest interviews, follow our Hall of Fame column: http://www.itpub.net/star/

    As the largest technical event in China's database and big-data field, the 6th China Database Technology Conference (DTCC 2015) will be held April 16-18, 2015 at the Crowne Plaza Beijing New Yunnan Hotel. Under the theme "Big Data Technology Exchange and Value Discovery", the conference brings together top experts from home and abroad and features a dedicated big-data track. Register at: http://dtcc.it168.com/
