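[IT168 Technology] The defining feature of the big data era is the diversity of data types, and unstructured data of all kinds is becoming the bulk of enterprise data. Gartner predicts that enterprise data will grow by 800% within five years and that 80% of it will be unstructured, with non-business data from groups, communities, and social networks making up most of that growth. This explosive growth of unstructured data poses a serious challenge to traditional databases, and new data management tools are becoming increasingly important.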
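Among these new tools, Hadoop and NoSQL are the two most important categories. This article focuses on graph databases. A graph database is a kind of NoSQL (non-relational) database that applies graph theory to store the relationships between entities. The most common example is the network of relationships between people in a social network. A traditional relational database stores such a network poorly: queries become complex, slow, and costlier than expected. The design of a graph database addresses exactly this weakness.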
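To make the contrast concrete, here is a minimal sketch of a "friends of friends" lookup over an in-memory adjacency list. It is not how Neo4j stores data; the names, the `FRIENDS` constant, and the `friends_of_friends` helper are all hypothetical. The point is only that each extra hop is a direct lookup from node to neighbors, whereas a relational schema would typically need one more self-join per hop.

```ruby
require 'set'

# Hypothetical social graph as an adjacency list: person => direct friends.
FRIENDS = {
  'Alice' => ['Bob', 'Carol'],
  'Bob'   => ['Alice', 'Dave'],
  'Carol' => ['Alice', 'Dave', 'Erin'],
  'Dave'  => ['Bob', 'Carol'],
  'Erin'  => ['Carol']
}.freeze

# People exactly two hops away: friends of friends who are neither the
# person nor already a direct friend.
def friends_of_friends(graph, person)
  direct = graph.fetch(person, [])
  seen = Set.new(direct) << person
  direct.flat_map { |friend| graph.fetch(friend, []) }
        .reject { |candidate| seen.include?(candidate) }
        .uniq
end

puts friends_of_friends(FRIENDS, 'Alice').inspect
```

For `'Alice'` the sketch walks Bob's and Carol's friend lists, drops Alice and her direct friends, and removes duplicates.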
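Common graph databases include Neo4j, FlockDB, AllegroGraph, GraphDB, and InfiniteGraph. Neo4j is a fully ACID-compliant graph database implemented in Java. It stores data on disk in a format optimized for graph networks. The Neo4j kernel is an extremely fast graph engine with all the features expected of a database product, such as recovery, two-phase commit, and XA compliance.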
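The NoSQL Now 2012 conference was held August 21-23 in San Jose, California. At the conference, Andreas Kollegger of Neo Technology used a lunch session to introduce the Neo4j database and to show how to build a graph database quickly with the available tools.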
The example uses the NoSQL Now 2012 conference program as its data set, as shown in the figure:
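First, create a Heroku instance connected to a Neo4j database. Then use a Ruby script built on neography, the Ruby client for the Neo4j REST API written by community star Max De Marzi, to load the conference data into the database. The graph.db directory can also be downloaded from the sample data site for use with a local Neo4j server. The code is as follows: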
require 'neography'

# Lazily create a single REST client for the Neo4j server.
def neo
  @neo ||= Neography::Rest.new("http://localhost:7474")
end

# True if the node already has a relationship of the given type and direction.
def has_rel(node, dir, type)
  res = neo.get_node_relationships(node, dir, type)
  return res && res.size > 0
end

# Add one talk and connect it to its slot, speakers, companies, tags, and audience.
def add_talk(slot, title, speakers, audience, tags)
  root = neo.get_root()
  talk = neo.create_node({:title => title})
  # create_unique_node looks the node up in the named index and creates it only if missing.
  slot = neo.create_unique_node(:slots, :slot, slot, { :slot => slot})
  neo.create_relationship(:at, talk, slot)
  speakers.each do |name, from|
    speaker = neo.create_unique_node(:speakers, :name, name, { :name => name})
    neo.create_relationship(:presents, speaker, talk)
    company = neo.create_unique_node(:companies, :company, from, { :company => from})
    # A speaker gets at most one works_at relationship.
    neo.create_relationship(:works_at, speaker, company) unless has_rel(speaker, :out, :works_at)
  end
  tags.each do |name|
    tag = neo.create_unique_node(:tags, :tag, name, { :tag => name})
    neo.create_relationship(:tagged, talk, tag)
    # Attach each tag to the root node once.
    neo.create_relationship(:tag, root, tag) unless has_rel(tag, :in, :tag)
  end
  who = neo.create_unique_node(:audience, :audience, audience, { :audience => audience})
  neo.create_relationship(:for, talk, who)
end

# Delete every node except the root node (id 0), together with its relationships.
neo.execute_query("start n=node(*) match n-[r?]-m where ID(n)<>0 delete n,r")

# Create the exact-match Lucene indexes used by create_unique_node.
[:slots, :speakers, :companies, :tags, :audience].each do |name|
  neo.create_node_index(name, :exact, :lucene)
end

add_talk("08:30 AM - 09:00 AM",'The Journey to Amazon DynamoDB: From Scaling by Architecture to Scaling by Commandment',
  {'Swami Sivasubramanian'=>'Amazon Web Services'}, 'Technical - Introductory', [ 'Cloud Computing',"NoSQL Architecture and Design"])
add_talk("09:00 AM - 09:45 AM", 'Then Our Buildings Shape Us: A new way to think about NoSQL technology selection',
  {'Tim Berglund'=>'GitHub'}, 'Business / Non-Technical', [ 'NoSQL Architecture and Design', "NoSQL Technology Evaluation"])
add_talk("09:45 AM - 10:00 AM",'Create Powerful New Applications with Graphs',
  {'Emil Eifrem'=>'Neo Technology'}, 'Business / Non-Technical', [ 'Graph Databases'])
add_talk("10:30 AM - 11:15 AM",'Why and When You Should Use Redis',
  {'Josiah Carlson'=>'ChowNow Inc.'}, 'Technical - Introductory', [ 'NoSQL Technology Evaluation'])
...
add_talk("10:30 AM - 11:15 AM",'Intro to Graph Databases 101',
  {'Andreas Kollegger'=>'Neo Technology'}, 'Technical - Introductory', [ 'Graph Databases'])
...
add_talk("01:15 PM - 02:00 PM",'Lunch N Learn with Neo Technology and Neo4j',
  {'Andreas Kollegger'=>'Neo Technology'}, 'Technical - Introductory', [ 'Graph Databases'])
add_talk("02:15 PM - 03:00 PM", 'Using Graph Databases to Analyze Relationships, Risks and Business Opportunities - A Case Study',
  {'Jans Aasman'=>'Franz Inc'}, 'Technical - Introductory', [ 'Graph Databases'])
add_talk("04:15 PM - 04:45 PM", 'High performance graph database using cache, cloud, and standards',
  {'Bryan Thompson'=>'SYSTAP, LLC'}, 'Technical - Advanced', [ 'Graph Databases'])
....
add_talk("04:15 PM - 04:45 PM", 'Introducing Hadoop and Big Data into a Healthcare Organization: A True Story and Learned Lessons',
  {'Vladimir Bacvanski'=>'SciSpike'}, 'Technical - Intermediate', [ 'Big Data'])
add_talk("04:15 PM - 04:45 PM", 'NoSQL Data Modelling for Scalable eCommerce',
  {'Dipali Trivedi'=>'Staples.com'}, 'Technical - Intermediate', [ 'NoSQL Architecture and Design'])
add_talk("05:30 PM - 06:30 PM",'The NoSQL "C Panel"',
  {"Robert Scoble"=>"RackSpace",
   "Bob Wiederhold"=>"Couchbase",
   "Dwight Merriman"=>"10gen",
   "Emil Eifrem"=>"Neo Technology",
   "Jay Jarrell"=>"Objectivity, Inc.",
   "Kirk Dunn"=>"Cloudera, Inc."},
  "Business / Non-Technical",
  ["Graph Databases", "Hadoop", "MongoDB"])
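The import script relies on "get or create" behavior: a speaker who gives several talks should end up as a single node with several `presents` relationships. The following is a minimal in-memory sketch of that idea. `TinyGraph` and `demo_unique_speaker` are hypothetical stand-ins written for illustration; they are not part of neography and do not talk to a Neo4j server.

```ruby
# In-memory stand-in (NOT the neography API) for "get or create" node semantics.
class TinyGraph
  attr_reader :nodes, :relationships

  def initialize
    @nodes = []
    @relationships = []
    # index name => { [key, value] => node }
    @indexes = Hash.new { |hash, name| hash[name] = {} }
  end

  # Always create a new node.
  def create_node(props)
    node = { id: @nodes.size, props: props }
    @nodes << node
    node
  end

  # Return the node already indexed under (index, key, value), or create it.
  def create_unique_node(index, key, value, props)
    @indexes[index][[key, value]] ||= create_node(props)
  end

  def create_relationship(type, from, to)
    @relationships << { type: type, from: from[:id], to: to[:id] }
  end
end

# Two talks by the same speaker, taken from the conference data above.
def demo_unique_speaker
  graph = TinyGraph.new
  ['Intro to Graph Databases 101',
   'Lunch N Learn with Neo Technology and Neo4j'].each do |title|
    talk = graph.create_node({ title: title })
    speaker = graph.create_unique_node(:speakers, :name, 'Andreas Kollegger',
                                       { name: 'Andreas Kollegger' })
    graph.create_relationship(:presents, speaker, talk)
  end
  graph
end

graph = demo_unique_speaker
puts "nodes: #{graph.nodes.size}, relationships: #{graph.relationships.size}"
```

In this sketch the two talks produce three nodes (two talks and one speaker) and two `presents` relationships that both start at the same speaker node.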
Andreas's slides, titled "NoSQL Now Zero To Hero", introduce the graph database Neo4j and its query language, Cypher.
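To spark readers' creativity, Neo Technology also prepared more advanced queries based on this data set. The queries can be run, and the data browsed, through Neo4j's cloud-hosted web interface. Below is a screenshot of that interface: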
start abk=node:speakers(name="Andreas Kollegger")
return abk;
return properties & id:
start abk=node:speakers(name="Andreas Kollegger")
return abk.name, id(abk);
follow relationships:
start abk=node:speakers(name="Andreas Kollegger")
match abk-[:presents]->talk
return talk.title;
start abk=node:speakers(name="Andreas Kollegger")
match abk-[:presents]->talk-[:at]->slot
return talk.title,slot.slot;
which other talks are during those slots:
start abk=node:speakers(name="Andreas Kollegger")
match abk-[:presents]->talk-[:at]->slot<-[:at]-other
return talk.title,slot.slot, other.title;
group them into a collection, and count them:
start abk=node:speakers(name="Andreas Kollegger")
match abk-[:presents]->talk-[:at]->slot<-[:at]-other
return talk.title,slot.slot, collect(other.title) as others, count(*) as cnt;
only see those talks that have more than one competing talk in the same slot:
start abk=node:speakers(name="Andreas Kollegger")
match abk-[:presents]->talk-[:at]->slot<-[:at]-other
with talk, count(*) as cnt
where cnt>1
return talk.title,cnt;
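As a rough illustration of what the pattern `abk-[:presents]->talk-[:at]->slot<-[:at]-other` matches, the sketch below reproduces the grouping and counting in plain Ruby over a small subset of the talks listed in the import script. The constants and the `competing_talks` helper are hypothetical, and because the script elides part of the program, the counts here are lower than on the full data set.

```ruby
# Subset of the conference data from the import script: talk title => slot.
TALK_SLOTS = {
  'Why and When You Should Use Redis'           => '10:30 AM - 11:15 AM',
  'Intro to Graph Databases 101'                => '10:30 AM - 11:15 AM',
  'Lunch N Learn with Neo Technology and Neo4j' => '01:15 PM - 02:00 PM'
}.freeze

# Speaker => talks presented (the :presents relationships).
PRESENTS = {
  'Andreas Kollegger' => ['Intro to Graph Databases 101',
                          'Lunch N Learn with Neo Technology and Neo4j']
}.freeze

# For each talk by the speaker, collect the other talks in the same slot.
# Like the Cypher pattern, a talk is never its own competitor, and a talk
# without competitors produces no row at all.
def competing_talks(speaker, presents, talk_slots)
  presents.fetch(speaker, []).each_with_object({}) do |talk, rows|
    slot = talk_slots[talk]
    others = talk_slots.select { |title, s| s == slot && title != talk }.keys
    next if others.empty?
    rows[talk] = { slot: slot, others: others, cnt: others.size }
  end
end

rows = competing_talks('Andreas Kollegger', PRESENTS, TALK_SLOTS)
rows.each { |talk, row| puts "#{talk} | #{row[:slot]} | #{row[:others].inspect} | #{row[:cnt]}" }

# Equivalent of "with talk, count(*) as cnt where cnt>1".
busy = rows.select { |_talk, row| row[:cnt] > 1 }
puts "talks with more than one competitor: #{busy.keys.inspect}"
```

On this subset only 'Intro to Graph Databases 101' has a competitor, so the `cnt > 1` filter returns nothing.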
slots are connected with a next relationship, show all slots:
start n=node(2)
match p=n-[:next*0..]->current
return current.slot;
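The variable-length pattern `n-[:next*0..]->current` follows zero or more `next` relationships, so it returns the start slot itself and then every slot reachable from it. A minimal sketch of that walk in plain Ruby follows. The `NEXT_SLOT` chain and the `slots_from` helper are assumptions for illustration: the import script shown earlier does not create the `next` relationships, and only the first four slots from that script are linked here.

```ruby
# Assumed :next chain between the first four slots of the program.
NEXT_SLOT = {
  '08:30 AM - 09:00 AM' => '09:00 AM - 09:45 AM',
  '09:00 AM - 09:45 AM' => '09:45 AM - 10:00 AM',
  '09:45 AM - 10:00 AM' => '10:30 AM - 11:15 AM'
}.freeze

# Follow zero or more :next hops from the start slot.
# The start slot is included, which mirrors the lower bound 0 in "*0..".
def slots_from(start, next_slot)
  chain = []
  current = start
  # Stop at the end of the chain, or if a cycle would revisit a slot.
  while current && !chain.include?(current)
    chain << current
    current = next_slot[current]
  end
  chain
end

slots_from('08:30 AM - 09:00 AM', NEXT_SLOT).each { |slot| puts slot }
```

Starting from the last slot in the chain returns just that slot, which corresponds to the zero-length path.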
show the talks at the slot:
start n=node(2)
match p=n-[:next*0..]->current<-[:at]-talk
return current.slot, talk.title;
all talks with the tag Graph Databases:
start tag=node:tags(tag="Graph Databases")
match tag<-[:tagged]-talk
return talk;
which companies talk about graph databases:
start tag=node:tags(tag="Graph Databases")
match tag<-[:tagged]-talk<-[:presents]-speaker-[:works_at]->company
return talk,speaker,company;
which companies speak about graph databases (with a surprise):
start tag=node:tags(tag="Graph Databases")
match tag<-[:tagged]-talk<-[:presents]-speaker-[:works_at]->company
return distinct company.company;
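The effect of `return distinct company.company` can be sketched in plain Ruby using only the talks tagged 'Graph Databases' in the import script. `GRAPH_DB_TALKS` and `distinct_companies` are hypothetical names. The sketch also ignores that the script elides part of the program and that Cypher does not guarantee any result order without `order by`.

```ruby
# Talks tagged 'Graph Databases' in the import script: title => { speaker => company }.
GRAPH_DB_TALKS = {
  'Create Powerful New Applications with Graphs' =>
    { 'Emil Eifrem' => 'Neo Technology' },
  'Intro to Graph Databases 101' =>
    { 'Andreas Kollegger' => 'Neo Technology' },
  'Lunch N Learn with Neo Technology and Neo4j' =>
    { 'Andreas Kollegger' => 'Neo Technology' },
  'Using Graph Databases to Analyze Relationships, Risks and Business Opportunities - A Case Study' =>
    { 'Jans Aasman' => 'Franz Inc' },
  'High performance graph database using cache, cloud, and standards' =>
    { 'Bryan Thompson' => 'SYSTAP, LLC' },
  'The NoSQL "C Panel"' =>
    { 'Robert Scoble' => 'RackSpace', 'Bob Wiederhold' => 'Couchbase',
      'Dwight Merriman' => '10gen', 'Emil Eifrem' => 'Neo Technology',
      'Jay Jarrell' => 'Objectivity, Inc.', 'Kirk Dunn' => 'Cloudera, Inc.' }
}.freeze

# Walk talk -> speaker -> company and keep each company once.
def distinct_companies(talks)
  talks.values.flat_map(&:values).uniq
end

puts distinct_companies(GRAPH_DB_TALKS)
```

One possible reading of the "surprise" is visible even in this subset: because the panel is tagged 'Graph Databases', the companies of all its panelists appear in the result, not only graph database vendors.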