DRAFT: Why proponents of marriage equality should love graph databases, Part 2: A Reply to 'The Database Engineering Perspective'

Written on 11:37:00 AM by S. Potter

In part 1 I provided an overview, based on my experience, of the problems relational databases (RDBMSes) have with modelling relationships, as well as entities whose data is only partially structured in form and definition.

This post will:

  1. introduce the new concepts used in graph databases and the specific terminology of the Neo family of databases
  2. review the most flexible schemas presented in the blog post I am responding to, 'Gay Marriage: The Database Engineering Perspective', i.e. those that accommodate same-sex and opposite-sex marriages alike
  3. describe how graph databases could overcome the relational database shortcomings
  4. finally, present a snippet of code (not meant for production use, merely to demonstrate how we can use Neo4J) showing how much more natural it is to represent richer relationships between entities (or nodes, as Neo4J calls them) when those relationships have substance of their own (i.e. attributes, in this case)

New Concepts in Graph Databases

Before I can explain how, at a conceptual level, graph databases can overcome most if not all of the relational database (RDBMS) shortcomings mentioned in the previous post, we need to introduce some new concepts and terminology.
There are two basic classes of objects in a graph database:
  • Nodes: a node is basically an entity as RDBMS people (like, probably, you and I) are familiar with. A node could represent a person, customer, blog post, photograph, video or tweet. It is neither always accurate nor a good idea to think of graph database concepts only in terms of RDBMS concepts, but as a rough analogy most tables that do not represent a relationship or association between other tables can be thought of as entities. It is a simplification, so use it cautiously and only in this initial learning phase. A node can have zero or more properties.
  • Relationships: a relationship is an association between two nodes that needs to be represented in your data store somehow. A relationship can also have zero or more properties, just like a node.

The third concept in graph databases is the property. Because both nodes and relationships can carry properties, both are first-class citizens of the data store, which is quite unlike relational databases.
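
To make this concrete, here is a minimal sketch using Neo4J's embedded Java API of two nodes and a relationship between them, each carrying its own properties. This is not production code, and the node names, property keys and the KNOWS relationship type are purely illustrative assumptions of mine:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class PropertiesSketch {
        // Illustrative relationship type, not part of any finished schema.
        enum RelTypes implements RelationshipType { KNOWS }

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("var/graphdb");
            Transaction tx = db.beginTx();
            try {
                // Nodes carry only the properties they actually have.
                Node alice = db.createNode();
                alice.setProperty("name", "Alice");
                alice.setProperty("twitter", "alice_example");
                Node bob = db.createNode();
                bob.setProperty("name", "Bob"); // no twitter property, no null needed

                // The relationship is just as capable of holding properties.
                Relationship knows = alice.createRelationshipTo(bob, RelTypes.KNOWS);
                knows.setProperty("since", "2008-06-01");
                tx.success();
            } finally {
                tx.finish();
            }
            db.shutdown();
        }
    }

Both the nodes and the relationship end up as equally queryable, property-carrying records in the store, which is exactly the point about first-class citizens made above.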

In a relational database we handle an association between two entities/tables in one of a small handful of ways: use foreign-key references (if supported by that particular vendor and version of RDBMS), create "join tables", or force a natural relationship to become a table itself so that we can capture pertinent attributes for it. There is no other way of handling relationships. In many cases reducing the problem in these ways is not an issue; however, the more interconnected our entities become, with ever new types of relationships, the more quickly (it seems to me) this method of modelling data becomes problematic.

The following are other useful terms used by the Neo graph database family; conceptually, most of these have equivalents in other graph databases:
  • Traversers: in graph databases "querying" for specific data is not done via declarative query statements and clauses like SQL. Rather we define traversers that are composed of the following elements:
    • the starting node
    • the relationship types needing to be traversed
    • the stop criteria
    • the selection criteria
    • traversing order (e.g. breadth first, etc.)
    We need to know the starting node so the graph database knows where to begin the traversal. The traverser then only moves along relationships of the types given as the second element, evaluates the selection criteria to decide which nodes are relevant to the "query", and evaluates the stop criteria to know when traversing should stop (see the small code sketch after this list).
  • Indexers: in Neo4J there are several indexing utilities, backed by Lucene, that make it easy to index nodes, full text and timelines. This allows us to look up relevant starting nodes for traversing.
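
To give a feel for what a traverser looks like in code, here is a small sketch continuing the earlier PropertiesSketch example (reusing its alice node and KNOWS relationship type). It is written against the Neo4J 1.x embedded Java API as I remember it, purely for illustration:

    // All types below are from org.neo4j.graphdb; this continues the earlier sketch,
    // where 'alice' was created (or could have been looked up via an indexer).
    Traverser knownByAlice = alice.traverse(
            Traverser.Order.BREADTH_FIRST,            // traversing order
            StopEvaluator.END_OF_GRAPH,               // stop criteria: keep going until the graph runs out
            ReturnableEvaluator.ALL_BUT_START_NODE,   // selection criteria: every node except alice herself
            PropertiesSketch.RelTypes.KNOWS,          // relationship type(s) to traverse
            Direction.BOTH);                          // follow the relationship in either direction
    for (Node person : knownByAlice) {
        System.out.println(person.getProperty("name", "<unnamed>"));
    }

Notice how the elements listed above map onto the call: the starting node is the object we call traverse() on, and the order, stop criteria, selection criteria and relationship types are its arguments.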

How can graph databases overcome relational database shortcomings?

Now we will have a quick peek at how the new concepts and ideas from graph databases can resolve the complaints I had about relational databases:
  • In response to A lot of real world data isn't highly structured (only partially): with graph databases we only need to set the properties a specific node actually has. We do not need to fill in lots of nulls for attributes that aren't relevant, like we do in RDBMSes. Our graph database "model" interface might be able to add any constraints that are necessary. In situations where highly structured data sets really do occur, I prefer having two sets of constraints - those at the relational database level and those in the "model" (application tier) - however, it can quickly become hard to manage two sets of constraints that should be identical but are defined in vastly different ways (languages).
  • In response to Object to relational mapping (ORM) constraints and disjoints: because the graph database layer already thinks in terms of nodes and relationships, there is little if any translation between the 'objects' in your application and the nodes or relationships they represent.
  • In response to Weak and inefficient "traversal" support: this isn't a problem with graph databases. Traversing data is much more relevant when dealing with data that is naturally in a network or graph formation already. Relational databases cannot handle networks or graph-like data sets without a lot of workarounds.
  • In response to Maintaining and evolving relational schemas: when the data sets you are dealing with are not highly structured and densely populated, there is no strict schema to maintain and evolve; with Neo4J's flexible property setting/getting, data structures can evolve as the data itself changes. This does raise another issue - application code must still be able to read the old attributes in the graph database - but I will talk about that in Part 3.

Gay marriage: the graph database solution
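
Below is a rough first cut of the snippet promised above. It is not production code; the person names, property keys, property values and the MARRIED_TO relationship type are purely illustrative assumptions of mine. The point is that a marriage is simply a relationship between two person nodes, carrying its own attributes, and nothing in the model cares about the gender of either node, so same-sex and opposite-sex marriages are represented identically:

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.Transaction;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class MarriageSketch {
        // Illustrative relationship type, not a finished schema.
        enum RelTypes implements RelationshipType { MARRIED_TO }

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("var/marriages");
            Transaction tx = db.beginTx();
            try {
                Node chris = db.createNode();
                chris.setProperty("name", "Chris");
                Node pat = db.createNode();
                pat.setProperty("name", "Pat");

                // The marriage is a first-class relationship carrying its own attributes;
                // neither person node needs a gender property for the model to work.
                Relationship marriage = chris.createRelationshipTo(pat, RelTypes.MARRIED_TO);
                marriage.setProperty("since", "2009-04-27");
                marriage.setProperty("jurisdiction", "Iowa");
                tx.success();
            } finally {
                tx.finish();
            }
            db.shutdown();
        }
    }

Because a person node can have any number of MARRIED_TO relationships, the same shape extends naturally to the polyamorous case I will cover in Part 3, without changing the model at all.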

Why proponents of marriage equality should love graph databases, Part 1: A Reply to 'The Database Engineering Perspective'

Written on 10:28:00 AM by S. Potter

This is a response to the excellent Gay marriage: the database engineering perspective blog post from November 2008 by Sam Hughes. The one discussion missing from his blog post was a NoSQL alternative to the gay marriage database problem, but I will not hold that against Sam Hughes because his was a comprehensive look at relational database schema designs for modelling marriage!

This response is a little dabble (i.e. not as thorough as the original post) into how graph databases could model monogamous same-sex marriages without much problem and also allow for polyamorous marriages quite naturally. My goal here is to demonstrate how graph databases provide a great amount of flexibility without very much work at all, especially when the application is just as concerned with the connections between entities (aka records in an RDBMS, or nodes in graph databases) as with the entities themselves.

First off I want to look at the inherent problems with modelling real world data in relational databases and how the approach of graph databases can overcome many, if not all, of these problems. Then I will launch into a specific snippet of code to demonstrate modelling marriages that can be between any two consenting adults rather than just one man and one woman, and I hope to outline how this could be extended to model polyamorous (including polygamous) relationships without much more work. Near the end I will also include a section that discusses shortcomings of the graph database approach, for completeness. My discussion will be focused on using Neo4J as it is the only comprehensive graph database I am aware of that is open source.

Update (2010-03-13): Boris (in the comments) mentioned that there is another open source graph database called HyperGraphDB.

Update (2010-04-03): Johannes (via email) told me about InfoGrid, which looks like a very interesting alternative to Neo4J with a slightly different way of doing things.

What is wrong with relational databases?

For the last 14 years I have been using, designing, maintaining and administering relational databases in some capacity as a software programmer, developer, engineer and now as an applications architect. During that time I have found the following problems with using relational databases:
  • A lot of real world data isn't highly structured (only partially): this is probably the biggest problem with using relational databases (for everything) the way they are supposed to be used. Sure, you can add a blob, clob or text field to a table and stuff in an arbitrary structure of data that varies by record, which the application can parse, but that would be a violation of all things relational and you would pay a heavy price for doing this in a relational database, depending on how big these "blobs" generally are. Not to mention you miss out on query-ability, which is something relational databases do well on highly structured datasets with the relevant indexes defined and a decent attempt at normalization. Note: I think relational databases are a fine thing when used to model data that truly is highly structured in the wild, where the entities, not the relationships, matter most.
  • Object to relational mapping (ORM) constraints and disjoints: for almost a decade I have been using ORM libraries such as TOPLink (Java), Persistence (C++), Hibernate (Java), ActiveRecord (Ruby), DataMapper (Ruby), SQLObject (Python) and others. They each had their own specific problems at times, but they also shared a set of common problems that made for a far from seamless integration between the object oriented layer (in which a vast number of business systems and web applications are written today) and the distinct properties of a relational database. This common object-to-relational disjointedness is not an ORM tool issue; rather, it is a problem of trying to shove a round pin into a square hole (paraphrasing a comment made on page 3 of 'The Neo Database: A Technology Introduction').
  • Weak and inefficient "traversal" support: when I first started writing rich domain models in the nineties (that is last century for all you whipper-snappers) the focus of design was on the actual entities, not the relationships between them. Sure, occasionally you had to model a relationship as an "entity", but for the most part you did your best to reduce your domain relationships to a combination of contains a(n) (aka belongs_to in ActiveRecord), contained by (aka has_one in ActiveRecord), one-to-many (aka has_many in ActiveRecord) and/or many-to-many (aka has_and_belongs_to_many in ActiveRecord), because you can't attach attributes to relationships in a relational database unless you turn the relationship into an entity. If you are interested in deep and/or rich object graphs, however, you have to pay the penalty in any RDBMS with multiple joins, which are expensive operations. Even with all the right indices defined and queries optimized, you will find that more than three levels of indirection (aka JOINs) take a while on even a medium sized dataset. So as application developers we are forced to add lazy loading and/or customized eager loading logic into our rich object domain layer for the cases where performance matters, which in my view pollutes business logic and makes the code less maintainable going forward. This is far from ideal. It is also not ideal that data set traversal patterns change with the introduction of new features, so often you will need to fork lazy loading logic in the business logic layer to satisfy your new requirements, adding a lot more complexity to manage in the application tier.
  • Maintaining and evolving relational schemas: modifying relational database schemas has always been a little tricky at best. Today I take advantage of ActiveRecord's "migrations", which loosely order (by creation timestamp) a set of relational schema modifications to run (and we wrote a simple extension to wrap them in a transaction to save our sanity - why wasn't that the default to begin with? anyway...). Even though there is a method to the madness now, it is still a little crazy, especially when it comes time to run these migrations on the production server. I always miss a heartbeat or two when I run the rake db:migrate RAILS_ENV=production command (yes, even after taking a snapshot).

It has been fun thinking out loud, but I have stuff to do today. Hey, I sometimes have a life, honest! :) So I have made an outline of what is to come in the next parts of this topic. I have already written some (bad) Java code to model same-sex and opposite-sex marriages consistently using a graph database (in this case Neo4J), but I plan on offering a Python snippet too, since I really can't stand the look of Java any more and its API doesn't make the case for graph databases to those coming from terser-feeling languages like Python, Ruby, Haskell, Erlang and Javascript (well, if you use sane APIs like jQuery at least).

What is coming in Part 2?

  • How can graph databases overcome relational database shortcomings?
  • Gay marriage: the graph database solution

What is coming in Part 3?

  • Poly amorous marriage using a graph database
  • Graph databases: problems to watch out for

How agile practices improve code review

Written on 9:27:00 PM by S. Potter

On a mailing list recently someone asked how to incorporate code reviews into an "agile methodology". While I had questions about the way the question was posed, it got me thinking about how agile practices make code review happen more often, more organically and more transparently. On agile projects, having different members of the team review code at different times during the development process is both explicit and implicit.

Values

Let us back up a little here. First off, agile is at its core a set of four ideals. Prefer:

  • individuals and interactions over processes and tools
  • working software over complete documentation
  • customer involvement over contract negotiation
  • adapting to changing needs over completing a stale plan

Principles/Tenets

From these ideals there are about 10 or so principles or tenets that are most commonly discussed among agile practitioners from the main camps. This is where it starts getting controversial, as some agile camps acknowledge some of these principles and others do not, so here are just a few that I want to talk about in relation to increasing code review on agile projects:

  • Collective team ownership of product: meaning everyone on the team owns the product and nobody on the team is exclusive to just one or a couple of components of the product.
  • Embracing change: instead of continuing down a set path determined long ago that no longer applies, the team needs to adapt to customer's changing requirements at set (and regular) intervals (these might be called iteration planning meetings/negotiations, for example). This has various implications throughout the iteration/sprint.
  • Verify expectations continuously: meaning whenever possible, without impeding development, verify that customer expectations will be satisfied and no new additions have negative impacts.
  • Customers set priorities: instead of technical staff or outsourcing companies specifying technical priorities, customers specify business/feature priorities.

Practices

This is the point where we start seeing the translation into agile practices, many of which you will recognize:

  • Pair programming which arose from both collective team ownership and embracing change tenets.
  • Test-first practice came about mostly to satisfy embracing change, but also verify expectations continuously
  • Assigning new stories independently of previous ones, so that no developer or pair has a niche component or product area; this satisfies the first tenet listed above.
  • Continuous integration that helps satisfy the verify expectations continuously principle.
  • No fear refactoring, which is only advisable if your test/spec suite has good enough coverage. This means that on agile projects you are more likely to revise poorly constructed code than on non-agile projects, and you have the chance to do it more often too.

Obviously with the pair programming practice that some agile methods promote, you have another set of eyes reviewing code as it is written. This catches errors closer to the time they were introduced, which is a big time (and money) saver.

Applying a test-first approach means that you start out defining your expectations of the code before writing the actual code. This is reviewing - perhaps not code in the traditional way, but requirements - which in my view is usually even more productive.

When a project manager assigns new stories to a developer or a pair without confining them to specific areas of the codebase, this increases the transparency of the codebase and exposes hidden dragons before too long - hopefully before they are launched into production (but you never know). Again, this increases the number of eyeballs on each part of the codebase.

With continuous integration you can incorporate tools that capture metrics of the code, so that parts of the code that are obviously in need of technical refactoring get highlighted earlier rather than later. This increases the transparency of the codebase, which is always a goal of code reviews.

Hopefully you can see where I am going with this, and will stop thinking about code reviews in the traditional top-down way, where every month or quarter a piece of code from each developer is reviewed in a formal way.

There are a number of other agile (and common) software practices that could be used; I can think of four more off the top of my head that improve codebase visibility across more of the team.

So Long, Farewell, Auf Wiedersehen, Goodbye...to Ruby & Rails (sort of)

Written on 6:37:00 PM by S. Potter

It's been obvious to many of my close geek comrades that for the last 18-22 months I have been getting more and more frustrated with the Ruby community and the "quality" of products in this space. The quality of people in the Ruby field has been tending towards the average PHP developer (PHP is a gateway drug for programming - i.e. not something to aspire to) and the recent Ruby mindset (as a whole, though some dark corners still look interesting) has not been about doing things differently, but rather blending in with mainstream ideas. For some that is exciting as it means not fighting management just to get to use Ruby in the enterprise. I understand, I really do. Fighting management over the technical platform is never fun.

I think it is safe to say that the major libraries and frameworks prominent in the Ruby landscape will suffer from very similar conceptual limitations as the Java, J2EE and JEE frameworks that DHH once ridiculed (for good reason). Of course, this is all to be expected in the usage lifecycle of a programming language. In fact, I believe I predicted in a 2006 blog post that I would get bored of Ruby within the next few years, though that doesn't make this any easier.

Thanks mates!

I wish all my friends in the Ruby and Rails worlds luck (I think you'll need it), and I thank the (ex-)coworkers and project collaborators who have taught me something new for all the intellectual stimulation you have provided. I am sad to be saying goodbye to you.

But...

I look forward to grazing in new pastures. I will be working on my next generation financial products using the right tool for each primary job. Currently most of the Finsignia Platform is quite back-end, and different pieces could take advantage of Haskell's concurrency and/or Erlang's distribution. Yet there will inevitably need to be (at some point) rich web applications for the UI, so (J)Ruby and Rails, I might come knocking on your door again, but for now, adiós.