Graph Walking in PostgreSQL for fun and possible profit
When faced with the task of correlating and deduping incoming feeds of data from a large number of sources my colleagues and I decided to treat it as a graph problem. Nodes represent pieces of data, edges represent relationships between nodes, and are weighted by the probability that we think the relationship is real. As more independent sources of data confirm an edge the associated probability is boosted. Given a starting node and a cutoff probability, a profile is the "graph neighborhood" enumerating all nodes reachable from the chosen starting point by a path above the probability cutoff. Because we built this ourselves in Postgres using probabilities and multiplicative weight aggregation was relatively easy. The walk through the graph neighborhood is accomplished via a recursive CTE that tracks nodes and max-seen probability for a path back to the starting point.
Structurally the graph is represented as
- a table of node types
- a table of nodes, with the payload stored as jsonb
- a table of edges, with references to the two related nodes
- a table of edge_sources, that tracks which data sources included an edge relationship
- a table of data sources
At peak the graph contained >1B nodes and ~8B edges gleaned from hundreds of data sources.