Graph Representation using Neo4j

4 min readMay 1, 2023

Introduction

Neo4j is a widely utilized database management system renowned for its capability to store and retrieve interconnected data in the form of graphs. By employing the property graph model, it efficiently represents and analyses intricate relationships among data elements. Diverging from traditional relational databases, Neo4j organizes data as nodes and relationships, facilitating the querying and examination of data patterns. The system encompasses Cypher, a query language that simplifies the extraction of insights from graph data. Additionally, it incorporates numerous plugins for graph visualization and data science tasks.

The present blog aims to provide a comprehensive overview of the outcomes derived from conducting diverse queries on a dataset comprising Twitch streamers utilizing Neo4j. Readers can effortlessly access the clustering section directly through the “Clustering Using GDS” link.

Setup

The dataset utilized in this study encompassed a substantial number of entities, consisting of over 150,000 nodes and 6.5 million edges. During the investigation, it was observed that employing Cypher’s LOAD_CSV function to process the data proved to be time-consuming. However, through iterative experimentation, the researchers identified that Neo4j’s admin-import terminal command enabled them to swiftly establish the network configuration in less than a minute.

Remarkably, the program exhibited exceptional performance by loading a network comprising 168,114 nodes and 6,797,557 edges in approximately 20 seconds, while ensuring the preservation of all node properties. Conversely, utilizing the LOAD_CSV approach to solely load the edges would consume more than 40 minutes.

It is noteworthy that the nodes_header.csv and edges_header.csv files accompanying the command contain essential information regarding the data types, in addition to the headers. This additional data-type specification is crucial for ensuring accurate data import, as the entries are considered as strings by default.

Cypher Queries

Neo4j’s query language, Cypher, is widely employed in data science applications due to its remarkable capability to handle intricate data connections and extract valuable insights from large datasets. Designed with a user-friendly and adaptable approach, Cypher empowers data scientists to construct complex queries that are easily comprehensible to others. Its versatility and ease of use make it a preferred choice for data scientists seeking to unlock the full potential of their data. The easiest Cypher query would be

match (n) return n

When executing this command, it will showcase the complete set of nodes within the network. Nevertheless, it is imperative to acknowledge that the Neo4j Browser imposes a constraint on the quantity of nodes that can be exhibited in a single query, albeit this restriction can be modified. Consequently, the maximum number of nodes visible on the screen at any given time will be below this threshold, and it is possible that not all nodes will be perceptible. The top 10 nodes (based on the number of connections) in this dataset may be retrieved by:

match (s)-[]->(t) return s.numeric_id, size(collect(t)) as connections order by connections desc limit 10

To set the criteria as the number of views, the Cypher command would be

match (n) return n.numeric_id, n.views as gamers order by n.views desc limit 10

Clustering, an essential process for organizing nodes with similar characteristics, involves grouping nodes based on specific criteria. Neo4j offers the Graph Data Science (GDS) plugin, equipped with a range of clustering techniques known as “Community Detection.” In our analysis of the dataset, we leveraged GDS’s Louvain community discovery algorithm to establish clusters. Through this approach, we successfully generated 19 distinct clusters.

To perform this clustering analysis, it is crucial to maintain the network in its original “graph” form. This can be achieved by executing the following command:

CALL gds.graph.project.cypher(
 ‘twitch’,
 ‘MATCH (n)
 RETURN
 id(n) AS id,
 n.views AS views’,
 ‘MATCH (n)-[]->(m) RETURN id(n) AS source, id(m) AS target’
 )
 YIELD
 graphName, nodeCount AS nodes, relationshipCount AS rels
 RETURN graphName, nodes, rels

This stores a graph with the name “twitch” and the supplied features in the current runtime.

Louvain clustering can be triggered by

call gds.louvain.write(‘twitch’, {writeProperty:’louvain’})

When executing this command, the Louvain clustering algorithm is employed, resulting in the generation of a node attribute labelled “Louvain” that represents the clustering outcome. To display the clusters individually, the following code can be utilized to convert the node property into a node label.

 match (n)
 call apoc.create.addLabels([id(n)], [toString(n.louvain)])
 yield node
 with node remove node.louvain return node

Graph Representation using Neo4j

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by ShreyaBose

No responses yet

More from ShreyaBose

Prim’s Minimum Spanning Tree

Prim’s Algorithm is a greedy algorithm which works on the approach of going for the best available option out of all. This algorithm uses…

Start Up EdAI: Learning is Fun

EdAI is an educational App working through AI having an automatic evaluation checker with additional features for better learning. The…

Recommended from Medium

100 days of Data Science: Day 54 — Visualizing Networks and Relationships

In today’s post, we’ll explore the basics of network visualization and learn how to create network graphs.

Towards Named Entity Disambiguation with Graph Embeddings

How to distill knowledge from biomedical text combining pre-trained language models with graph machine learning

Neo4j vs CosmosDB, when used as a graph database

Azure Cosmos DB is a multi-model database service that supports various data models, including key-value, column-family, document, and…

Neo4j as a Graph Database

Graph databases are an emerging way to store and query highly connected data, and Neo4j is one of the most popular graph databases…

Neo4j Graph DB

Graph Databases (e.g., Neo4j)

A quick introduction to Neo4j

Over the past years, graph databases have become increasingly popular. Many large companies like Facebook, LinkedIn and Twitter use them…