Breaking Down Queries From Graph Analysis Examples
This page explains in detail how each of the example queries works, breaking them down line by line.
Full query:
MATCH (e:Entity)
WHERE e.name CONTAINS $owner + "/"
MATCH (e)-[r:HAS_DEPENDENCY]->(d:repo)
WITH d, COUNT(r) as ndeps
ORDER BY ndeps DESC
LIMIT 50
RETURN d.name as name, d.score as score, ndeps
Breakdown:
MATCH (e:Entity)
The MATCH clause selects the starting nodes of the search. The colon syntax specifies that each node must have the label "Entity". No property filter is applied at this point, so every "Entity" node is a candidate.
WHERE e.name CONTAINS $owner + "/"
The WHERE clause filters which "Entity" nodes are included in this query: only those whose name contains the value of $owner followed by a slash. For example, with $owner equal to "ansible", the query includes all entities that belong to the "ansible" GitHub organization.
MATCH (e)-[r:HAS_DEPENDENCY]->(d:repo)
This line matches any node labeled "repo" that one of the chosen "Entity" nodes points to through a relationship of type "HAS_DEPENDENCY". It binds the dependency node to d and the relationship between the two nodes to r.
WITH d, COUNT(r) as ndeps
This line introduces a WITH clause, which passes data from the MATCH clause to the subsequent clauses. In this case it passes the repo nodes (d) together with an aggregate: the COUNT function counts the relationships (r) that point to each repo node (d), and the result is aliased as ndeps. The relationships themselves are not carried forward.
ORDER BY ndeps DESC
This line sorts the repo nodes (d) by the number of relationships (ndeps). The DESC keyword specifies descending order.
LIMIT 50
This line limits the output to the top 50 repo nodes (d), according to the sort order set by the previous clause.
RETURN d.name as name, d.score as score, ndeps
This line specifies what is returned as output: the name and score properties of each repo node (d), along with the ndeps count calculated in the WITH clause. The AS keyword aliases the output columns (d.name as name and d.score as score).
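To make the aggregation concrete, the query can be mimicked on a tiny in-memory edge list. This is only a sketch in plain Python: the entity and dependency names, the scores, and the helper top_dependencies are hypothetical sample data, not part of the real graph.

```python
from collections import Counter

# Hypothetical sample data: each (entity, dependency) pair stands in for one
# (e:Entity)-[:HAS_DEPENDENCY]->(d:repo) relationship.
edges = [
    ("ansible/awx", "example/lib-a"),
    ("ansible/awx", "example/lib-b"),
    ("ansible/ansible", "example/lib-b"),
    ("ansible/ansible", "example/lib-c"),
    ("other/project", "example/lib-b"),
]
# Hypothetical score property of each repo node.
scores = {"example/lib-a": 7.0, "example/lib-b": 8.0, "example/lib-c": 6.5}


def top_dependencies(edges, scores, owner, limit=50):
    prefix = owner + "/"
    # WHERE e.name CONTAINS $owner + "/", then COUNT(r) grouped by d
    counts = Counter(dep for entity, dep in edges if prefix in entity)
    # ORDER BY ndeps DESC, LIMIT 50
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:limit]
    # RETURN d.name AS name, d.score AS score, ndeps
    return [(name, scores.get(name), ndeps) for name, ndeps in ranked]


result = top_dependencies(edges, scores, "ansible")
for row in result:
    print(row)
```

The relationship from "other/project" is ignored because that entity does not pass the owner filter, so "example/lib-b" is counted twice rather than three times.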
Full query:
MATCH ()-[r:HAS_DEPENDENCY]->(d:repo)
WITH d, COUNT(r) as ndeps
ORDER BY ndeps DESC
LIMIT 200
RETURN d.url as url, d.score as score, ndeps
Breakdown:
MATCH ()-[r:HAS_DEPENDENCY]->(d:repo)
This line matches any node that has a relationship of type "HAS_DEPENDENCY" pointing to a node labeled "repo". The empty parentheses mean that the source node is neither filtered nor bound to a variable. The line binds the dependency node to d and the relationship between the two nodes to r.
WITH d, COUNT(r) as ndeps
This line introduces a WITH clause, which passes data from the MATCH clause to the subsequent clauses. In this case it passes the repo nodes (d) together with an aggregate: the COUNT function counts the relationships (r) that point to each repo node (d), and the result is aliased as ndeps. The relationships themselves are not carried forward.
ORDER BY ndeps DESC
This line sorts the repo nodes (d) by the number of relationships (ndeps). The DESC keyword specifies descending order.
LIMIT 200
This line limits the output to the top 200 repo nodes (d), according to the sort order set by the previous clause.
RETURN d.url as url, d.score as score, ndeps
This line specifies what is returned as output: the url and score properties of each repo node (d), along with the ndeps count calculated in the WITH clause. The AS keyword aliases the output columns (d.url as url and d.score as score).
Full query:
CALL gds.pageRank.stream('deps-graph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS d, score
RETURN d.name as name, score
ORDER BY score DESC
LIMIT 200
Breakdown:
CALL gds.pageRank.stream('deps-graph')
This line calls the PageRank algorithm from the Graph Data Science (GDS) library on the graph with the name 'deps-graph'. It operates in stream mode, meaning that the results are not written back to the graph but are streamed directly to the client.
YIELD nodeId, score
This line specifies the output of the PageRank algorithm, yielding the nodeId and the associated score for each node.
WITH gds.util.asNode(nodeId) AS d, score
This line converts the nodeId back to a node using the gds.util.asNode() function and aliases it as d. It also carries forward the score.
RETURN d.name as name, score
This line selects the values we want in the final result: the name property of the node d and its corresponding PageRank score.
ORDER BY score DESC
This line orders the results by the PageRank score in descending order, so the nodes with the highest scores appear first.
LIMIT 200
Finally, this line limits the number of returned results to 200, ensuring that only the top 200 nodes by PageRank score are included.
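For intuition about what the score measures, here is a minimal textbook PageRank in plain Python, run on a hypothetical four-node dependency graph. It is a sketch of the general idea only: the damping factor of 0.85, the 20 iterations, the normalization, and the handling of nodes without outgoing edges are assumptions of this example and may differ from what the GDS library does.

```python
# Hypothetical sample graph: each (source, target) pair means "source depends on target".
edges = [
    ("app-a", "lib-x"),
    ("app-b", "lib-x"),
    ("app-b", "lib-y"),
    ("lib-y", "lib-x"),
]


def pagerank(edges, damping=0.85, iterations=20):
    nodes = sorted({node for edge in edges for node in edge})
    count = len(nodes)
    out_degree = {node: 0 for node in nodes}
    for source, _ in edges:
        out_degree[source] += 1
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is shared evenly by all nodes.
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        # Every node passes an equal share of its rank along each outgoing edge.
        for source, target in edges:
            new_rank[target] += damping * rank[source] / out_degree[source]
        rank = new_rank
    return rank


ranks = pagerank(edges)
# ORDER BY score DESC, LIMIT 200
top = sorted(ranks.items(), key=lambda item: item[1], reverse=True)[:200]
for name, score in top:
    print(name, round(score, 4))
```

In this sample, "lib-x" ranks first because every other node depends on it, directly or through "lib-y".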
Full query:
MATCH (e:Entity {url: $url})
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
WHERE d.score < 3
RETURN p
Breakdown:
MATCH (e:Entity {url: $url})
The MATCH clause specifies the starting node of the search: a node whose url property matches the $url value provided by the user. The colon syntax specifies that the node must have the label "Entity".
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
The OPTIONAL MATCH clause searches for a shortest path from the starting node to each node labeled "repo" that it can reach. The built-in shortestPath function finds the shortest path between two nodes. The "-[*..3]->" syntax limits the path to at most three relationships (hops), counted from the "Entity" node to the ending "repo" node. The p variable stores each shortest path found, which is returned in the results.
WHERE d.score < 3
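The WHERE clause filters the ending node of the path on its "score" property, which must be less than 3 for the path to be returned. Because this WHERE belongs to the OPTIONAL MATCH, a starting node with no qualifying path still produces a row, with p set to null.
RETURN p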
The RETURN clause specifies what should be returned in the results, which in this case is the shortest path found by the query, stored in the "p" variable.
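The search can be pictured as a breadth-first traversal that stops after three hops. The sketch below mimics it in plain Python under stated assumptions: the graph, the scores, and the helper shortest_paths are hypothetical, every node other than the start is treated as a "repo" node, and a node without a score is excluded, as a null comparison would be in Cypher.

```python
from collections import deque

# Hypothetical sample graph: node -> nodes it points to.
graph = {
    "entity": ["repo-a", "repo-b"],
    "repo-a": ["repo-c"],
    "repo-b": ["repo-c"],
    "repo-c": ["repo-d"],
    "repo-d": ["repo-e"],
}
# Hypothetical score property of each repo node.
scores = {"repo-a": 6.0, "repo-b": 2.5, "repo-c": 1.0, "repo-d": 2.0, "repo-e": 0.5}


def shortest_paths(graph, scores, start, max_hops=3, max_score=3):
    # Breadth-first search visits nodes in order of distance, so the first
    # path recorded for a node is one of its shortest paths.
    paths = {start: [start]}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) - 1 == max_hops:
            continue  # the "*..3" bound: do not extend paths past three hops
        for neighbor in graph.get(node, []):
            if neighbor not in paths:
                paths[neighbor] = paths[node] + [neighbor]
                queue.append(neighbor)
    # WHERE d.score < 3 (nodes without a score never pass the filter)
    return {
        node: path
        for node, path in paths.items()
        if node != start and scores.get(node, float("inf")) < max_score
    }


found = shortest_paths(graph, scores, "entity")
for node, path in found.items():
    print(node, " -> ".join(path))
```

Here "repo-a" is dropped because its score is too high, and "repo-e" is never reached because it is four hops away from the start.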
Full query:
MATCH (e:Entity)
WHERE e.name CONTAINS $owner + "/"
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
WHERE d.score < 3
RETURN p
Breakdown:
MATCH (e:Entity)
The MATCH clause selects the starting nodes of the search. The colon syntax specifies that each node must have the label "Entity". No property filter is applied at this point, so every "Entity" node is a candidate.
WHERE e.name CONTAINS $owner + "/"
The WHERE clause filters which "Entity" nodes are included in this query: only those whose name contains the value of $owner followed by a slash. For example, with $owner equal to "ansible", the query includes all entities that belong to the "ansible" GitHub organization.
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
The OPTIONAL MATCH clause searches for a shortest path from each starting node to each node labeled "repo" that it can reach. The built-in shortestPath function finds the shortest path between two nodes. The "-[*..3]->" syntax limits the path to at most three relationships (hops), counted from the "Entity" node to the ending "repo" node. The p variable stores each shortest path found, which is returned in the results.
WHERE d.score < 3
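The WHERE clause filters the ending node of the path on its "score" property, which must be less than 3 for the path to be returned. Because this WHERE belongs to the OPTIONAL MATCH, a starting node with no qualifying path still produces a row, with p set to null.
RETURN p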
The RETURN clause specifies what should be returned in the results, which in this case is the shortest path found by the query, stored in the "p" variable.
Full query:
MATCH (r:repo|Entity)
WHERE r.score < 4 AND r.sc_vulnerabilities_score < 2
RETURN r.url, r.score, r.sc_vulnerabilities_score
ORDER BY r.sc_vulnerabilities_score ASC
LIMIT 1000
Breakdown:
MATCH (r:repo|Entity)
The query starts with a MATCH statement that specifies the nodes to match. Here, we match nodes with a label of "repo" or "Entity" and assign the matched nodes to the variable r.
WHERE r.score < 4 AND r.sc_vulnerabilities_score < 2
The WHERE statement filters the matched nodes. In this case, we keep only nodes whose overall score is less than 4 and whose vulnerabilities score is less than 2.
RETURN r.url, r.score, r.sc_vulnerabilities_score
The RETURN statement specifies what data is returned as the result of the query. In this case, we return the url, score, and sc_vulnerabilities_score properties of the nodes assigned to variable r.
ORDER BY r.sc_vulnerabilities_score ASC
The ORDER BY statement sorts the results by the vulnerabilities sub-score (the sc_vulnerabilities_score property) in ascending order.
LIMIT 1000
With the LIMIT statement, we limit the number of results returned to 1000.
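One detail worth seeing in action is how the filter treats nodes that lack one of the properties: in Cypher, comparing a missing (null) value never evaluates to true, so such nodes are dropped. The plain-Python sketch below reproduces the filter, sort, and limit on hypothetical sample records, with None standing in for a missing property.

```python
# Hypothetical sample nodes; None stands in for a property that is not set.
nodes = [
    {"url": "example.com/a", "score": 3.2, "sc_vulnerabilities_score": 0},
    {"url": "example.com/b", "score": 5.0, "sc_vulnerabilities_score": 1},
    {"url": "example.com/c", "score": 2.1, "sc_vulnerabilities_score": 1},
    {"url": "example.com/d", "score": 3.9, "sc_vulnerabilities_score": None},
]


def low_scoring(nodes, limit=1000):
    matched = [
        node
        for node in nodes
        # A missing value cannot satisfy "<", mirroring null handling in WHERE.
        if node["score"] is not None
        and node["sc_vulnerabilities_score"] is not None
        and node["score"] < 4
        and node["sc_vulnerabilities_score"] < 2
    ]
    # ORDER BY r.sc_vulnerabilities_score ASC
    matched.sort(key=lambda node: node["sc_vulnerabilities_score"])
    # LIMIT 1000, then RETURN r.url, r.score, r.sc_vulnerabilities_score
    return [
        (node["url"], node["score"], node["sc_vulnerabilities_score"])
        for node in matched[:limit]
    ]


rows = low_scoring(nodes)
for row in rows:
    print(row)
```

Only "example.com/a" and "example.com/c" are returned: "example.com/b" fails the overall score condition and "example.com/d" has no vulnerabilities score to compare.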
Full query:
MATCH (e:Entity {url: "github.com/ansible/awx"})
MATCH (e)-[r*..3]->(d:repo)
WHERE d.score < 5 AND d.sc_vulnerabilities_score < 2
RETURN DISTINCT d.url, size(r) as distance, d.score, d.sc_vulnerabilities_score
ORDER BY distance ASC
LIMIT 200
Breakdown:
MATCH (e:Entity {url: "github.com/ansible/awx"})
This line matches the entity with the URL "github.com/ansible/awx" and binds it to the variable e, which represents this specific entity in the rest of the query.
MATCH (e)-[r*..3]->(d:repo)
This line matches all repositories d connected to the entity e through a variable-length path with a maximum length of 3 relationships (hops). The variable r holds the list of relationships along the path. The *..3 denotes the range of path lengths from 1 to 3.
WHERE d.score < 5 AND d.sc_vulnerabilities_score < 2
This line filters the repositories based on two conditions: the repository's score must be less than 5 and its vulnerability score must be less than 2. Only repositories that meet these criteria will be included in the result.
RETURN DISTINCT d.url, size(r) as distance, d.score, d.sc_vulnerabilities_score
This line returns the URL of each repository, the distance (or degree of dependency) between that dependency and our repository of interest (calculated as the size of the relationship list r), the repository's score, and its vulnerability score. DISTINCT removes duplicate rows, so a repository that is reachable at several different distances appears once per distance.
ORDER BY distance ASC
This line sorts the results by the distance (degree of dependency) in ascending order, so that repositories with the shortest distance to our repository of interest appear first.
LIMIT 200
This line limits the result set to the first 200 repositories that match the specified criteria. This can be useful for managing the amount of data returned in the query and improving performance.
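Because DISTINCT applies to the whole row, the same repository can show up with more than one distance. The plain-Python sketch below enumerates paths of one to three hops on a hypothetical graph to show this. The score filter is left out for brevity, the node names other than the starting URL are made up, and the helper assumes at most one relationship between any two nodes.

```python
# Hypothetical sample graph: node -> nodes it depends on.
graph = {
    "github.com/ansible/awx": ["repo-a", "repo-b"],
    "repo-a": ["repo-b"],
    "repo-b": ["repo-c"],
    "repo-c": ["repo-d"],
}


def dependency_distances(graph, start, max_hops=3):
    rows = set()  # a set plays the role of DISTINCT over (url, distance) rows

    def walk(node, used_edges):
        if len(used_edges) == max_hops:
            return  # the "*..3" bound
        for neighbor in graph.get(node, []):
            edge = (node, neighbor)
            if edge in used_edges:
                continue  # a relationship is not reused within one path
            path_edges = used_edges + [edge]
            rows.add((neighbor, len(path_edges)))  # size(r) AS distance
            walk(neighbor, path_edges)

    walk(start, [])
    # ORDER BY distance ASC (the name only makes the order deterministic here)
    return sorted(rows, key=lambda row: (row[1], row[0]))


distances = dependency_distances(graph, "github.com/ansible/awx")
for url, distance in distances:
    print(url, distance)
```

In this sample, "repo-b" is listed at distance 1 and again at distance 2, because it is both a direct dependency and a dependency of "repo-a".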
Full query:
CALL gds.degree.stream('deps-graph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS d, score
RETURN d.name as name, score
ORDER BY score DESC
LIMIT 200
Breakdown:
CALL gds.degree.stream('deps-graph')
This line calls the gds.degree.stream procedure from the Graph Data Science (GDS) library. The procedure computes the degree centrality of each node in the graph with the name 'deps-graph'. Degree centrality measures a node's importance by the number of connections (edges) it has with other nodes. It is a simple yet effective way to find the most connected nodes in a network.
YIELD nodeId, score
The YIELD statement selects which results of the gds.degree.stream procedure are passed on. In this case, these are the nodeId (the internal ID of a node) and the degree centrality score for each node.
WITH gds.util.asNode(nodeId) AS d, score
This line uses the gds.util.asNode function to convert the internal nodeId back to an actual node object. This is necessary because the GDS library operates on internal node IDs, while we want to return the node's name property in the final result. The WITH clause passes the node object as d and its associated score to the next part of the query.
RETURN d.name as name, score
The RETURN statement specifies the final output of the query. In this case, we return the name property of the node object d (aliased as name) and the degree centrality score.
ORDER BY score DESC
This line sorts the returned results based on their degree centrality scores in descending order. This means that nodes with the highest degree centrality scores will appear at the top of the result list.
LIMIT 200
Finally, the LIMIT clause restricts the number of results returned by the query. In this case, it limits the output to the top 200 nodes with the highest degree centrality scores.
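As a rough illustration of the score itself, the sketch below counts relationships per node in plain Python. It assumes that the degree is the number of outgoing relationships of each node. Which direction is actually counted depends on how 'deps-graph' was projected and how the procedure is configured, so treat that as an assumption of this example. The edge list and the helper are hypothetical.

```python
# Hypothetical sample graph: each (source, target) pair is one relationship.
edges = [
    ("app-a", "lib-x"),
    ("app-a", "lib-y"),
    ("app-b", "lib-x"),
    ("lib-y", "lib-x"),
]


def out_degree_ranking(edges, limit=200):
    # Every node gets a score, including nodes with no outgoing relationships.
    degree = {node: 0 for edge in edges for node in edge}
    for source, _ in edges:
        degree[source] += 1  # assumption: only outgoing relationships count
    # ORDER BY score DESC (ties broken by name for a stable order), LIMIT 200
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:limit]


ranking = out_degree_ranking(edges)
for name, score in ranking:
    print(name, score)
```

Under this assumption "app-a" ranks first with two outgoing relationships, while "lib-x", which only receives relationships, scores zero.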