Cyberfame

Breaking Down Queries From Graph Analysis Examples

This page explains in more detail how each of the example queries works, breaking them down line by line.

"Centrality in a Supply Chain of Organization" Breakdown

Full query:
MATCH (e:Entity)
WHERE e.name CONTAINS $owner + "/"
MATCH (e)-[r:HAS_DEPENDENCY]->(d:repo)
WITH d, COUNT(r) as ndeps
ORDER BY ndeps DESC
LIMIT 50
RETURN d.name as name, d.score as score, ndeps
Breakdown:
MATCH (e:Entity)
The MATCH clause binds every node that has the label "Entity" to the variable e. The colon syntax specifies the label. No property filter is applied at this point; narrowing happens in the WHERE clause that follows.
WHERE e.name CONTAINS $owner + "/"
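The WHERE clause keeps only the "Entity" nodes whose name property contains the value of $owner followed by a slash. For example, with $owner equal to "ansible", the query keeps every entity whose name contains "ansible/", which covers the entities of the "ansible" GitHub organization. Because CONTAINS is a substring test, a name such as "my-ansible/tool" would match as well.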
MATCH (e)-[r:HAS_DEPENDENCY]->(d:repo)
This line matches every node labeled "repo" that one of the selected "Entity" nodes points to through an outgoing relationship of type "HAS_DEPENDENCY". The dependency node is bound to d and the relationship to r.
WITH d, COUNT(r) as ndeps
The WITH clause passes results from the MATCH clauses on to the rest of the query. Here it passes each repo node (d) together with COUNT(r), aliased as ndeps. Because d is the only non-aggregated item, the rows are grouped by d, so ndeps is the number of matched HAS_DEPENDENCY relationships that point at that repo node. The relationship variable r itself is not carried forward.
ORDER BY ndeps DESC
This line is used to sort the repo nodes (d) based on the number of relationships (ndeps). The DESC keyword specifies that the sorting should be done in descending order.
LIMIT 50
This line specifies that the output should be limited to the top 50 repo nodes (d) based on the sorting criteria specified in the previous clause.
RETURN d.name as name, d.score as score, ndeps
This line specifies what data should be returned as output. It returns the name and score properties of each repo node (d), along with the ndeps count calculated in the WITH clause. The AS keyword aliases the output columns (d.name as name and d.score as score).
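To make the grouping behaviour concrete, here is a minimal Python sketch that mimics the query on a tiny in-memory edge list. The edge data and the helper name top_dependencies are hypothetical and exist only for illustration; this is not how Neo4j executes the query.

```python
from collections import Counter

# Hypothetical HAS_DEPENDENCY edges: (entity name, dependency repo name).
EDGES = [
    ("ansible/awx", "django/django"),
    ("ansible/awx", "psf/requests"),
    ("ansible/ansible", "psf/requests"),
    ("ansible/ansible", "yaml/pyyaml"),
    ("other/project", "psf/requests"),
]


def top_dependencies(edges, owner, limit=50):
    """Count dependency edges per repo for entities whose name contains owner + "/"."""
    needle = owner + "/"
    # WHERE e.name CONTAINS $owner + "/"  (substring test, like Cypher's CONTAINS)
    selected = [(entity, dep) for entity, dep in edges if needle in entity]
    # WITH d, COUNT(r) AS ndeps  (one group per distinct dependency)
    counts = Counter(dep for _, dep in selected)
    # ORDER BY ndeps DESC, then LIMIT
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


result = top_dependencies(EDGES, "ansible")
print(result)
```

The edge from "other/project" is dropped by the name filter, so "psf/requests" is counted twice rather than three times.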

"Centrality in Global Supply Chain" Breakdown

Full query:
MATCH ()-[r:HAS_DEPENDENCY]->(d:repo)
WITH d, COUNT(r) as ndeps
ORDER BY ndeps DESC
LIMIT 200
RETURN d.url as url, d.score as score, ndeps
Breakdown:
MATCH ()-[r:HAS_DEPENDENCY]->(d:repo)
This line matches every relationship of type "HAS_DEPENDENCY" that points from any node to a node labeled "repo". The empty parentheses mean the source node is neither restricted nor bound to a variable. The dependency node is bound to d and the relationship to r.
WITH d, COUNT(r) as ndeps
The WITH clause passes results from the MATCH clause on to the rest of the query. Here it passes each repo node (d) together with COUNT(r), aliased as ndeps. Because d is the only non-aggregated item, the rows are grouped by d, so ndeps is the number of HAS_DEPENDENCY relationships in the whole graph that point at that repo node. The relationship variable r itself is not carried forward.
ORDER BY ndeps DESC
This line is used to sort the repo nodes (d) based on the number of relationships (ndeps). The DESC keyword specifies that the sorting should be done in descending order.
LIMIT 200
This line specifies that the output should be limited to the top 200 repo nodes (d) based on the sorting criteria specified in the previous clause.
RETURN d.url as url, d.score as score, ndeps
This line specifies what data should be returned as output. It returns the url and score properties of each repo node (d), along with the ndeps count calculated in the WITH clause. The AS keyword aliases the output columns (d.url as url and d.score as score).

"PageRank Algorithm for a Global Supply Chain" Breakdown

Full query:
CALL gds.pageRank.stream('deps-graph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS d, score
RETURN d.name as name, score
ORDER BY score DESC
LIMIT 200
Breakdown:
CALL gds.pageRank.stream('deps-graph')
This line calls the PageRank algorithm from the Graph Data Science (GDS) library on the graph with the name 'deps-graph'. It operates in stream mode, meaning that the results are not written back to the graph but are streamed directly to the client.
YIELD nodeId, score
This line specifies the output of the PageRank algorithm, yielding the nodeId and the associated score for each node.
WITH gds.util.asNode(nodeId) AS d, score
This line converts the nodeId back to a node using the gds.util.asNode() function, and aliases it as d. It also carries forward the score.
RETURN d.name as name, score
This line selects the properties we want to return in the final result. We return the name property of the node d and its corresponding PageRank score.
ORDER BY score DESC
This line orders the results by the PageRank score in descending order, so the nodes with the highest scores appear first.
LIMIT 200
Finally, this line limits the number of returned results to 200, ensuring that only the top 200 nodes by PageRank score are included.
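For intuition about what the scores mean, the following Python sketch implements textbook PageRank by power iteration on a toy dependency graph. It is an illustration under stated assumptions, not the GDS implementation: the node names are hypothetical, the scores are normalized to sum to 1, and GDS may scale its scores differently.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration over directed (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small "teleport" share.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for source in nodes:
            targets = out_links[source]
            if targets:
                # Rank flows along outgoing edges, split evenly.
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing edges spreads its rank over all nodes.
                share = damping * rank[source] / len(nodes)
                for node in nodes:
                    new_rank[node] += share
        rank = new_rank
    return rank


# Hypothetical edges pointing from a dependent to its dependency.
EDGES = [("app1", "lib"), ("app2", "lib"), ("lib", "core")]
scores = pagerank(EDGES)
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Because rank flows along the direction of the edges, a node that many others depend on, directly or transitively, ends up near the top. In this toy graph "core" outranks "lib" even though only one edge points at it.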

"Isolating High-Risk Dependencies" Breakdown

Full query:
MATCH (e:Entity {url: $url})
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
WHERE d.score < 3
RETURN p
Breakdown:
MATCH (e:Entity {url: $url})
The MATCH clause specifies the starting node of the search, which is a node with a URL property that matches the value provided by the user. The colon syntax specifies that the node must have the label "Entity".
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
The OPTIONAL MATCH clause searches for a shortest path from the starting node to each reachable node labeled "repo". The built-in shortestPath function returns one shortest path per pair of start and end nodes. The "-[*..3]->" syntax allows relationships of any type, followed in the outgoing direction, with a path length of one to three relationships (hops). The path is bound to the variable p, which is returned in the results.
WHERE d.score < 3
This WHERE clause belongs to the OPTIONAL MATCH and filters on the ending node of the path: its "score" property must be less than 3. Because the match is optional, a starting node with no qualifying path still produces one row, with p set to null.
RETURN p
The RETURN clause specifies what should be returned in the results, which in this case is the shortest path found by the query, stored in the "p" variable.
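The hop limit and the score filter can be pictured with a breadth-first search. The Python sketch below is a simplified stand-in for the query, not Neo4j's algorithm: the graph, the scores, and the helper name risky_paths are hypothetical, and every node other than the start is treated as a "repo" node.

```python
from collections import deque

# Hypothetical dependency graph (adjacency lists) and scores.
GRAPH = {
    "app": ["lib-a", "lib-b"],
    "lib-a": ["lib-c"],
    "lib-b": [],
    "lib-c": ["lib-d"],
    "lib-d": ["lib-e"],
    "lib-e": [],
}
SCORES = {"app": 7.0, "lib-a": 6.5, "lib-b": 2.1,
          "lib-c": 2.8, "lib-d": 8.0, "lib-e": 1.0}


def risky_paths(graph, scores, start, max_hops=3, threshold=3.0):
    """Return one shortest path (at most max_hops edges) to each low-scoring node."""
    paths = {start: [start]}  # first path found by BFS is a shortest one
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(paths[node]) - 1 == max_hops:
            continue  # hop limit reached, do not expand further
        for neighbour in graph.get(node, []):
            if neighbour not in paths:
                paths[neighbour] = paths[node] + [neighbour]
                queue.append(neighbour)
    # Keep only end nodes whose score is below the threshold; a node
    # without a score is excluded, as a null comparison would be in Cypher.
    return [path for target, path in paths.items()
            if target != start and scores.get(target, float("inf")) < threshold]


result = risky_paths(GRAPH, SCORES, "app")
print(result)
```

Here "lib-e" has the lowest score but sits four hops away, so it is not reported. Raising the hop limit widens the search at the cost of more work.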

"Risk Assessment for GitHub Organization" Breakdown

Full query:
MATCH (e:Entity)
WHERE e.name CONTAINS $owner + "/"
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
WHERE d.score < 3
RETURN p
Breakdown:
MATCH (e:Entity)
The MATCH clause binds every node that has the label "Entity" to the variable e. The colon syntax specifies the label. No property filter is applied at this point; narrowing happens in the WHERE clause that follows.
WHERE e.name CONTAINS $owner + "/"
The WHERE clause keeps only the "Entity" nodes whose name property contains the value of $owner followed by a slash. For example, with $owner equal to "ansible", the query keeps every entity whose name contains "ansible/", which covers the entities of the "ansible" GitHub organization.
OPTIONAL MATCH p=shortestPath((e)-[*..3]->(d:repo))
The OPTIONAL MATCH clause searches, for each selected "Entity" node, for a shortest path to each reachable node labeled "repo". The built-in shortestPath function returns one shortest path per pair of start and end nodes. The "-[*..3]->" syntax allows relationships of any type, followed in the outgoing direction, with a path length of one to three relationships (hops). The path is bound to the variable p, which is returned in the results.
WHERE d.score < 3
This WHERE clause belongs to the OPTIONAL MATCH and filters on the ending node of the path: its "score" property must be less than 3. Because the match is optional, an entity with no qualifying path still produces one row, with p set to null.
RETURN p
The RETURN clause specifies what should be returned in the results, which in this case is the shortest path found by the query, stored in the "p" variable.

"Querying by a Score Component" Breakdown

Full query:
MATCH (r:repo|Entity)
WHERE r.score < 4 AND r.sc_vulnerabilities_score < 2
RETURN r.url, r.score, r.sc_vulnerabilities_score
ORDER BY r.sc_vulnerabilities_score ASC
LIMIT 1000
Breakdown:
MATCH (r:repo|Entity)
The query starts with a MATCH statement that specifies the nodes to match. Here, we match nodes that have the label "repo" or the label "Entity" and bind them to the variable r.
WHERE r.score < 4 AND r.sc_vulnerabilities_score < 2
The WHERE statement is used to filter the matched nodes based on certain conditions. In this case, we are filtering nodes where the overall score is less than 4 and the vulnerabilities score is less than 2.
RETURN r.url, r.score, r.sc_vulnerabilities_score
The RETURN statement specifies what data should be returned as a result of the query. In this case, we return three properties of each node bound to r: its url, its overall score, and its vulnerabilities sub-score.
ORDER BY r.sc_vulnerabilities_score ASC
The ORDER BY statement is used to sort the results by the vulnerabilities sub-score (sc_vulnerabilities_score property) in ascending order.
LIMIT 1000
With the LIMIT statement we are limiting the number of results returned to 1000.
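The same filter, sort, and limit steps can be sketched in Python over a list of dictionaries. The node data and the helper name low_scoring are hypothetical. One detail worth seeing in code: in Cypher a comparison against a missing property evaluates to null, and WHERE drops such rows, which the sketch imitates by skipping nodes that lack either property.

```python
# Hypothetical nodes represented as property dictionaries.
NODES = [
    {"url": "github.com/example/a", "score": 3.9, "sc_vulnerabilities_score": 1},
    {"url": "github.com/example/b", "score": 3.5, "sc_vulnerabilities_score": 0},
    {"url": "github.com/example/c", "score": 6.0, "sc_vulnerabilities_score": 0},
    {"url": "github.com/example/d", "score": 2.0},  # no vulnerabilities sub-score
]


def low_scoring(nodes, max_score=4, max_vuln=2, limit=1000):
    """Filter by overall score and vulnerabilities sub-score, sort, then limit."""
    rows = []
    for node in nodes:
        score = node.get("score")
        vuln = node.get("sc_vulnerabilities_score")
        # A missing property behaves like null in Cypher: the row is dropped.
        if score is None or vuln is None:
            continue
        if score < max_score and vuln < max_vuln:
            rows.append((node["url"], score, vuln))
    # ORDER BY r.sc_vulnerabilities_score ASC
    rows.sort(key=lambda row: row[2])
    return rows[:limit]


result = low_scoring(NODES)
print(result)
```

Node "d" has a low overall score but is still left out, because it carries no vulnerabilities sub-score to compare against.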

"Querying by a Score Component in a Specific Repository" Breakdown

Full query:
MATCH (e:Entity {url: "github.com/ansible/awx"})
MATCH (e)-[r*..3]->(d:repo)
WHERE d.score < 5 AND d.sc_vulnerabilities_score < 2
RETURN DISTINCT d.url, size(r) as distance, d.score, d.sc_vulnerabilities_score
ORDER BY distance ASC
LIMIT 200
Breakdown:
MATCH (e:Entity {url: "github.com/ansible/awx"})
This line matches the entity with the URL "github.com/ansible/awx". It creates a variable e to represent this specific entity in the query.
MATCH (e)-[r*..3]->(d:repo)
This line matches all repositories d reachable from the entity e by following between 1 and 3 outgoing relationships (hops) of any type. The *..3 sets the upper bound to 3, and the lower bound defaults to 1. The variable r is bound to the list of relationships along each matched path.
WHERE d.score < 5 AND d.sc_vulnerabilities_score < 2
This line filters the repositories based on two conditions: the repository's score must be less than 5 and its vulnerability score must be less than 2. Only repositories that meet these criteria will be included in the result.
RETURN DISTINCT d.url, size(r) as distance, d.score, d.sc_vulnerabilities_score
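This line returns the URL of each repository, the distance (or degree of dependency) between that dependency and our repository of interest (calculated as the size of the relationship list r), the repository's score, and its vulnerability score. DISTINCT removes duplicate rows, comparing all returned columns, so a repository that is reachable over paths of different lengths appears once for each distinct distance.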
ORDER BY distance ASC
This line sorts the results by the distance (degree of dependency) in ascending order, so that repositories with the shortest distance to our repository of interest appear first.
LIMIT 200
This line limits the result set to the first 200 rows that match the specified criteria. This can be useful for managing the amount of data returned by the query and improving performance.
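The interaction between variable-length matching and DISTINCT is easier to see on a small example. The Python sketch below enumerates every path of 1 to 3 hops from a start node and collects distinct (repository, distance) rows. The graph and the helper name reachable_with_distance are hypothetical, the score filter is left out for brevity, and an edge is identified by its endpoints, so parallel relationships are not modelled.

```python
# Hypothetical graph: "awx" depends on lib-b both directly and via lib-a.
GRAPH = {
    "awx": ["lib-a", "lib-b"],
    "lib-a": ["lib-b"],
    "lib-b": [],
}


def reachable_with_distance(graph, start, max_hops=3):
    """Collect distinct (node, hops) rows over all paths of 1..max_hops edges."""
    rows = set()  # a set gives the DISTINCT behaviour over whole rows

    def walk(node, used_edges):
        if len(used_edges) == max_hops:
            return
        for neighbour in graph.get(node, []):
            edge = (node, neighbour)
            if edge in used_edges:
                continue  # an edge is not reused within a single path
            rows.add((neighbour, len(used_edges) + 1))
            walk(neighbour, used_edges | {edge})

    walk(start, frozenset())
    # ORDER BY distance ASC (name as a tie-breaker to keep output stable)
    return sorted(rows, key=lambda row: (row[1], row[0]))


result = reachable_with_distance(GRAPH, "awx")
print(result)
```

In this toy graph "lib-b" shows up twice, once at distance 1 and once at distance 2, because the two rows differ in the distance column.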

"Degree Centrality in a Global Supply Chain" Breakdown

Full query:
CALL gds.degree.stream('deps-graph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS d, score
RETURN d.name as name, score
ORDER BY score DESC
LIMIT 200
Breakdown:
CALL gds.degree.stream('deps-graph')
This line calls the gds.degree.stream procedure from the Graph Data Science (GDS) library. The procedure computes the degree centrality of each node in the projected graph named 'deps-graph'. Degree centrality measures a node's importance by the number of relationships (edges) it has with other nodes. It is a simple yet effective way to find the most connected nodes in a network. Note that, by default, GDS counts only outgoing relationships; counting incoming ones requires the orientation setting REVERSE.
YIELD nodeId, score
The YIELD clause selects which columns of the procedure's output are made available to the rest of the query. In this case, they are nodeId (the internal ID of a node) and the degree centrality score of that node.
WITH gds.util.asNode(nodeId) AS d, score
This line uses the gds.util.asNode function to convert the internal nodeId back to an actual node object. This is necessary because the GDS library operates on internal node IDs, while we want to return the node's name property in the final result. The WITH clause passes the node object as d and its associated score to the next part of the query.
RETURN d.name as name, score
The RETURN statement is used to specify the final output of the query. In this case, we're returning the name property of the node object d (aliased as name) and the degree centrality score.
ORDER BY score DESC
This line sorts the returned results based on their degree centrality scores in descending order. This means that nodes with the highest degree centrality scores will appear at the top of the result list.
LIMIT 200
Finally, the LIMIT clause is used to restrict the number of results returned by the query. In this case, it limits the output to the top 200 nodes with the highest degree centrality scores.
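Degree centrality is simple enough to compute by hand, which makes the effect of edge direction easy to check. The Python sketch below counts outgoing or incoming edges per node on a hypothetical edge list. The orientation names NATURAL and REVERSE mirror the GDS configuration values; how 'deps-graph' itself was projected is not stated on this page, so treat the choice of direction as an assumption.

```python
from collections import Counter

# Hypothetical edges pointing from a dependent to its dependency.
EDGES = [("app1", "lib"), ("app2", "lib"), ("lib", "core")]


def degree_scores(edges, orientation="NATURAL"):
    """Count outgoing (NATURAL) or incoming (REVERSE) edges for every node."""
    nodes = {node for edge in edges for node in edge}
    counts = Counter()
    for source, target in edges:
        counts[source if orientation == "NATURAL" else target] += 1
    # Nodes without any counted edge get a score of 0.0.
    return {node: float(counts[node]) for node in sorted(nodes)}


outgoing = degree_scores(EDGES)
incoming = degree_scores(EDGES, orientation="REVERSE")
print(outgoing)
print(incoming)
```

With outgoing edges the score is the number of direct dependencies a node has; with incoming edges it is the number of direct dependents, which is usually the more interesting figure when looking for heavily used packages.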