Without data, you are just a single person with an opinion

Data and KnowledgeWithout data, you are just a single person with an opinion - W. Edwards Deming. Have you thought which stars are closer to the Sun than others? Or how many galaxies or superstars in the Milky Way? There is a close relationship between data collections by connecting close pairs of points by edges as a graph. Interpreting questions and examples like this as points in space offers a way to think about how stars could link with each other connect like a social network. Stars are just an analogy of entities and relationships for knowledge structuring.

The Sun is about one-third the distance from the Milky Way galaxy's center to its outer edges. It is in a smaller spiral arm, between two broad components, called the Orion Arm. This chapter will cover an overview of graph application, what it is, and why it is gaining momentum. Companies like Google invented the concept of knowledge base used by services to improve its search engine results with information gathered from various sources. We will discuss infrastructure for knowledge graphs (KGs) and optimal ways to store triples in them. A triple is organized as a subject, the predicate, and its object. The subject and object are entities participating in a relationship defined by the predicate. "The Milky Way located in Orion Arm," and we could split triple like this:

Subject: Milky Way, Predicate is located, Object: Orion Arm.

We will also learn how to represent nodes' attributes as node labels and the relationship between objects as edges. On top of this, more processing with NLP (natural language processing) techniques, machine learning (ML), and graph theory (GT). We will explore the tools and skills required to build a wide variety of powerful NLP apps to help computers understand humans. Furthermore, creating knowledge base systems using RDF data with non-relational and relational databases will also be covered. You will get a formal introduction to ontology and differentiate between storing and relating information in relational and knowledge databases. This chapter will introduce the types and algorithms in data science. Why are graph algorithms the analysis of real-world systems from predicting the spread of the COVID-19 Coronavirus pandemic? This chapter is about analytics, architecture, research, and engineering for real-life applications. Most computer scientists are exposed to graph theory through their courses in discrete structures or algorithms. Understand how we can learn from data with ontology on Tiger Graph. Next, we will cover the following topics: Introduction to graphs, Graph theory, Basic graph algorithms: centrality, pathfinding, page rank, edges and relationships, Representation of graph data, and Ontology in different data science fields. Technical Requirements: To follow the examples in this chapter, you need to have the community edition of Visual Studio 2019. You should also have .NET Core 3.1 SDK or higher to run the examples. You should also have .NET Core 3.1 SDK or higher to run:

https://github.com/PacktPublishing/Designing-and-Scaling-Graph-Applications-/tree/master/ch1

A graph is a data structure like a bag of entities in pairs of related nodes. Graphs have been all over for centuries, so this topic is relevant now more than ever before because when the information was in the knowledge base, it did not much need to glance at it more deeply and see the full picture with connections. The shift in a technological breakthrough in AI space bases on the most efficient ways to store data in the last decade cheaply, but I strongly feel that advancements are possible with graph usage as a next improvement wave.

The cleaning and transforming tasks are solved with the cloud, but now it is more about getting insights and value out of the data by looking at the higher-level picture and context. Graphs are not merely valuable as structured knowledge sources: they play a significant role in modern machine learning. As data becomes increasingly interconnected and systems sophisticated, it is essential to use our data's fruitful and evolving relationships. MarketsandMarkets estimates the conversational AI market size will improve from $4B in 2019 to USD 16B by 2025, at an annual growth of 30%. MarketsandMarkets Research Pvt. Ltd. MarketsandMarkets is a startup that is ten years old and raised more than $56M for different trends and global market research, according to Crunchbase.

Have you considered how the bot or assistant industry is building them? The sector of assistances and bots is a platform that includes conversational modules and templates needed to generate answers from the knowledge make on top of personalized conversations and analytics from the data. When we start a discussion with the bot framework, we give it more and more context with every phrase we use, and it is building such contexts over time with the history for each user. The voice assistants like Alexa, Cortana, and bots fed with the graph's expertise to answer natural queries. The more knowledge connections are related to the question, the better and engaged in meaningful dialogs by learning with them over time.

The popularity of more complex queries increased in search engines. With more and more complexity, Google has figured users and how they use the search engine. The company acquired online database provider Metaweb for about $1 billion. The acquisition was a surprise because of the price for everyone. Nobody realized how important this asset will play later and why Google wanted to pay for Freebase's knowledge base. It validated the substantial collaborative value of knowledge sources based on data compiled by its society supporters. Furthermore, it was an online compilation of structured data gathered from numerous sources, including personal, user-submitted wiki contributions.

In December 2014, the Freebase team publicly declared the internet site. Google provided an update in December 2015 to discontinue the Freebase API and widget three months after a suggested widget replacement was launched in early 2016. Data is gradually becoming a strategic asset in the world. For example, LinkedIn and GitHub acquisitions by Microsoft gave measurement to the value of data that solves complex problems. LinkedIn's price was about $26 Billion on an estimated $1 Billion in revenue, and the GitHub acquisition set the price at $7.8 Billion on an estimated $300 Million in revenue. Both acquired corporations are own networks with 27 times multiplier on the amount. Today, Google has all email accounts from Gmail's, YouTube videos, and queries from searches.

On the other hand, Amazon and FedEx have global supply chains, economies of scale, and transportation logistics. Verizon and AT&T are building one of the world's most massive knowledge bases about telecommunication stack. Netflix, Hulu, and HBO are modeling entertainment graph content, so you probably see how it is all going, right? Such examples show you the competitive advantage by leveraging graphs as a robust intellectual property of the company. Another great point to know about the intelligence behind graphs and knowledge is the semantic web has a power-law distribution. For example, 20% of the people will connect to all the remaining 80% of social networks. In such cases, if we want to keep cache for our data, it is smart to include the most popular nodes in it for fast lookup, and positively related nodes with Power Law and "80/20 rule" will be self-controlled 80% of the fortune. If you are in connection with top entities, you will be rated higher in search of the social network, or if your website has links from top sites, the same applies.

The graph's entities are nodes or vertices, and the pointers between them are relationships or edges. If the data is sparse or dense could be known from the relationship to node ratio. A path sandwiched between two nodes in the graph tells if it is connected or disconnected. Most of the open-source graph systems are using edges and relationships organized in the Triples format.

DBpedia is a KG first available in 2007 and based on infoboxes from Wikipedia pages. DBpedia Ontology (specified in Web Ontology Language - OWL). The October 2016 announcement contains 13B RDF triples, and ontology contains ~760 classes and ~3000 relationships or so-called predicates. Triple is usually a subject, object, and predicate, but most of the time, it is not enough, and we must have more properties for context. Below are a few different explanations on graph types: if there are domain-specific values on connections or nodes, it is weighted versus an unweighted graph. If we could explain the class by looking at paths, start and end at the identical node, we could decide if they could be cyclic versus acyclic.

Any knowledge could represent itself with a triple. Your team should figure out how to map it to the ontology with a particular resource description or manual mapping. Each triple in the graph has a subject, predict and object, but sometimes it is more contexts and different internal IDs. Entities in the KG are related to others over edges. Some of the direction matters a lot with directed graphs. In our case, we will be using the knowledge base and knowledge graph triples as the same. When I look at the latest discussions on the conferences, I see many learning advances with knowledge graphs. It is called KRL or knowledge graph embeddings (KGE) by mappings for each graph entity. For instance, they are tasks like classification or entity recognition, but also semantic knowledge extraction. Such technology is used for recommendation engines. There also question answering jobs in bots, assistants, or AI search algorithms. Flavors and types of graphs are more comfortable to grasp with visualizations. The challenge to understand types of monopartite, bipartite, and k-Partite graph data structure types. Bipartite is non-standard because data contains multiple nodes and relationship types, so the monopartite graph is the most common. It is unusual because most of the graphs have only one class – monopartite. Our next section will cover graph networks' basics and find the length of edges between them. We will encompass the history of structures, distances, and popular weight calculations in the following area.

To avoid confusion, we will interchangeably call the graph's node or object as an entity and vertex or relationship as an edge. Graph networks positively correlated with Network Science. Furthermore, network science is an academic field passionately rooted in graph theory concerned with mathematical models of the connections.

Shortest Path

The main difference between path and length is that path is a sequence of edges. A length is the number of edges in the course. But the distance is quite close to this because it is the distance between two edges with the shortest route. When you think of distances in the math field's data structures, the typical length among two vertices in several edges in the shortest path between them is called a geodesic. The quickest route is common interview questions in big corporations like Google, Facebook, ad Amazon. The issue is to find a way for the two vertices or nodes in the graph to connect with the smallest sum of weight edges. The concept of the path or distance is basic in math. It is mostly needed to determine how far objects are apart, as you could see on the diagram on the left side where we have no available walks from a to b.

We could define d(a, b) = ∞ because there is no solution to the problem. On the other hand, the graph on the right, which has weights as suggested in the distance calculations, illustrates a more terrible pain. There is a cycle inside x,y,z,x, and a negative value on the x to z, so there is no minimum weight path from a to b anymore. At the same time, no well-defined distance d(a, b). If you are inside Amazon Prime and want to send a shipping container from a to b, you could certainly be negative if we have to bring the empty box back. It is a cool scenario in realtime.

Matrix for words using the cost of substitution as one and the cost of deletion or insertion as 0.5

Another concept to pick up is to measure similarity in the graphs in terms of pattern analysis. It is also called graph edit distance (GED). GED could calculate a minimum number of edit operations to transform graphs from one to another. This algorithm will include inserts and deletes of vertexes and edges but also label changes.  The last but not least example is Levenshtein distance.

It is used in IT and linguistics. Think about all the spell checkers systems as an example that is used inside the search engines. It is intelligent to understand if users mistyped something in the query and provide the relevant result on the most common typos in terms of the keys' keyboard location. Suppose some keyboard keys are closer to each other. In that case, it is more probable that when we calculated Levenshtein distance in terms of discrepancy among two sequences of elements, it is a typo by the user. Such distance was named after mathematician Vladimir Levenshtein, who invented it in 1965.

We identified a distance between two sequences based on the sum of tasks needed to switch between rows. The alignment we developed between the two graphs reveals insertions, deletions, and similarities of different vertices' edges. By giving the weight of each kind, we can sum them to build the distance. We utilize the same weights we defined to compute the alignment between two paths from both graphs.

Most Popular Graph Edit distances

The most useful algorithm for computing this is an A*-based algorithm, and there are other sub-optimal procedures. There are multiple options to weight distance with minimum cost path or the shortest weighted path. The shortest weighted paths are prevalent optimization problems in computer science. These issues tend to be multifaceted, complex optimization problems because they try to combine more than one source of information into a cost metric for minimization.

Graphs take many forms, like random networks with no hierarchies. Computer architecture defines the physical and consistent design of the software, hardware, and data transmission. Indeed, the Internet is a significant graph with nodes connected. We covered high-level details on the distances, and one of the approaches is super essential to explain for NLP and graph-related projects. It is called the Cosine similarity.

If you are in the NLP space, you probably familiar with the formula above. The idea is a simple calculation of the vectors' dot product divided by the development of the vector's length. If you like geometry explanation more, it is the angle between n-dimensional vectors. There are many other popular and useful distance algorithms like Jaccard, Pearson, Euclidean Distance, Approximate Nearest Neighbors (ANN), and Overlap similarity that we will cover in the next chapters of the book.

You might be wondering why we have so many distances and where to use cosine similarity? The answer to this question is text or sensor data. The properties of the instances make so that the weights might be larger without meaning anything different. Sensor values captured in various lengths in time between models are great for cosine similarity too.

Graph theory

This part of the book will discuss a few significant examples of how graphs and networks are related to each other. We will cover simple examples of NP-complete problem types and critical graph algorithms to know. There is a whole division of mathematics called graph theory, which is committed to studying graph structures. Network theory examines natural graphs or graph structures in the real world around us and within different disciplines. You might ask how network theory is other than graph theory?

A graph is a hypothetical structure that does not occur in the real. Any real-world data picture reduces to a graph, but it is not a graph. A network topology decreases to a graph, and then algorithms like Dijkstra's and Kruskal's can be applied to it for numerous purposes like routing. All real-world situations are subsets of graph challenges (of that specific and unique domain).

Most problems of networking, particularly those concerning routing, reduce to a graph-based problem. In other words, graph theorists are intent on arbitrary inquiries regarding graphs. In contrast, network theorists are more involved in queries appropriate to the circumstances modeled by networks. A significant question in machine learning is to determine what a model can and cannot learn. A famous graph coloring NP-complete problem could be solved with the Greedy Algorithm to assign colors. The graph coloring problem has many applications: mobile radio frequency assignment, Sudoku, Register Allocation in CPU, Bipartite Graphs, Map Coloring, Making Schedule, or Timetable.

Perfectly Colored Petersen Graph

There is a prototype on our GitHub code base, and please check it out. It is a simple version of using a coloring algorithm on school scheduling tasks: https://github.com/PacktPublishing/Designing-and-Scaling-Graph-Applications-/tree/master/ch1/GraphColoring. Graph coloring is still an energetic field of research. A graph coloring is an assignment of tags, called colors, to a graph's vertices such that no two adjacent vertices share an identical color. Petersen graph is an undirected graph with ten vertices and 15 edges. The algorithm does not guarantee to use of minimum colors, but it is ensuring a higher bound of the number of colors, and it described better here: http://mrsleblancsmath.pbworks.com/w/file/fetch/46119304/vertex%20coloring%20algorithm.pdf

Other examples of the most common interview and industry-used algorithms that you must know are A*, Topological Sorting, Single Source Shortest Path, Random Walk, but of course, the most common are breadth-first search (BFS) and depth-first search (DFS). 

Centrality Algorithm

Centrality is understanding which nodes are more important in the network. Most of the case's importance could mean different things like a social network bridge between distinct groups. Suppose you want to reach out to too many people on Instagram with advertisements. In that case, you probably need to find influencers in a network first, and you might consider leveraging multiple algorithms like closeness centrality, betweenness centrality, and PageRank.

Community Detection Algorithm

Community detection uncovers hubs and hierarchies. Virtual nodes could help find dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups. Centrality Degree is the simplest of the algorithms that calculate the number of relationships divided by the total number of nodes, and high degree nodes could skew it.

Closeness centrality is a way of spotting nodes that can spread info efficiently through a subgraph. It is mostly handy only for connected graphs because, in the case so unconnected, it is an infinite distance between two nodes where there is no path between them. Pioneriya is a communistic term based on the meaning to be a pioneer to be the first and the best. The collective name of children's communist organizations exists in different countries, including post-Soviet Russia. Welcome to the bright future of Communism people.

If you want to build a small graph of pioneers' society, you will want to find the easiest way to brainwash all these people. You probably wish also to make it as fast as you can. You will need to detect the right persons to start. We hope graph algorithms can help me find the most "powerful" persons and turn them to follow Pioneer Organization. In the first iteration, we turn these early people to be pioneers. Every next iteration, all friends of this person also come to be pioneers and so on, until every person becomes pioneers -communism rails.

 

Pathfinding Algorithm

Pathfinding is fundamental in graph theory. It is one of the most frequent tasks performed with graph algorithms and is a precursor for several other analysis types. Pathfinding search processes explore a graph either for general discovery or detailed search. These algorithms slice paths through the graph, but there is no expectation that they are computationally best possible.

Furthermore, there are many more algorithms that we will not be able to cover in this chapter, but you could look them up because they could apply to machine learning problems that you could solve with GCN, RPN, RDN, PRM, and other networks.

https://www.geeksforgeeks.org/graph-data-structure-and-algorithms/

Minimum Weight Spanning Tree

The Minimum Weight Spanning Tree (MST) begins from a given node and finds all its accessible nodes and relationships that link the nodes with the least available weight. Prim's algorithm is only of the most straightforward and best-recognized minimum spanning-tree procedures. We will start from the smallest weight edge to find the MST and keep selecting edges that do not have form any circuit or loop with the previously selected edges.

Kruskal's Minimum Spanning Tree Algorithm usually sorts all edges by weight and picks the smallest edge first. After this, it checks if the spanning tree has a cycle so far and if the cycle is not formed, include this edge, and else it will discard it. It is a repeatable step until there are V-1 edges in the spanning tree.

Kruskal's algorithm finds spanning minimum forest

Last but not least, the popular ML algorithm K-Means can detect clusters in the graph. K-means is an unsupervised model, so the data is unlabeled. But the model mathematically allocates each data point to a cluster. K-Means is a great place to start, and the first unsupervised method I recommend learning.

K-Means is used for signal processing, but now it uses a linear classifier for NLP tasks in entity recognition or computer vision. The algorithm is using squared Euclidean distances.

Breadth-First Search

BFS is so popular because it is used in Shortest Path, Connected Components, and Closeness Centrality algorithms and could find the shortest path between nodes. It starts from a selected node and discovers all its neighbors at one hop away before visiting all the neighbors at two jumps out, and so on. This algorithm is helpful for searching when the probability of finding the node searched for declines with distance.

There are numerous termination conditions supported for the traversal, constructed on either reaching one of the multiple target nodes, reaching a maximum depth, exhausting a given budget of traversed relationship cost, or just crossing the whole graph. The output of the procedure includes information about which nodes were visited and in what order.

Visualization of architecture in BFS and DFS

BFS Cons: Each level of nodes stores before creating the next one. Hence it uses more memory. If the solution is far away from the root, BFS will consume more time.

Depth First Search

There is a distinction with the DFS algorithm that starts from one of the neighbors and crosses as far as possible. Pathfinding algorithms are useful for identifying the way that our data is connected. If you want to find an objective node at a considerable distance and explore a random path, you will have a decent victory probability. There are multiple closure conditions supported for the traversal, either reaching one of several target nodes, reaching a maximum depth, exhausting a given budget of traversed relationship cost, or just traveling the entire graph. The output of the procedure includes information about which nodes were visited and in what order. The algorithm has less time and space complexity. Topological implements with a DFS or BFS. DFS cons: cannot check duplicate nodes or guarantee to find a solution. Also, it cannot find a minimal solution if two solutions are available. Complexity depends on the number of paths.

Example of (DFS) - topological sort algorithm

Page Rank

Page Rank (popularized by Google) shows how the current node's importance is calculated from its linked neighbors. The algorithm finds the most powerful features for machine learning extraction and ranking text for entity relevance in natural language processing. Let us say you are looking for the COVID-19 vaccine, and you want to pick the highest overall impact score of biologic function; it may not be the most connected one. It could be a gene with the most relationships with other, more significant attributes.

Once we have extracted relevant features for the machine learning model, we can improve our training using graph algorithms like PageRank. That is necessary to prioritize the elements with the most influence. Prioritization will enable us to adequately represent our data while eliminating noisy variables that could degrade results or slow processing – feature reduction techniques.

Weighted Page Rank avoiding favor of older pages against new

When we use the page rank algorithm, we must consider structural and time information. We also must deal with the dynamism of data and alleviate the bias through decayed time-weighted impacts. The PageRank algorithm calculates each node's importance within the graph based on the number of external relationships and the corresponding source nodes' importance. The underlying assumption, approximately speaking, is that a page is only as relevant as the pages that link to it.

Due to the exponential expansion of possible paths with ever-increasing distance, numerous of the approaches also have astronomical algorithmic complexity. Mercifully, optimized algorithms exploit specific graph structures, memorize previously explored parts, and parallelize strategies. We will cover such optimizations and other techniques in the book's next chapters, but for now, let us focus on machine learning and significant data graph relationships.

Graphs and big data

Let us say you want to find how to do machine learning predictions using graph algorithms. You probably will want to leverage connected feature extraction and its use in predicting relationships, but you might need to use contextual graph data for better predictions. Machine learning (ML) training involves tons of data to a model and enables learning how to process it.

But why we need "a lot of data" in this case, and how much is enough to call it big data? Recent successes in ML predictions happen because parallel computes the cloud's power and practicality of running massive computations even on small user desktop machines with GPU and more scalable FPGA semiconductors. To increase ML accuracy while also making solutions more broadly applicable, we need to incorporate a lot of contextual information to make this decision yourself. You learn on top of contextual facts surrounding the situation. People estimate missing information and determine how to apply lessons to new jobs where context improves predictions. AI is good at specific, well-defined tasks but struggles with ambiguity and so we could use graph-enhanced ML to fill missing context.

Unfortunately, many ML techniques miss context and are built on top of tuples, leaving out many predictive relationships and network data. It is hard to access and process context even if available. One feature extraction and selection tool is distilling big data and attributes down to a set of representative descriptive characteristics. Feature selection determines the subset of extracted features that are most important or influential to a target goal. For example, if we have 60 features and 10 of them together explain 89% of what we need to predict, we can eliminate the other 50 components.  According to the Gartner research agency, technology is changing fast, and trends point us to the graphs and analytics we could build. We will align essential techniques on how Graph Analytics makes today and explain things like embeddings and graph distances in scale in the next feature sections.

Gartner Research predicts that Graph Analytics is one of the Top 10 Trends

One of the common tricks in performance optimizations is to use cache in memory. But it is sometimes more expensive to have enough memory to get everything required in it. To do a graph walk and boost speed on the graph, we must run multiple computers with distributed memory cache. Getting RAM from the CPU is slower than reading from the cache. It is much more efficient and probably 200 times faster. When graphs are stored as memory pointers referencing each other, it is evident that each node is in a unique memory location. It is hard for a computer CPU to crunch this data from one node to another because it must jump across a memory from one node to another.

Random Walks on Graph: https://github.com/VHRanger/nodevectors/tree/master/examples

We must move around our memory as little as possible, so memory access is no longer a traffic jam. With solutions like CSR matrix, each edge has weight, and with an array of such edges pointing index of memory where values are, we could structure data much better. For example, in the Facebook graph, we have roughly 260 million nodes and 16 billion edges. At a minimum, it is 132 gigabytes of RAM to fit all of them, so we have to store the data in a streaming format in separate files on an SSD drive with index and pointer to the memory location other arrays to read them directly.

Suppose we want to do more complex calculations like machine learning on such a graph. Each column embodies a node and feeds it to the model, but we don't need to do this if you use embeddings.

Add comment