This blog summarizes the data graphs that are frequently used in isomorphism-based subgraph matching algorithm/system research. The surveyed literature is listed in the “Reference” section.
Problem Definition of Isomorphism-based Subgraph Matching
Given a single large data graph $D$ and a small query graph $q$, the target of subgraph isomorphism is to find all subgraphs of $D$ that are isomorphic to $q$.
The data graph and query graph may have labels attached to the vertices and/or edges. In those cases, a subgraph of $D$ matches the query graph $q$ if and only if they are isomorphic and their corresponding vertex/edge labels match.
Classification of Graph Datasets
The graph datasets that are used as data graphs in research can be grouped into two groups: real-world graphs and synthetic graphs.
Real-world Graphs
The statistics and description of the real-world graph datasets are summarized in the following table. Each graph dataset is given a unified code name.
Note: The statistics of each graph are retrieved from the literature. Their precision values depend on how to process the dataset files. Since different authors may process the original dataset files in different ways, it is common that the statistics do not agree with each other in literature.
The average degree $\bar{d}$ is defined as $\bar{d}=V/E$ (for directed graphs) and $\bar{d}=2V/E$ (for undirected graphs), where $V$ is the number of vertices and $E$ is the number of edges.
*The table is sorted by $|E|$ in the ascending order.
Table: Statistics of Real-world Graphs
Code Name | Dataset Name | Directed | Labeled | V | E | Average Degree |
---|---|---|---|---|---|---|
human | Human protein-protein interaction network | NO | Vertex | 4.70E+03 | 8.60E+04 | 36.60 |
as-caida | CAIDA AS Relationships Datasets | NO | NO | 2.65E+04 | 1.07E+05 | 8.07 |
wordnet | WordNet 1.7.1 | YES | Vertex and Edge | 8.27E+04 | 1.33E+05 | 1.61 |
enron-email | Enron email network | NO | NO | 3.67E+04 | 1.84E+05 | 10.02 |
facebook-wosn | Facebook (WOSN) | NO | NO | 6.37E+04 | 8.17E+05 | 25.64 |
com-dblp | DBLP collaboration network and ground-truth communities | NO | Vertex | 3.17E+05 | 1.05E+06 | 6.62 |
com-youtube | Youtube social network and ground-truth communities | NO | Vertex | 1.13E+06 | 2.99E+06 | 5.27 |
amazon0601 | Amazon product co-purchasing network from June 1 2003 | YES | YES | 4.03E+05 | 3.39E+06 | 8.40 |
web-google | Google web graph | YES | NO | 8.76E+05 | 4.32E+06 | 4.94 |
web-berkstan | Web graph of Berkeley and Stanford | YES | NO | 6.85E+05 | 7.60E+06 | 11.09 |
as-skitter | Autonomous systems by Skitter | NO | NO | 1.70E+06 | 1.11E+07 | 13.08 |
ego-gplus | Social circles: Google+ | YES | NO | 1.08E+05 | 1.37E+07 | 127.06 |
imdb | The International Movie Database (IMDb) graph | YES | Vertex | 5.00E+06 | 1.45E+07 | 2.90 |
uspatents | Patent citation network | YES | Vertex | 3.77E+06 | 1.65E+07 | 4.38 |
eu-2005 | A small crawl of the .eu domain, mainly useful for debugging and testing purposes. | YES | NO | 8.63E+05 | 1.92E+07 | 22.30 |
us-road | Challenge 9 - Road map of USA | YES | NO | 2.39E+07 | 5.83E+07 | 2.44 |
soc-livejournal | LiveJournal online social network | YES | NO | 4.85E+06 | 6.90E+07 | 14.23 |
com-orkut | Orkut online social network | NO | Vertex | 3.07E+06 | 1.17E+08 | 76.28 |
uk-2002 | Web graph from a 2002 crawl of the .uk domain. | YES | NO | 1.85E+07 | 2.98E+08 | 16.10 |
yago | YAGO 4 knowledge base | YES | Vertex and Edge | 6.70E+07 | 3.43E+08 | 5.12 |
eu-road | Roadmap of Europe | NO | NO | 1.74E+08 | 3.48E+08 | 2.00 |
billion-triple | Billion Triples Challenge 2009 Dataset | YES | YES | 1.65E+08 | 3.61E+08 | 2.19 |
dbpedia-tiny-diamond | The DBpedia Latest Core Release | YES | Vertex and Edge | NA | 9.00E+08 | NA |
What is Twitter, a Social Network or a News Media? | YES | NO | 4.16E+07 | 1.46E+09 | 35.10 | |
friendster | Friendster online social network | NO | Vertex | 6.56E+07 | 1.81E+09 | 55.06 |
reddit-social-media | Reddit social media graph | YES | Vertex and Edge | 3.90E+09 | 7.00E+09 | 1.79 |
yahoo | G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS) | YES | NO | 1.40E+09 | 1.29E+10 | 9.21 |
clueweb12 | ClueWeb12 Web Graph | YES | To Construct Manually | 9.78E+08 | 4.26E+10 | 43.51 |
wdc-2012 | Web Data Commons - Hyperlink Graph 2012 | YES | Vertex | 3.50E+09 | 1.29E+11 | 36.71 |
The description of each dataset is summarized in the following table.
Table: Description of Real-world Graphs
Code Name | Dataset Name | Graph Type | Download Link | Description |
---|---|---|---|---|
human | Human protein-protein interaction network | Protein-protein interaction network | https://github.com/yspark-dblab/gcare (See the datasets.tar.gz part) | Shijie Zhang, Shirong Li, and Jiong Yang. 2009. GADDI: distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. 192–203. You can download the processed version provided by the g-care project. |
as-caida | CAIDA AS Relationships Datasets | computer network | http://snap.stanford.edu/data/as-caida.html | The dataset contains 122 CAIDA AS graphs, from January 2004 to November 2007 - http://www.caida.org/data/active/as-relationships/. Each file contains a full AS graph derived from a set of RouteViews BGP table snapshots. Vertices represent routers. Edges represent routes between routers. |
wordnet | WordNet 1.7.1 | natural language | http://vlado.fmf.uni-lj.si/pub/networks/data/dic/Wordnet/Wordnet.htm | Graphs of relations among words, retrieved from WordNet v1.7.1. Vertices represent words. Edges represent the relations between words. Vertex labels represent word types (like verb, noun, …). Edge labels represent types of relations among words (like similar, cause, …). |
enron-email | Enron email network | communication network | http://snap.stanford.edu/data/email-Enron.html | Email communication network from Enron. Vertices are email addresses. Edges are communication relationships between email addresses. |
facebook-wosn | Facebook (WOSN) | social network | http://socialnetworks.mpi-sws.org/data-wosn2009.html | This undirected network contains friendship data of Facebook users. A node represents a user and an edge represents a friendship between two users. The dataset is obviously not complete and contains a very small subset of the total Facebook friendship graph. |
com-dblp | DBLP collaboration network and ground-truth communities | collaboration network | http://snap.stanford.edu/data/com-DBLP.html | DBLP co-authorship collaboration network. Vertices are authors. There is an edge among two vertices if the two authors have published at least one paper together. Each connected component is regarded as a community. Each label corresponds to a community. |
com-youtube | Youtube social network and ground-truth communities | social network | http://snap.stanford.edu/data/com-Youtube.html | Youtube social network. Vertices represent users. Edges represent friendships between users. Labels represent ground-truth communities. |
amazon0601 | Amazon product co-purchasing network from June 1 2003 | E-commerce | http://snap.stanford.edu/data/amazon0601.html (graph), http://snap.stanford.edu/data/amazon-meta.html (label & property) | Network was collected by crawling Amazon website. It is based on Customers Who Bought This Item Also Bought feature of the Amazon website. If a product i is frequently co-purchased with product j, the graph contains a directed edge from i to j. Vertices represent products. Edges represent the co-purchased relationships. |
web-google | Google web graph | WWW | http://snap.stanford.edu/data/web-Google.html | Web hyperlink graph released by Google. Vertices represent web pages. Edges represent hyperlinks between web pages. |
web-berkstan | Web graph of Berkeley and Stanford | WWW | http://snap.stanford.edu/data/web-BerkStan.html | Nodes represent pages from berkely.edu and stanford.edu domains and directed edges represent hyperlinks between them. The data was collected in 2002. |
as-skitter | Autonomous systems by Skitter | computer network | http://snap.stanford.edu/data/as-Skitter.html | Internet topology graph. From traceroutes run daily in 2005 - http://www.caida.org/tools/measurement/skitter. From several scattered sources to million destinations. 1.7 million nodes, 11 million edges. |
ego-gplus | Social circles: Google+ | social network | http://snap.stanford.edu/data/ego-Gplus.html | This dataset consists of ‘circles’ from Google+. Vertices represent users. Edges represent the relationships between users. |
imdb | The International Movie Database (IMDb) graph | knowledge graph | https://www.imdb.com/interfaces/ | More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566. |
uspatents | Patent citation network | citation network | http://snap.stanford.edu/data/cit-Patents.html | U.S. patent dataset is maintained by the National Bureau of Economic Research. Vertices represent patents. Edges represent citations among patents. Vertex labels can represent patent classification. |
eu-2005 | A small crawl of the .eu domain, mainly useful for debugging and testing purposes. | WWW | https://law.di.unimi.it/webdata/eu-2005/ | A small crawl of the .eu domain, mainly useful for debugging and testing purposes. This graph exhibits a very low locality, probably because the crawl was quite shallow (and the chosen domain is quite artificial anyway). Vertices present web pages. Edges represent hyper links between webpages. |
us-road | Challenge 9 - Road map of USA | road network | http://www.diag.uniroma1.it/challenge9/download.shtml | Road map of 12 USA road networks. Vertices represent key points. Edges present roads connected the key points. Edges have weights which represent either the distance or the travel time. |
soc-livejournal | LiveJournal online social network | social network | http://snap.stanford.edu/data/soc-LiveJournal1.html | Social network of LiveJournal. Vertices represent users of LiveJournal. Edges represent friendship relationships between users. |
com-orkut | Orkut online social network | social network | http://snap.stanford.edu/data/com-Orkut.html | Orkut social network and ground-truth communities. Vertices represent users. Edges represent friendships between users. Labels represent groud-truth communities. |
uk-2002 | Web graph from a 2002 crawl of the .uk domain. | WWW | https://law.di.unimi.it/webdata/uk-2002/ | This graph has been obtained from a 2002 crawl of the .uk domain performed by UbiCrawler. Vertices represent web pages. Edges present hyperlinks between web pages. |
yago | YAGO 4 knowledge base | knowledge graph | https://yago-knowledge.org/downloads/yago-4 | YAGO is a huge semantic knowledge base, derived from Wikipedia WordNet and GeoNames. YAGO dataset is a knowledge graph stored in the standard RDF framework. Vertices represent entities. Edges represent relationships between entities. |
eu-road | Roadmap of Europe | road network | https://i11www.iti.kit.edu/resources/roadgraphs.php | Road map of Europe retrieved from the OpenStreetMap. Vertices represent crosses. Edges represent the roads connected crosses. |
billion-triple | Billion Triples Challenge 2009 Dataset | knowledge graph | http://km.aifb.kit.edu/projects/btc-2009/ | The major part of the dataset was crawled during February/March 2009 based on datasets provided by Falcon-S, Sindice, Swoogle, SWSE, and Watson using the MultiCrawler/SWSE framework. To ensure wide coverage, we also included a (bounded) breadth-first crawl of depth 50 starting from http://www.w3.org/People/Berners-Lee/card. |
dbpedia-tiny-diamond | The DBpedia Latest Core Release | knowledge graph | https://www.dbpedia.org/resources/latest-core/ | The DBpedia Latest Core Release is the small subset of of the total DBpedia releases that we are loading into the main DBpedia SPARQL endpoint and Linked Data Interface and Lookup Search. Dbpedia is a knowledge graph database. It is updated in a montly cycle. |
What is Twitter, a Social Network or a News Media? | social network | https://anlab-kaist.github.io/traces/WWW2010 | Twitter’s following graph. Vertices represent twitter user IDs. Edges represent the following relationships. | |
friendster | Friendster online social network | social network | http://snap.stanford.edu/data/com-Friendster.html | Friendster social network. Vertices represent users in the Friendster on-line gaming network. Edges represent friendship relations between users. Labels represent ground-truth communities. |
reddit-social-media | Reddit social media graph | social network | https://github.com/dewarim/reddit-data-tools | More details of how to construct the dataset is available in the paper: T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566. |
yahoo | G2 - Yahoo! AltaVista Web Page Hyperlink Connectivity Graph, circa 2002 (multi part) (Hosted on AWS) | WWW | https://webscope.sandbox.yahoo.com/catalog.php?datatype=g&guccounter=1 | This dataset contains URLs and hyperlinks for over 1.4 billion public web pages indexed by the Yahoo! AltaVista search engine in 2002. |
clueweb12 | ClueWeb12 Web Graph | WWW | https://law.di.unimi.it/webdata/clueweb12/ | This is the graph underlying the ClueWeb12 dataset. Vertices represent web pages. Edges represent hyper links between web pages. The labels needs to be manually constructed from the original dataset files by treating each domain as a label. |
wdc-2012 | Web Data Commons - Hyperlink Graph 2012 | WWW | http://webdatacommons.org/hyperlinkgraph/2012-08/download.html | The hyperlink graph extracted from the Common Crawl 2012 web corpus. The graph covers 3.5 billion web pages and 128 billion hyperlinks between these pages. Vertices represent web pages. Edges represent hyperlinks among web pages. Labels are retrieved from domain names. |
Synthetic Graphs
The synthetic graphs are mainly used to evaluate performance of distributed subgraph matching methods on big graphs. The widely-used synthetic graph generation models or generator are summarized below.
R-MAT Model
The R-MAT model [21] uses a recursive model to generate the adjacency matrix of a random graph. The R-MAT model can generate different kinds of random graphs, including E-R models and power-law graphs. The R-MAT model is used in Graph500 Benchmark to generate very big graphs.
There are some typical implementation of RMAT random graph generators:
Waterloo SPARQL Diversity Test Suite (WatDiv)
WatDiv [22] is a benchmark developed for RDF data managements systems. WatDiv can generate large RDF datasets with a wide spectrum of SPARQL queries. WatDiv has a data generator and a query generator. The description of WatDiv is available at its home page.
WatDiv also has a stream version to generate dynamic RDF graphs.
Lehigh University Benchmark (LUBM)
LUBM is a benchmark for RDF data management systems. It follows a university domain ontology to generate big RDF graphs. It includes a data generator and a suite of test queries. The homepage of LUBM is http://swat.cse.lehigh.edu/projects/lubm/.
LDBC Social Network Benchmarking (LDBC-SNB)
LDBC Social Network Benchmarking (LDBC-SNB) is a benchmark suite for graph databases. It can simulate both interactive workloads and business intelligence workloads of a social network. It includes a social network generator that generates complex property graphs and a set of benchmark queries. The home page of LDBC-SNB is https://ldbcouncil.org/benchmarks/snb/.
EvoGraph
EvoGraph [23] is a tool upscaling a given real-world graph to generate large simulated graphs. The generated simulated graphs can preserve the properties of the underlying real-world graph. The source code of EvoGraph is available at chan150’s GitHub repository.
References
- C. T. Duong, D. Hoang, H. Yin, M. Weidlich, Q. V. H. Nguyen, and K. Aberer, “Efficient Streaming Subgraph Isomorphism with Graph Neural Networks,” Proc. VLDB Endow., vol. 14, no. 5, pp. 730–742, 2021.
- X. Jin, Z. Yang, X. Lin, S. Yang, L. Qin, and Y. Peng, “FAST: FPGA-based Subgraph Matching on Massive Graphs,” in Proc. ICDE, 2021, pp. 1452–1463. doi: 10.1109/ICDE51399.2021.00129.
- D. Mawhirter, S. Reinehr, C. Holmes, T. Liu, and B. Wu, “GraphZero: A High-Performance Subgraph Matching System,” SIGOPS Oper. Syst. Rev., vol. 55, no. 1, pp. 21–37, 2021, doi: 10.1145/3469379.3469383.
- H. Kim, Y. Choi, K. Park, X. Lin, S.-H. Hong, and W.-S. Han, “Versatile Equivalences: Speeding up Subgraph Query Processing and Subgraph Matching,” in SIGMOD, New York, NY, USA, 2021, pp. 925–937. doi: 10.1145/3448016.3457265.
- Z. Yang, L. Lai, X. Lin, K. Hao, and W. Zhang, “HUGE: An Efficient and Scalable Subgraph Enumeration System,” in Proc. SIGMOD, New York, NY, USA, 2021, pp. 2049–2062. doi: 10.1145/3448016.3457237.
- W. Guo, Y. Li, M. Sha, B. He, X. Xiao, and K.-L. Tan, “GPU-Accelerated Subgraph Enumeration on Partitioned Graphs,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1067–1082. doi: 10.1145/3318464.3389699.
- S. Sun and Q. Luo, “In-Memory Subgraph Matching: An In-Depth Study,” in SIGMOD, New York, NY, USA, 2020, pp. 1083–1098. doi: 10.1145/3318464.3380581.
- H. Zhang, J. X. Yu, Y. Zhang, K. Zhao, and H. Cheng, “Distributed Subgraph Counting: A General Approach,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2493–2507, 2020.
- X. Jian, Y. Wang, X. Lei, Y. Shen, and L. Chen, “DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs,” in Proc. DASFAA, Cham, 2020, pp. 632–640.
- L. Zeng, L. Zou, M. T. Özsu, L. Hu, and F. Zhang, “GSI: GPU-friendly Subgraph Isomorphism,” in Proc. ICDE, Apr. 2020, pp. 1249–1260. doi: 10.1109/ICDE48307.2020.00112.
- S. Sun, X. Sun, Y. Che, Q. Luo, and B. He, “RapidMatch: a holistic approach to subgraph query processing,” Proc. VLDB Endow., vol. 14, no. 2, pp. 176–188, Oct. 2020, doi: 10.14778/3425879.3425888.
- D. Yan, G. Guo, M. M. R. Chowdhury, M. T. Özsu, W.-S. Ku, and J. C. S. Lui, “G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph,” in Proc. ICDE, 2020, pp. 1369–1380. doi: 10.1109/ICDE48307.2020.00122.
- Y. Park, S. Ko, S. S. Bhowmick, K. Kim, K. Hong, and W.-S. Han, “G-CARE: A Framework for Performance Benchmarking of Cardinality Estimation Techniques for Subgraph Matching,” in SIGMOD, New York, NY, USA, 2020, pp. 1099–1114. doi: 10.1145/3318464.3389702.
- T. Reza, M. Ripeanu, G. Sanders, and R. Pearce, “Approximate Pattern Matching in Massive Graphs with Precision and Recall Guarantees,” in Proc. SIGMOD, New York, NY, USA, 2020, pp. 1115–1131. doi: 10.1145/3318464.3380566.
- M. Han, H. Kim, G. Gu, K. Park, and W.-S. Han, “Efficient Subgraph Matching: Harmonizing Dynamic Programming, Adaptive Matching Order, and Failing Set Together,” in Proc. SIGMOD, New York, New York, USA, 2019, pp. 1429–1446. doi: 10.1145/3299869.3319880.
- B. Bhattarai, H. Liu, and H. H. Huang, “CECI: Compact Embedding Cluster Index for Scalable Subgraph Matching,” in SIGMOD, 2019, pp. 1447–1462. doi: 10.1145/3299869.3300086.
- L. Lai et al., “Distributed subgraph matching on timely dataflow,” PVLDB, vol. 12, no. 10, pp. 1099–1112, 2019, doi: 10.14778/3339490.3339494.
- A. Mhedhbi and S. Salihoglu, “Optimizing Subgraph Queries by Combining Binary and Worst-Case Optimal Joins,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1692–1704, 2019, doi: 10.14778/3342263.3342643.
- Z. Zhang, H. Wei, J. Xu, and B. Choi, “GScan: Exploiting Sequential Scans for Subgraph Matching,” in Proc. DASFAA, Cham, 2019, pp. 471–475.
- Z. Wang, R. Gu, W. Hu, C. Yuan, and Y. Huang, “BENU: Distributed Subgraph Enumeration with Backtracking-Based Framework,” in ICDE, Macao, Macao, Apr. 2019, pp. 136–147. doi: 10.1109/ICDE.2019.00021.
- D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A Recursive Model for Graph Mining,” in Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 2004, pp. 442–446. doi: 10.1137/1.9781611972740.43.
- G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, 2014, pages 197-212. WatDiv available from http://dsg.uwaterloo.ca/watdiv/.
- H. Park and M.-S. Kim, “EvoGraph: An Effective and Efficient Graph Upscaling Method for Preserving Graph Properties,” in Proc. SIGKDD, New York, NY, USA, 2018, pp. 2051–2059. doi: 10.1145/3219819.3220123.