A Large-Scale Database for Graph Representation Learning

What is graph representation learning?

In Euclidean space, we can easily determine an object's class by looking at its neighbors (see the right half of Figure 1). However, when working with graph data, how do we compare graphs?
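The idea can be sketched concretely: once graphs are embedded as points in Euclidean space, a new graph can be classified by its nearest labeled neighbor. The embeddings and labels below are made-up illustrative values, not real MalNet data.

```python
import math

# Hypothetical 2-D embeddings of molecular graphs (illustrative values only):
# each graph has been mapped to a point in Euclidean space.
embeddings = {
    "benign_a": (0.1, 0.2),
    "benign_b": (0.2, 0.1),
    "toxic_a":  (0.9, 0.8),
    "toxic_b":  (0.8, 0.9),
}
labels = {"benign_a": "benign", "benign_b": "benign",
          "toxic_a": "toxic", "toxic_b": "toxic"}

def classify(point):
    """Label a new graph embedding by its nearest labeled neighbor."""
    nearest = min(embeddings, key=lambda name: math.dist(point, embeddings[name]))
    return labels[nearest]

print(classify((0.15, 0.15)))  # near the benign cluster -> "benign"
```

This is exactly the intuition behind Figure 1: the hard part is not the nearest-neighbor lookup, but producing embeddings where graphs of the same class actually land near each other.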

Figure 1. Embedding molecular graphs into low dimensional space to determine if they are benign or toxic

What’s holding graph representation learning back?

Recent research focusing on developing graph kernels, neural networks and spectral methods to capture graph topology has revealed a number of shortcomings of existing graph benchmark datasets, which often contain graphs that are relatively small in size, few in number, and limited in class diversity.

How do we advance the field?

To solve these issues, we at Georgia Tech's Polo Club of Data Science have been working to develop the world's largest public graph representation learning database to date.

Comparison of MalNet with common graph classification datasets.
Android APK function call graph from the Banker++Trojan type and Acecard family. Each FCG represents calling relationships between functions in a program, where nodes represent functions and edges indicate inter-procedural calls. Highlighted in blue is one potential execution path.
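A function call graph of this kind can be sketched as a simple directed graph, with an execution path being any root-to-leaf chain of calls. The function names below are invented for illustration; they are not taken from a real APK.

```python
# A toy function call graph (FCG): nodes are functions, directed edges are
# inter-procedural calls. Function names are hypothetical.
fcg = {
    "onCreate":   ["loadConfig", "sendSMS"],
    "loadConfig": ["readFile"],
    "sendSMS":    ["encrypt"],
    "readFile":   [],
    "encrypt":    [],
}

def execution_paths(graph, start):
    """Enumerate potential call paths from an entry point via depth-first search."""
    stack = [[start]]
    paths = []
    while stack:
        path = stack.pop()
        callees = graph.get(path[-1], [])
        if not callees:          # leaf function: a complete path
            paths.append(path)
        for callee in callees:
            stack.append(path + [callee])
    return paths

for p in execution_paths(fcg, "onCreate"):
    print(" -> ".join(p))
```

Each such path corresponds to one of the potential execution paths highlighted in the figure; real FCGs in MalNet are far larger, which is what makes their topology informative for classification.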

How do we download and explore MalNet?

We have designed and developed MalNet Explorer, an interactive graph exploration and visualization tool to help people easily explore the data before downloading. Figure 3 shows MalNet Explorer’s desktop web interface and its main components. MalNet Explorer is available online at: www.mal-net.org

An exploration panel on the left allows users to select from the available graph types and families. Users can then visually explore each function call graph on the right. Our goal is to enable users to easily study the data without installation or download.

Enabling New Discoveries

Current graph representation datasets are significantly smaller in scale and much less diverse than MalNet. In light of this, we study what new discoveries can be made that were previously not possible due to dataset limitations. We highlight the following discoveries:

  1. Simple Baselines Surprisingly Effective. For the first time, using the largest graph database to date, our results confirm that advanced techniques in the literature do not effectively capture non-attributed graph topology.
  2. Training Data Governs Performance. We find that certain graph types have a large jump in predictive performance, while others increase linearly, or not at all.

Enabling New Research Directions

The unprecedented scale and diversity of MalNet opens up new exciting research opportunities for the graph representation community, including:

  1. Imbalanced Classification. Research on imbalanced classification in the graph domain has yet to receive much attention, largely because no suitable datasets existed. By releasing MalNet, with imbalance ratios of 7,827x (type) and 16,901x (family), we hope to foster interest in this area.
  2. Reconsidering Merits of Simpler Graph Classification Approaches. From D2, we find that simpler methods can match or outperform more recent and sophisticated techniques, suggesting that current techniques have trouble capturing non-attributed graph topology. We believe our results will inspire researchers to reconsider the merits of simpler approaches and classic techniques.
  3. Enabling Explainable Research. We observe that certain representation techniques better capture particular graph types across MalNet’s nearly 700 classes. This is a highly interesting result, as it could provide insight into when one technique is preferred over another (e.g., local neighborhood structure, global graph structure, graph motifs).
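To make the imbalance figures above concrete, the snippet below computes an imbalance ratio and a common mitigation, inverse-frequency class weights. The class counts are hypothetical, chosen only to reproduce the type-level ratio of 7,827x reported above.

```python
from collections import Counter

# Hypothetical label counts (illustrative only): one class with 7,827 graphs
# and one with a single graph yields MalNet's type-level imbalance ratio.
graph_labels = ["common_type"] * 7827 + ["rare_type"] * 1

counts = Counter(graph_labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio: {imbalance_ratio:,.0f}x")  # 7,827x

# One common mitigation: weight each class inversely to its frequency,
# so rare classes contribute more per example during training.
n = len(graph_labels)
class_weights = {c: n / (len(counts) * k) for c, k in counts.items()}
```

At ratios this extreme, accuracy alone is meaningless (always predicting the majority class scores near-perfectly), which is precisely why imbalance-aware methods and metrics are an open research direction here.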

Want to read more?

For all of the nitty-gritty details of MalNet, we have released our paper on arXiv.

Scott Freitas

PhD student @ Georgia Tech. I work at the intersection of applied and theoretical machine learning.