DS lore

words about stuff

"Hello World!" in PyTorch BigGraph

In this post I explain why graph embedding is cool, why Pytorch BigGraph is a cool way to do it and show how to use PBG on two very simple examples - the “Hello World!” of graph embedding.

All the code can be found here. With this you can quickly get started embedding your own graphs.

Example: Graph of movies

Before we get started, here’s a motivating example: visualisation of the Movies Dataset from Kaggle.

The above embedding was based on a multi-relation graph of people working on movies (actors, directors, screenwriters, lightning, cameras etc.). The visualisation is the result of running UMAP on the embeddings of the most popular movies (ignoring embeddings of people which were a by-product).

And here’s the same set of movies but with a different embedding:

This embedding was based on the graph of movie ratings. The nodes correspond to movies and raters. There are 3 types of edges - ‘this user hated this movie’, ‘this user found this movie acceptable’, ‘this user loved this movie’ - corresponding to ratings 1 to 2.5, 3 to 3.5, 4 to 5 out of 5.

I encourage you to mouse over the graphs to reveal clusters of movies related by either overlapping cast and crew (first plot) or by overlapping fanbase (second plot). It’s quite fun.

Note that one could use either of these embeddings (or a combination of the two) as a basis for a movie recommender system.

Why graph embeddings?

Graph embeddings are a set of algorithms that given a graph (set of nodes connected by edges) produce a mapping node -> n-dimensional vector (for some specified n). The goal of embedding is for the metric relationships between vectors to reflect connections of the graph. If two nodes are connected, their embeddings should be close in vector space (under some metric), if they are not - the embeddings should be distant.

If successful, the embedding encodes much of the structure of the original graph but in a fixed-width, dense numeric format that can be directly used by most machine learning models.

Unlike their better known cousins - word embeddings - graph embeddings are still somewhat obscure and underutilised in the data science community. That must be in part because people don’t realise that graphs are everywhere.

Most obviously, when the entities you’re studying directly interact with each other - they form a graph. Think - people following each other on social media or bank customers sending each other money.

More common in real life applications are bipartite graphs. That’s when there are two kinds of entities - A and B - and As link with Bs but As don’t link with other As directly and neither do Bs with other Bs. Think - shoppers and items, movies and reviewers, companies and directors. Embedding these kinds of graphs is a popular technique in recommender systems - see for example Uber Eats.

Text corpora are graphs too! You can represent each document in a corpus and each word in a document by a node. Then you connect a document-node to a word-node if the document contains the word. That’s your graph. Embedding this graph yields a word embedding + document embedding for free. (you can also use a sliding window of a few words instead of full document for better results). This way you can get a good quality word embedding using graph embedding techniques (see e.g. this).

In short - graph embeddings are a powerful and universal feature engineering technique that turns many kinds of sparse, unstructured data into dense, structured data for use in downstream machine learning applications.

Why PyTorch BigGraph

There are heaps of graph embedding algorithms to pick from. Here’s a list of models with (mostly Python) implementations. Unfortunately most of them are little better than some researcher’s one-off scripts. I think of them less as tools that you can pick up and use and more as a starting point to building your own graph embedder.

PyTorch BigGraph is by far the most mature of the libraries I have seen. It:

  • has (some) documentation.
  • includes utils for transforming edge-list data to it’s preferred format.
  • includes multiple metrics for monitoring performance during as well as after training
  • supports multi-relation and multi-entity graphs
  • is customizable enough that it supersedes multiple other older models
  • is CPU-based - which is unusual and seems like a wasted opportunity but it does make using it easier and cheaper And most importantly:

  • it is fast and works reliably on even very big graphs (being disk-based, it won’t run out of RAM)

It even includes a distributed mode for parallelizing training on the cluster. Unless the nodes of your graph number in the billions though, IMHO it is easier to just spin up a bigger machine at your favourite cloud platform. In my experiments a 16 CPU instance is enough to embed a graph of 25m nodes, 30m edges in 100d in a few hours.

If you’re curious about

Why this tutorial?

If PBG is so great why does it need a tutorial?

It seems to me that the authors were so focused on customizability that they let user experience take a back seat. Simply put - it takes way too many lines of code to do the simplest thing in PBG. The simplest usage example included in the repository consists of two files - one 108 and one 46 lines long. This is what it takes to do the equivalent of model.fit(data).predict(data).

I’m guessing this is the reason why the library hasn’t achieved wider adoption. And without a wide user base, who is there to demand a friendlier API?

I have wasted a lot of time before I managed to refactor the example to work on my graph. What follows is my stripped down to basics version of graph embedding that should work out of the box - the “Hello World!” - and one that you can use as a template for more complicated tasks.

I found another similar tutorial on Towards Data Science but the code didn’t work for me (newer version of PBG perhaps?).

Hello World!

The full code of the example, with comments, is here.

First thing to do is installing PBG. As of this writing, the version available on PyPi is broken (crashes on running the first example) and you have to install it directly from github:

1
pip install git+https://github.com/facebookresearch/PyTorch-BigGraph.git

Full requirements are here.

The graph we will be embedding consists of 4 nodes - A, B, C, D and 5 edges between them. It needs to be saved as a tab-separated file like so:

1
2
3
4
5
A   B
B   C
C   D
D   B
B   D

Before we can apply PBG to the graph, we will have to transform it to a PBG-friendly format (fortunately P BG provides a function for that). Before we do that, we have to define the training config. The config is a data structure holding all the settings and hyperparameters - like how many partitions to use (1 unless you want to do distributed training), what types of nodes there are (only 1 type), what types of edges between them etc.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
raw_config = dict(
    # graph metadata will go here
    entity_path=DATA_DIR,
    edge_paths=[
        # graph data in HDF5 format will be saved here
        DATA_DIR + '/edges_partitioned',
    ],
    # trained embeddings as well as temporary files go here
    checkpoint_path=MODEL_DIR,
    # Graph structure
    entities={
        "WHATEVER": {"num_partitions": 1}
    },
    relations=[
        {
            "name": "doesnt_matter",
            "lhs": "WHATEVER",
            "rhs": "WHATEVER",
            "operator": "complex_diagonal",
        }
    ],
    dynamic_relations=False,
    dimension=4, # silly graph, silly dimensionality
    global_emb=False,
    comparator="dot",
    num_epochs=7,
    num_uniform_negs=50,
    loss_fn="softmax",
    lr=0.1,
    regularization_coef=1e-3,
    eval_fraction=0.,
)

Next, we use the config to transform the data into the preferred format using a helper function from torchbiggraph.converters.importers.convert_input_data function. Note that the config needs to be parsed first using another helper function because nothing is simple with PyTorch BigGraph.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
setup_logging()
config = parse_config(raw_config)
subprocess_init = SubprocessInitializer()
# path to the tsv file with the graph edges
input_edge_paths = [Path(GRAPH_PATH)]


convert_input_data(
    config.entities,
    config.relations,
    config.entity_path,
    config.edge_paths,
    input_edge_paths,
    TSVEdgelistReader(lhs_col=0, rel_col=None, rhs_col=1),
    dynamic_relations=config.dynamic_relations,
)

Having prepared the data, training is straightforward:

1
train(config, subprocess_init=subprocess_init)

Important note: the above code (both data preparation and training) can’t be at the top level of a module - it needs to be placed inside a if __name__ == '__main__': block or some equivalent. This is because PTBG spawns multiple processes that import this very module at the same time. If this code is at the top level of a module, multiple processes will be trying to create the same file simultaneously and you will have a bad time!

After training is done, we can load the embeddings from a h5 file. This file doesn’t include names of the nodes so we will have to look those up in one of the files created by the preprocessing function.

1
2
3
4
5
6
7
8
9
10
11
nodes_path = DATA_DIR + '/entity_names_WHATEVER_0.json'
embeddings_path = MODEL_DIR + "/embeddings_WHATEVER_0.v{NUMBER_OF_EPOCHS}.h5" \
    .format(NUMBER_OF_EPOCHS=raw_config['num_epochs'])

with open(nodes_path, 'r') as f:
    node_names = json.load(f)

with h5py.File(embeddings_path, 'r') as g:
    embeddings = g['embeddings'][:]

node2embedding = dict(zip(node_names, embeddings))

The final result will look something like this:

1
2
3
4
5
6
{
    'D': array([-0.0803747 ,  0.2267056 ,  0.09371386,  0.12704636], dtype=float32),
    'A': array([ 0.1961012 , -0.5856796 , -0.28119525, -0.4106653 ], dtype=float32),
    'B': array([ 0.10987103, -0.40970576, -0.18397042, -0.11587717], dtype=float32),
    'C': array([-0.16625881,  0.51607686,  0.2659427 ,  0.32392678], dtype=float32)
}

This is it!

Multi-relation graphs

The second example will feature PBG’s big selling point - the support for multi-relation graphs. That means graphs with multiple kinds of edges. We will also throw in multiple entity types for good measure.

Imagine if Twitter and eBay had a baby. Data genereated on this unholy abomination of a website might look something like this:

1
2
3
4
5
6
7
8
9
10
11
alice   bought  fridge
alice   bought    bike
bob bought  laptop
carol   bought  fridge
carol   sold  laptop
carol   sold    bike
dave    sold    fridge
alice   follows bob
bob follows carol
bob hates dave
dave hates carol

Here users follow other users as well as buy and sell items to each other. As a result we have two types of entities - users and items - and four types of edges - ‘bought’, ‘sold’, ‘follows’ and ‘hates’.

We want to jointly embed users and items in a way that implicitly encodes who is buying and selling what and following or hating whom.

We could do it by ignoring relation types and embedding it as a generic graph. That would be wrong because ‘follows’ and ‘hates’ mean something quite different and we don’t want to represent Bob and Dave as similar just because one of them follows Carol and the other hates her.

Or we could do it by separately embedding 4 graphs - one for each type of relation. But that’s not ideal either because we’re losing valuable information. In our silly example Alice would only appear in the graphs of “bought” and of “follows”. Dave only appears in graphs of “sold” and “hates”. Therefore the two users wouldn’t have a common embedding and it wouldn’t be possible to calculate distance between them. A classfier trained on Alice couldn’t be applied to Dave.

We can solve this problem by embedding the full multi-relation graph in one go in PBG.

Internally, PBG deals with different relation types by applying a different (learned) transaformation to a node’s embedding in the context of a different relation type. For example it could learn that that if A ‘follows’ B, they should be close in vector space but when A ‘hates’ B, they should by close after flipping the sign of all coordinates of A - i.e. they should be represented by opposite vectors.

From the point of view of a PBG user the only difference when embedding a multi-relation, multi-entity graph is that one has to declare all relation types and entity types in the config. We also get to chose a different transformation for each relation (though I can’t imagine why anyone would). The config dict for our Twitter/eBay graph would look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
raw_config = dict(
    # I/O data
    entity_path=DATA_DIR,
    edge_paths=[
        DATA_DIR + '/edges_partitioned',
    ],
    checkpoint_path=MODEL_DIR,
    # Graph structure
    entities={
        "user": {"num_partitions": 1},
        "item": {"num_partitions": 1},
    },
    relations=[
        {
            "name": "bought",
            "lhs": "user",
            "rhs": "item",
            "operator": "complex_diagonal",
        },
        {
            "name": "sold",
            "lhs": "user",
            "rhs": "item",
            "operator": "complex_diagonal",
        },
        {
            "name": "follows",
            "lhs": "user",
            "rhs": "user",
            "operator": "complex_diagonal",
        },
        {
            "name": "hates",
            "lhs": "user",
            "rhs": "user",
            "operator": "complex_diagonal",
        }
    ],
    dynamic_relations=False,
    dimension=4,
    global_emb=False,
    comparator="dot",
    num_epochs=7,
    num_uniform_negs=50,
    loss_fn="softmax",
    lr=0.1,
    regularization_coef=1e-3,
    eval_fraction=0.,
)

Once embedding is trained, the embeddings can be loaded the same way as with a generic graph, the only difference being that each entity type has a separate embedding file.

Full code is here.

Happy embdding!

Comments