Decentraland Communications System

Introduction

In order to provide a completely decentralized solution for v1.0, we need to migrate the current comms solution towards something that can be run and maintained by the community, while preserving as much of our current capabilities as possible.

In this document, we review the scenarios handled by the solution currently in place and the limitations we face towards our goal. Based on that analysis, we propose a new solution along with the steps needed to build it.

Current State

Use Cases

  1. Share player positions

  2. Logout notification

  3. New login notification

  4. Chat

  5. Private chat

  6. Profile data changes notification
    The comms server is currently in charge of setting the user id, to prove its authenticity

  7. Message bus (multiplayer state sharing)
    Some scenes already use this approach to sync state between clients; we might address this use case for 1.0, or in a future update

Solution

Our current solution is a centralized publisher/subscriber scheme with two main actors:

  • Client
    • Kernel TypeScript client
    • Connects to the server through WebRTC
    • Subscribes to topics of interest
      • Parcels
        • Positions of other players in the current parcel (UC1)
        • Logout notification by sending a player position outside of the known world (UC2)
        • Chat messages in the current parcel (UC4)
        • Profile data change notifications (UC6)
      • Scenes
        • State sharing to accomplish multiplayer-like experiences (UC7)
      • User
        • New login of the same user to disable previous sessions (UC3)
  • Server
    • Go based backend service
    • Maintains a list of topics and participants
    • Forwards messages to clients whenever a new message is received in one of their topics of interest

The communication between these two is based on a WebRTC connection exchanging Protobuf binary messages.

A set of STUN/TURN servers may also be used, as a third actor in this system, to help establish the WebRTC connection. The list currently consists of four of Google’s STUN servers plus our own TURN server.
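The ICE server list described above might look like this in a client. The STUN entries are Google’s public servers; the TURN entry is a placeholder (URL and credentials are illustrative, not the real ones):

```typescript
// Hypothetical ICE server configuration: four Google STUN servers plus
// our own TURN server (the TURN URL and credentials below are placeholders).
const iceServers = [
  { urls: "stun:stun.l.google.com:19302" },
  { urls: "stun:stun1.l.google.com:19302" },
  { urls: "stun:stun2.l.google.com:19302" },
  { urls: "stun:stun3.l.google.com:19302" },
  { urls: "turn:turn.decentraland.example:3478", username: "user", credential: "secret" },
];

// A browser client would pass this list when creating the connection:
//   new RTCPeerConnection({ iceServers });
```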

A Decentralized Approach

We focus our new approach towards a system where:

  • A server is used at a minimum to help select and establish communication channels between peers
  • Peers are the main actors; they control the communication flow and the data exchanged

Assumptions

To be reviewed

  • Server will work IRC-like
  • Server will share room participants
    • Maintain a list of peers currently inside each room
    • Rooms should be parcels mainly (to share positions, profile data change notifications)
    • Scenes could work as rooms to support sharing state (UC7)
  • Peers will be limited to a certain amount of connections
    • Will need to tune this, but around 8-10 is the first idea
  • Peers will connect directly to others and exchange updates of position and messages (first approach)
    • Based on these two assumptions, peers won’t have a complete vision
    • We should think of a clustering/matchmaking strategy to have a consistent vision of players (so that if A sees B, B sees A; etc.)
  • Peers may relay messages from others (extended version)
  • Peers will update server with their rooms
  • Peers may decline new connections
  • Communications will use WebRTC
  • Implementation will be based on TypeScript + Protobuf + Peer.js

Solution Proposal

Server (AKA Lighthouse)

  1. Maintain & share a list of topics/rooms (users per scene, parcel)
  2. Share WebRTC offers between clients (possibly with user id)
  3. Share room state between servers

Room
An arbitrary identifier.

Operations

  1. Scan rooms
    Retrieves the list of rooms that the server handles

  2. Join room
    Sets the peer state inside a room (if it was not present already)

  3. Leave room
    Removes peer from the room (if it was present)

  4. Scan room
    Retrieves the list of peers present in the room

API Endpoints

  • GET /rooms[?userId=] -> returns the list of rooms. Includes the users per room by default. If a userId is specified, it returns only the rooms that user has joined.

  • GET /rooms/:id -> returns the list of users in the room with id :id

  • PUT /rooms/:id { userId, nickname } -> adds a user to a particular room. If the room doesn’t exist, it creates it.

  • DELETE /rooms/:id/users/:userId -> deletes a user from a room. If the room becomes empty, it deletes the room.
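As a sketch, the endpoints above could be wrapped in a small client like the following; the class and method names are illustrative, not the actual library API:

```typescript
// Minimal sketch of a lighthouse REST client for the endpoints above.
// Names (LighthouseClient, joinRoom, leaveRoom) are illustrative.
class LighthouseClient {
  constructor(private baseUrl: string) {}

  // GET /rooms[?userId=]
  roomsUrl(userId?: string): string {
    return userId
      ? `${this.baseUrl}/rooms?userId=${encodeURIComponent(userId)}`
      : `${this.baseUrl}/rooms`;
  }

  // GET /rooms/:id
  roomUrl(roomId: string): string {
    return `${this.baseUrl}/rooms/${encodeURIComponent(roomId)}`;
  }

  // PUT /rooms/:id — adds a user to the room (creates the room if missing)
  async joinRoom(roomId: string, user: { id: string; nickname?: string }) {
    return fetch(this.roomUrl(roomId), {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(user),
    });
  }

  // DELETE /rooms/:id/users/:userId — removes a user (empty rooms are dropped)
  async leaveRoom(roomId: string, userId: string) {
    return fetch(`${this.roomUrl(roomId)}/users/${encodeURIComponent(userId)}`, {
      method: "DELETE",
    });
  }
}
```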

Test Resources

curl localhost:9000/rooms

curl localhost:9000/rooms\?userId=asdf2

curl -X PUT localhost:9000/rooms/room2 -d '{ "id": "asdf" }' -H "Content-Type: application/json"

curl -X DELETE localhost:9000/rooms/room2/users/asdf1

curl localhost:9000/rooms/room1

Client

  1. Update topics to lighthouse
  2. Update position between peers
  3. Notify profile data changes

Peer APIs

  • Get rooms from lighthouse
  • Join room (connect with all peers that are not already connected)
  • Leave room (disconnect from all peers that are not in any joined room)
  • Send message to room (to all users in said room). Option `reliable?: boolean` indicating whether the message should be reliably delivered.
  • Receive messages from room, knowing which peer sent it
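The join/leave connection semantics above can be sketched as pure bookkeeping, independent of the WebRTC transport: joining a room returns the peers to connect to, leaving returns the peers to disconnect from. All names here are illustrative, not the actual peer library:

```typescript
// Illustrative bookkeeping for which P2P connections to open or close
// when joining/leaving rooms. Transport (Peer.js/WebRTC) is out of scope.
class RoomMembership {
  // roomId -> peer ids in that room (as reported by the lighthouse)
  private rooms = new Map<string, Set<string>>();

  // Returns the peers we should now connect to (not already connected)
  joinRoom(roomId: string, peersInRoom: string[]): string[] {
    const before = this.connectedPeers();
    this.rooms.set(roomId, new Set(peersInRoom));
    return peersInRoom.filter((p) => !before.has(p));
  }

  // Returns the peers we should disconnect from (no joined room in common)
  leaveRoom(roomId: string): string[] {
    const leaving = this.rooms.get(roomId) ?? new Set<string>();
    this.rooms.delete(roomId);
    const still = this.connectedPeers();
    return [...leaving].filter((p) => !still.has(p));
  }

  private connectedPeers(): Set<string> {
    const all = new Set<string>();
    for (const peers of this.rooms.values()) for (const p of peers) all.add(p);
    return all;
  }
}
```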

Choice of Rooms

As a rule of thumb, we plan to use a one-to-one mapping of the current communication topics to rooms, meaning:

  1. Parcels

As a room, the peers will be sharing their positions with each other when inside a parcel as well as profile data change notifications and chat messages.

  2. Scenes

Scene instances in each client will share this room to sync their state, allowing a multiplayer-like experience.

For this, it is also important that the set of peers chosen to be part of a shared experience is consistent (i.e. if three players are having a shared experience, they should all be connected together and no one else should be part of it).

  3. User ids

Each player is also in a room of their own, so that a new Explorer instance can override the previous session and disable the previous client.

MVP (current implementation)

Requirements

  • One server
  • Multiple rooms
  • One reliable message type (e.g. chat)
  • One unreliable message type (e.g. cursor position)
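In WebRTC terms, the two message types map naturally onto data channel options: a reliable channel is ordered and retransmitted, while an unreliable one disables retransmission. A minimal sketch (the helper name is ours, not part of the codebase):

```typescript
// Illustrative helper: build data-channel options for the two message types.
// Mirrors a subset of the RTCDataChannelInit dictionary.
interface ChannelInit {
  ordered: boolean;
  maxRetransmits?: number;
}

function channelInit(reliable: boolean): ChannelInit {
  return reliable
    ? { ordered: true } // reliable: ordered delivery, retransmitted (e.g. chat)
    : { ordered: false, maxRetransmits: 0 }; // unreliable: no retransmits (e.g. positions)
}

// A browser client would use it roughly like:
//   connection.createDataChannel("chat", channelInit(true));
```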

Scenarios

  1. Sharing reliable messages in room

    1. Peer A enters room A
    2. Peer B enters room A
    3. Peer A shares message M in room A
    4. Peer B receives message M from A
  2. Disconnection on leave

    1. Peer A enters room
    2. Peer B enters room
    3. Peer A and B connect
    4. Peer A leaves room
  3. Sharing unreliable messages in room

    1. Peer A enters room A
    2. Peer B enters room A
    3. Peer A shares message M multiple times in room A
    4. Peer B receives message M at least once from A
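Since scenario 3 only guarantees at-least-once delivery, the receiver may see duplicates; one simple way to handle that is to tag each unreliable message with an id and drop repeats. This dedup helper is purely illustrative:

```typescript
// Drop duplicate unreliable messages by (sender, messageId). A real
// implementation would also expire old entries to bound memory use.
class Deduplicator {
  private seen = new Set<string>();

  // Returns true the first time a (sender, id) pair is seen, false on repeats
  accept(senderId: string, messageId: string): boolean {
    const key = `${senderId}:${messageId}`;
    if (this.seen.has(key)) return false;
    this.seen.add(key);
    return true;
  }
}
```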

Further development

Requirements

  • Multiple servers
  • Multiple rooms
  • Multiple message types

Scenarios

  1. Sync rooms
    1. Peer A connects to server A
    2. Peer A enters room R
    3. Server B connects to server A
    4. Peer B connects to server B
    5. Peer B scans room R and sees A
    6. Peer B enters room R
    7. Peer A scans room R and sees B

Prototype + Demo conclusions

What has been done:

All the implemented code is included in one repository: https://github.com/decentraland/catalyst/tree/master/comms

This repo has three components:

  • A mockup for the lighthouse server that exposes the rooms and the users in each room
  • A peer library that has code to scan and join rooms (connecting to each peer of the room when entering)
  • A simple P2P chat app implemented in React to test the approach and the library. It can send chat messages (representing reliable message delivery) and show the positions of remote peers’ cursors (representing unreliable message delivery).

This prototype proved that the approach could be viable. As a result, we can draw the following conclusions:

The good:

  • PeerJS was great to start the project quickly: easy to set up and use.
  • The P2P connections worked in a number of different scenarios: two tabs in the same browser, two browsers on the same computer, several computers on the same LAN, and a smartphone on its 4G network against a couple of computers.
  • Latency seemed fine even through the 4G network.

The bad:

  • It seems PeerJS doesn’t allow multiple channels per connection.
  • We lack control over how the connections are established, and over some of their details and properties. The unreliable channels are not reported as unreliable through chrome://webrtc-internals, but since all connections seem to be UDP, this may be OK.

There are a couple of unknowns that didn’t make it to the test, or that emerged as a result of it, namely:

  • We are still researching whether the PeerJS library works as expected for unreliable channels
  • What’s the scalability? How many connections can a peer keep? How much information can the connections transfer?
  • What’s the optimum way of handling the use case of leaving/entering rooms in order to keep/close the connections?
  • What are the requirements for the connection to be established without a TURN server? It seems that if at least one of the peers is not behind NAT (as was the case for the smartphone on its 4G network), it works seamlessly. What about firewalls? We should know the exact requirements, to be able to communicate them to the users.

The prototype provided valuable insight into the viability of the proposed solution. With that information we started working on the actual kernel/lighthouse project:

  • Discussed design decisions, wrote them down, and communicated them
  • Created a list of tasks and estimated them
  • Created the lighthouse repo and project and started implementing the tasks
  • Extracted the “Peer” library from the prototype project
  • Branched the kernel project in order to start working on it in parallel
  • Kept testing with the prototype in order to pin down some of the unknowns
    • Test scalability and limits
      • Max number of concurrent connections
      • Max number of “messages” per second in a reliable connection
      • Performance degradation in unreliable connections (high latency, jitter)
    • Connections through symmetric NAT
    • Connections through pre-installed firewalls of Windows and MacOS
    • See if it is possible to use two data channels through the same connection. Maybe patching PeerJS.
    • Figure out a strategy to decide which connections to keep. “Sync” (connect/disconnect to peers) connections each time a room is joined or left. Explore the idea of “dormant” connections (if the limits are reasonable).

Future Work

  1. Do we need the Lighthouse to act as a STUN server?
  2. Scene messages? Authoritative?
    1. Multiplayer 1pgr
  3. Security Review
  4. Persistence of rooms
  5. Load tests
  6. How will we ensure that server versions are updated when needed?