[DAO: bafkrei] Creation of DAO-Owned Data Aggregation Layer for Decentraland Core

by 0xe400a85a6169bd8be439bb0dc9eac81f19f26843 (howieDoin)

Should the following Tier 4: up to $60,000 USD, 6 months vesting (1 month cliff) grant in the Platform Contributor category be approved?

Abstract

We believe Decentraland player usage data should be collected by the DAO and made available to all Decentraland users.

User location data is currently available through use of the /islands endpoint on DCL content server nodes. This information can be collected periodically (e.g. every few seconds) and stored in a database to be served up via API queries available to the Decentraland community. It is not feasible for multiple parties to independently collect this information as it would cause undue strain on Decentraland infrastructure.

This data should be collected and owned by the DAO, and therefore be owned by Decentraland’s users and not a private company or individual.

Grant size

57,300 USD

Beneficiary address

0xe64581F067Cfdce58657E3c0F58175e638C30f2B

Email address

howie@atlascorp.io

Description

Why collect this data and make it available?

Having been in the analytics business for 18 months, we at Atlas CORP know there are immediate benefits the Decentraland community could realize through the creation of this service:

1. Grow the number of successful Decentraland Builders

Move the conversation from “why Decentraland?” to “why your use case?” for builders looking to win clients or raise funds.

Providing high level Decentraland statistics will increase the growth and success of builders. Often the first question asked of teams looking to build or sell in Decentraland focuses on the metaverse itself and not the team’s use case; although Decentraland is the most decentralized metaverse it still exists in a competitive landscape. Investors and clients often need to justify investment or choice of metaverse before they can start to focus on a team’s specific use case.

By providing data on Daily Active Users, Total User Growth of Decentraland, and traffic by parcel, builders can refer to these open source analytics instead of each team attempting to obtain them by themselves.

2. Prevent undue load on Decentraland Infrastructure

While the data in question is public, not everyone can query it for themselves without adversely impacting Decentraland nodes.

There has already been discussion in the forums about shutting down external access to player position data due to increased loads felt by Decentraland content server nodes. Too many concurrent requests to these endpoints would have the effect of a DDoS attack which could impact the quality of the service each node can provide.

Therefore instead of a) playing out the Tragedy of the Commons that would occur if everyone collected the data for themselves, b) removing access such that nobody can benefit from this data, or c) allowing the data to be acquired by the highest bidder/private entity – we believe the ecosystem will benefit most from the data collection being done once and everyone sharing in access to that data.

3. Prevent Private Monopolization of the Data

We at Atlas CORP have first-hand knowledge of how valuable this data can be to those operating in Decentraland. User position data can be used to determine how many users attended events – critical for event hosts to understand how well their event performed. Daily active user data is crucial to those seeking to invest in the metaverse to help understand returns on investment.

We believe that no private institution should be able to gate-keep this information from the rest of the community. We at Atlas CORP used to make this information freely available using our own private hosting infrastructure, but demand outgrew our capacity, requiring a move to more dynamic scaling solutions. This is why we’re here with this proposal.

tl;dr

The DAO can provide the Decentraland community a free source of user data via API for up to one year for the cost of $57,300. The existence of this data set will help to grow the builder and entrepreneurial community by providing important metrics needed to win clients and funding, and will prevent monopolization by private entities. Atlas CORP is a suitable candidate for this development given its extensive history in Decentraland data collection and analytics.

Specification

What Data is in Scope?

This proposal is only for Decentraland user data as reported by the /comms/islands endpoint on Decentraland Content Servers.
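
For reference, the payloads from this endpoint can be modeled roughly as follows (a hypothetical TypeScript sketch; field names are assumptions and may differ between Catalyst versions, so treat this as illustrative rather than an official schema):

// Illustrative types for a /comms/islands response.
// Field names are assumptions, not an official Decentraland schema.
interface IslandPeer {
  id: string;                          // peer/session identifier
  address: string;                     // user's ETH wallet address
  position: [number, number, number];  // in-world coordinates
}

interface Island {
  id: string;
  peers: IslandPeer[];
  maxPeers: number;
}

interface IslandsResponse {
  ok: boolean;
  islands: Island[];
}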

This proposal does NOT include:

  • Content Server data (e.g. scene files, user profile history)
  • Scene-specific data (e.g. object clicks and interactions)
  • Any personally identifiable information (PII), excluding ETH wallet address
  • User IRL location data
  • Any derived content from a user’s ETH wallet

Data will be collected from all active, registered Decentraland Content Servers which at the time of writing includes hephaestus, hela, heimdallr, baldr, loki, dg, odin, unicorn, marvel, and athena.

This data set can provide a platform for building more sophisticated reports by the DCL developer community. The DAO could also choose to one day monetize access to this data (e.g. when the data is being used directly for profit), as well as augment what data is collected and made available. It is important to note that this proposal is currently limited in scope to the one data set and free access to the community, and that these next steps may be the subject of future proposals.

How will this work?

Data will be collected every 20 seconds and piped into a Mongo Atlas cloud database, with a Digital Ocean server set up to provide API access to queries on the data.

An automated feed will be set up to collect data from an authoritative source for each active DAO node. Collecting data every 20 seconds yields three datapoints per minute, giving graphs at one-minute granularity three datapoints to aggregate per interval. Data collection will be set up redundantly on two or more servers, or on a single load-balanced cluster, to minimize downtime in data collection. This data is expected to grow at about 1 GB per day, which may accelerate with the growth of daily active users.
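
For concreteness, a minimal sketch of the collection loop, assuming Node 18+ (for global fetch), the official mongodb driver, and illustrative database/collection names; the production version would add the redundancy and failover described above:

// Minimal collector sketch: poll each Catalyst node every 20 seconds
// and store the raw JSON snapshot in MongoDB. Node list, database, and
// collection names are illustrative assumptions.
import { MongoClient } from "mongodb";

const NODES = ["https://peer.decentraland.org"]; // would list every active DAO node
const client = new MongoClient(process.env.MONGO_URI ?? "mongodb://localhost:27017");

async function collectOnce(): Promise<void> {
  const snapshots = client.db("dcl_stats").collection("island_snapshots");
  for (const node of NODES) {
    try {
      const res = await fetch(`${node}/comms/islands`);
      if (!res.ok) continue; // skip nodes that are down or rate limited
      const payload = await res.json();
      await snapshots.insertOne({ node, collectedAt: new Date(), payload });
    } catch (err) {
      console.error(`collection failed for ${node}`, err); // a redundant peer keeps collecting
    }
  }
}

async function main(): Promise<void> {
  await client.connect();
  setInterval(collectOnce, 20_000); // one snapshot per node every 20 seconds
}

main().catch(console.error);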

As the data is natively in JSON, we propose to use Mongo Atlas as the cloud database of choice. The database will also be deployed with multiple nodes to minimize downtime. Mongo provides an easy way to scale for future needs – whether through increasing storage capacity or sharding the deployment for increased API and query load.

To minimize infrastructure costs, we propose keeping only 3 months of data available for public consumption. A process will be designed to backup, purge, and post/host historical data such that users can perform historical analysis without excessive cost to the DAO.
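
One way the retention window could be enforced (a sketch assuming the snapshot collection above; MongoDB TTL indexes expire documents automatically once a timestamp field passes a configured age, and a separate export job would move each day's data to cold storage before it expires):

// Retention sketch: expire snapshots roughly 90 days after collection.
import { MongoClient } from "mongodb";

async function ensureRetention(client: MongoClient): Promise<void> {
  await client
    .db("dcl_stats")
    .collection("island_snapshots")
    .createIndex(
      { collectedAt: 1 },
      { expireAfterSeconds: 90 * 24 * 60 * 60 } // ~3 months
    );
}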

An API server will be written to provide simplified API access to the data in the database. These queries may include things like – users per minute (global or per parcel), daily active users (global or per parcel), and unique Decentraland visitors (global or per parcel). The API code can be made open source and hosted on GitLab, but hosted privately on Digital Ocean to prevent unauthorized access to the DAO’s database. Additional access can be granted to additional DAO representatives if deemed appropriate.
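
As one illustration of a query the API server might wrap, a daily-active-users sketch against the hypothetical snapshot schema used in the collector example above:

// Sketch: count unique wallet addresses seen during one UTC day.
import { Collection } from "mongodb";

async function dailyActiveUsers(snapshots: Collection, day: Date): Promise<number> {
  const start = new Date(day);
  start.setUTCHours(0, 0, 0, 0);
  const end = new Date(start.getTime() + 24 * 60 * 60 * 1000);
  const result = await snapshots
    .aggregate([
      { $match: { collectedAt: { $gte: start, $lt: end } } },
      { $unwind: "$payload.islands" },
      { $unwind: "$payload.islands.peers" },
      { $group: { _id: "$payload.islands.peers.address" } }, // one doc per unique wallet
      { $count: "uniqueUsers" },
    ])
    .toArray();
  return result[0]?.uniqueUsers ?? 0;
}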

API access will remain open, although a throttling mechanism will be put in place per IP address to prevent DDoS of the API servers. In addition, query data (e.g. who’s asking for what) could be saved in the database and made available as APIs for complete transparency.
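
The per-IP throttle could be ordinary HTTP middleware. A sketch using the express and express-rate-limit packages (the endpoint path and the 60 requests/minute cap are placeholders, not committed values):

// Throttling sketch: cap each client IP at a fixed request rate.
import express from "express";
import rateLimit from "express-rate-limit";

const app = express();
app.use(rateLimit({ windowMs: 60_000, max: 60 })); // 60 requests/minute per IP

app.get("/api/users-per-minute", async (_req, res) => {
  res.json({ note: "database query elided in this sketch" });
});

app.listen(3000);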

A dashboard will be made available to show recent Decentraland population data and daily active users for the reportable data set.

Personnel

The Development Team

The development of this data platform would be done by the Atlas Corporation @atlascorp_dcl team, who have extensive experience working with this data set:

  • HowieDoin – Lead Analytics & Infrastructure innovation
  • MorrisMustang – Lead DCL & Solidity innovation
  • JosephAaron – Operations and task management
  • StaleDegree – Senior Solidity/UI Development
  • RyanNFT – Junior Developer

Roadmap and milestones

What will this cost the Decentraland DAO?

This project will cost the DAO $57,300 and will take approximately 3 months to develop.

This breaks down into:

$7,500.00* - Infrastructure and Hosting Costs:

  • Budget for Mongo Atlas DB for one year
  • Budget for Digital Ocean API servers for one year
  • Budget for Digital Ocean Data collection servers for one year
  • ENS domain for three years

*This estimate is based on current pricing for each of the above platforms. $7,500 may not last a full year if pricing is altered by the provider.

$45,000.00 - Development Costs (estimated 3 month delivery)

  • Data collection with redundancy and failover to set up the data collection rails
  • API query development resources to produce queries and prevent overuse
  • DevOps and infrastructure development resources to automate as much as possible
  • Technical Writing resources to provide user-facing API documentation
  • Front-end development resources to create the dashboard

$4,800 - Ongoing Support Costs (3 months post deployment)

  • Code updates when breaking changes occur due to external forces
  • API query user support in the Decentraland discord

Vote on this proposal on the Decentraland DAO

View this proposal on Snapshot

In general I think this is a great idea, but isn’t the /comms/islands endpoint being deprecated in a few weeks as part of changes to the infrastructure?

That would be the second time the Foundation has recommended removing that endpoint, to which the community needs to say NO.

Initially the discussion was around the burden that endpoint puts on the servers. This is the precise reason for this proposal. That endpoint needs to be saved so developers can access that data from a cached layer that can be scaled to meet the needs of app developers without causing conflict for those in-world.

The Foundation does not have explicit control of the roadmap as it relates to the Decentraland protocols, which include the content server and its API. That API endpoint is what has made it possible for us to do the work we do in Decentraland for the last 18 months.

Removing that endpoint means the data would no longer be open and accessible. It would then only be available to the Foundation, as they are able to gather it directly from the client.

but from my understanding it’s kind of already happening - see the catalyst channel in the main dcl server

So much for users being in control of their data. This is completely unacceptable and not in line with the mission they are projecting; nor were the key players who make use of that data consulted in this process. A major misstep by the catalyst team. Disenfranchising developers who have worked hard on platform developments is not the way. The Foundation team should not be deprecating that endpoint. Closing that endpoint is an attempt to control data that should be open, and should be rejected by the community.

so maybe first proposal should be to stop that endpoint from being deprecated - maybe these could be combined though, or if the DAO takes over handling stats in an opensource way that would also be fine - but yeah, i agree, just losing access to that data with almost no warning is pretty lame

not gonna lie a lot of this goes over my head but @MorrisMustang and @HowieDoin have been amazing to the Decentraland Community and I have confidence they will execute this properly, so this is an easy YES vote from me!

Lol when 2 people can pass the proposal. Joke ass VP system. Give these boys the money, collect that data.

@MorrisMustang @HowieDoin this got sidestepped in other discussions, but my big question here is if ATLAS starts collecting data and distributing it as the main proxy of the catalyst server infrastructure as a whole, will we still have access to the same granularity of data as we do now in a public way?

it sounds like the API you would devise just gives access to relatively basic pre-computed stats, but maybe i have that wrong.

Currently, at a granularity of 20-second intervals, the data is about 1 GB per day with the current user base. We would prefer to save data at 5-second intervals, but quadrupling the sampling rate would mean a significant increase in infrastructure costs. This is definitely up for discussion, and it’s a balance between costs and benefits that can be decided by the community as it puts the endpoints to use.

As this data aggregation layer is built out, our team would love feedback from the community about what queries should be supported. The data will be stored raw, with queries built to unpack the data into different useful insights. All of the data will also be available for bulk download, which would allow you to spin up an independent database and build whatever queries you may want. In an ideal world, community-developed queries could be added to the protocol.

I voted YES on this proposal - The Atlas Corp team, to my knowledge, has been a provider of ad-hoc reports to multiple organizations & individuals in Decentraland for various land KPI metrics - pulling & transforming data from a source that a normal user may not easily be able to access and query.

Decentraland users should have the ability to access real-time data to utilize for their independent reporting needs (land activity, understanding total activity in DCL at a given day/place, marketplace data). Personally, I would love to pull data and perform my own analytics over Decentraland’s land activity for education purposes. I’m sure there are a few data analysts in our community who may feel the same.

From my perspective, this API database to be created by Atlas Corp will help users access a raw dataset & then transform the data and analyze it for their needs. This can enable data scientists and developers in our community to band together and help create dashboards that may help the broader community understand the current stats of DCL.

I would love to see a few members of the community get together and create a focus group to help Atlas decide which data points would be most beneficial to include in this database, making it an efficient dataset, and which data points may not be crucial and could be dropped.

The budget seems reasonable given the scope of this project.

As long as the Atlas team is able to access the data, this proposal is needed for decentralized ad-hoc data reporting. I am not familiar with the change in the catalyst infrastructure in place - this might be another proposal or discussion.

The other solution is for the community to rely on an organization who has high-level reporting dashboards, reporting on an XXX basis. However, some developers may only need specific parcel data, or drilled down details that may not be reflected into a “basic” report.

Just my two cents…

Thanks,

Maryana

Based on the reputations of the authors and those who voted in favor of this proposal, I’m voting YES. Admittedly this goes a bit above my head, but I believe the data should be freely accessible and not only accessible to the foundation.

I appreciate the in depth explanations from Morris and Maryana. This also helped me to make this decision.

Creation of DAO-Owned Data Aggregation Layer for Decentraland Core

This proposal is now in status: PASSED.

Voting Results:

  • Yes 82% 3,155,832 VP (71 votes)
  • No 18% 712,823 VP (4 votes)

Creation of DAO-Owned Data Aggregation Layer for Decentraland Core

This proposal has been ENACTED by a DAO Committee Member (0xfe91c0c482e09600f2d1dbca10fd705bc6de60bc)

Vesting Contract Address: 0x6141047169e6df0822b47687a1be516b2bd28d29
View Transaction

RESPONSE TO UPDATE #1

On the first update of this proposal, a list of challenges/issues was reported, below you can find the response for those issues.

Issue 1: 400-type errors indicative of unauthorized access

After some experimentation and community outreach, adding a User-Agent header fixed the issue. While some consider this a best practice, there is no mention of it being necessary in the documentation.

Answer: The User-Agent header is not required to retrieve the specified information from the Catalyst nodes. Below are some examples that prove the point and work as expected, returning the API results from any Catalyst node. These curl commands can be tested with any DAO node and should respond as expected.

curl -H "User-Agent: "  -v "https://peer.decentraland.org/comms/peers"  
curl -H "User-Agent: "  -v "https://peer.decentraland.org/comms/islands"  
curl  -v "https://peer.decentraland.org/comms/islands"   
curl  -v "https://peer.decentraland.org/comms/peers"   

We would highly appreciate it if you could provide more context and examples so we can reproduce this issue and update the documentation and/or fix any existing bug as necessary. If a node is not responding as expected, its owner will need to be contacted to check what is going on.

Issue 2: Several 529 errors (“Too many requests”) returned as an NGINX HTML page

This error is quite curious, as our data was captured from each node at consistent 20-second intervals. This would seem to indicate either misuse of the standard error code (i.e. there is a different reason access was withheld) or a policy that counts requests across all users of the API. In the latter case, one bad actor can disable data collection for everyone, instead of simply being throttled.

Answer:
What’s happening?
The rate limit is applied at the endpoint level across all requests, due to the processing cost of the endpoint. When the limits were configured, the baseline was current consumption plus 30% added capacity to avoid affecting any existing client. The result was a combined limit of 40 req/min across /comms/islands and /comms/peers, and this had been working well until now.

Why?
To protect the nodes, a general rate limit will always be needed, and IPs can easily be faked: even if both rate limits are applied, one per IP and one global, a bad actor would still be able to deny service to other users via a DDoS. On the other hand, the Foundation nodes are behind Cloudflare, which presents a challenge for a canonical per-IP rate limit configuration because Cloudflare moves the origin IP to a different HTTP header. If that header variable were used for the rate limit configuration, the community nodes that are not behind Cloudflare would be affected by the setting.

How to work around this?
Rate limit increased 5x: 200 req/min.
On Oct 18th, Catalyst nodes were updated with this rate limit increase. The new setting should provide enough bandwidth for the proposed use case; it takes into consideration the quantity of rate-limited requests, and there will be plenty of capacity to respond.

Long Term Solution
Implement an allow list via API keys, requested through a DAO poll with a specified service quota for a specific actor, and let the community decide who should hold the keys.
Rethink how the data is calculated and cached on the server side to avoid unnecessary processing.

Issue 3: DNS handshake failures

Usually indicative of a node being offline and unreachable via its URL. Given that we trust these nodes to be up and available so that Decentraland users have a good experience, it would be good to know the details of any SLAs (e.g. uptime requirements) that node operators commit to, and to understand who is meeting these conditions.

What’s happening?
We need to embrace the concept of decentralization not only from an infrastructure perspective but also from a governance perspective. Today, the Catalyst servers hosted by both the Decentraland Foundation and the community have no SLA in terms of reliability (uptime, latency, error rate, and saturation). We are more than happy to receive proposals on how to implement this improvement opportunity, but today there is no guarantee of a node's uptime; if a node is malfunctioning, community members may create a poll to remove it from the DAO network. Currently this does not represent an issue, thanks to the commitment of the Catalyst owners and the communication channels available for tackling issues with the nodes. We would love to understand more about the issues you are experiencing related to DNS handshakes and discoverability.

Why?
Having a decentralized network allows some nodes to be offline while still providing a good user experience for end users. The load balancing and routing algorithms act as a de-risking mechanism to safeguard the player experience. Nodes can be offline for some time for several reasons, such as changing a disk, updating, migrating, patching, or even due to an attack. There are ways to avoid this downtime, but they are not currently implemented.

How to work around this?
From a metrics point of view, if a server is not responding, it can be assumed that it has no peers or islands. In these cases, we recommend your team implement retries with exponential backoff and jitter.
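
A minimal sketch of that retry pattern, exponential backoff with full jitter (attempt counts and delays are illustrative):

// Retry sketch: exponential backoff with full jitter.
// Constants are illustrative, not prescribed values.
async function fetchWithBackoff(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
    } catch {
      // DNS failures and timeouts fall through to the backoff below
    }
    const base = 1000 * 2 ** attempt;    // 1s, 2s, 4s, 8s, ...
    const delay = Math.random() * base;  // full jitter
    await new Promise((resolve) => setTimeout(resolve, delay));
  }
  throw new Error(`giving up on ${url} after ${maxAttempts} attempts`);
}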