Goal: Improve the decentralized spam filter in Freenet (WoT) to have deterministic network load, bounded to a low, constant number of subscriptions and fetches.
This article provides calculations which show that decentralized spam filtering with privacy through pseudonyms can scale to communication systems that connect all of humanity. It is also applicable to systems other than Freenet; see use in other systems.
Originally written as a comment to bug 3816. The bug report said "someone SHOULD do the math". I then did the math. Here I’m sharing the results.
Useful prior reading is Optimizing a distributed spam filter for Freenet.
This proposal has two parts:
Subscribe to all rank 1 IDs (which have direct trust from your OwnID). These are the primary subscriptions. There are N primary subscriptions; Dunbar's number⁰ suggests N is bounded by a small constant.

⁰: https://en.wikipedia.org/wiki/Dunbar's_number - comment by bertm: that assumes all statements of "OwnID trusts ID to not be a spammer" to be equivalent to "OwnID has a stable social relationship with ID". I'm not quite sure of that equivalence. That said, for purposes of analysis, we can well assume it to be bounded by O(1).
All the other IDs are split into two lists: rank2 (secondary IDs) and rank3+ (three or more steps to reach them). Only a subset of those get subscriptions, and the subset is regularly changed:
Subscribe to M randomly chosen IDs and to the M most recently updated IDs from each of the two lists, giving 4M secondary subscriptions in total (the calculations below use M = 10).
Also replace one of the randomly chosen rank2 subscriptions and one of the randomly chosen rank3+ subscriptions every hour. This ensures that WoT will always eventually see every update.
If any subscription yields an update, download its key and process all edition hints. Queue these as fetches in separate queues for rank1 (primary), rank2 (secondary), and rank3+ (random), and process them independently.
At every update of a subscription (rank1, rank2, or rank3+), choose F fetches from the respective edition hint fetch queue at random and process them. This bounds the network load to ((N × F) + (4M × F)) × update frequency.
These fetches and subscriptions must be deduplicated: If we already have a subscription, there’s no use in starting a fetch, since the update will already have been seen.
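To make the moving parts concrete, here is a minimal Python sketch of the subscription bookkeeping, under the assumption that the surrounding code can list IDs by rank and report which updated most recently. All names (rank1_ids, rank2_ids, rank3plus_ids, most_recently_updated, fetch_identity) are invented for illustration; they are not WoT's actual API.

```python
import random

# Illustrative sketch of the bounded subscription scheme described above.
# rank1_ids, rank2_ids, rank3plus_ids, most_recently_updated and
# fetch_identity are placeholder names, not WoT's real API.

class BoundedSubscriptions:
    def __init__(self, wot, own_id, M=10, F=10):
        self.wot, self.own_id, self.M, self.F = wot, own_id, M, F
        self.primary = set(wot.rank1_ids(own_id))   # N primary subscriptions
        # 4M secondary subscriptions: M random + M most recently updated
        # from each of the rank2 and rank3+ lists.
        self.random_subs = {
            "rank2": set(random.sample(wot.rank2_ids(own_id), M)),
            "rank3+": set(random.sample(wot.rank3plus_ids(own_id), M)),
        }
        self.active_subs = {
            "rank2": set(wot.most_recently_updated(wot.rank2_ids(own_id), M)),
            "rank3+": set(wot.most_recently_updated(wot.rank3plus_ids(own_id), M)),
        }
        self.hint_queues = {"rank1": [], "rank2": [], "rank3+": []}

    def hourly_rotation(self):
        # Replace one randomly chosen rank2 and one rank3+ subscription per
        # hour, so that every ID eventually gets a subscription.
        for pool, candidates in (("rank2", self.wot.rank2_ids(self.own_id)),
                                 ("rank3+", self.wot.rank3plus_ids(self.own_id))):
            subs = self.random_subs[pool]
            subs.discard(random.choice(tuple(subs)))
            fresh = [c for c in candidates if c not in subs]
            if fresh:
                subs.add(random.choice(fresh))

    def on_update(self, pool, edition_hints):
        # Queue the edition hints (deduplicated against live subscriptions and
        # the queue itself), then process F randomly chosen fetches from the
        # queue belonging to the subscription which saw the update.
        subscribed = self.primary.union(*self.random_subs.values(),
                                        *self.active_subs.values())
        queue = self.hint_queues[pool]
        queue.extend(h for h in edition_hints
                     if h not in subscribed and h not in queue)
        for hint in random.sample(queue, min(self.F, len(queue))):
            queue.remove(hint)
            self.wot.fetch_identity(hint)
```

The key property of this sketch is that on_update never starts more than F fetches per observed update, no matter how many edition hints arrive.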
To estimate an upper bound for the fetch frequency, we can use Twitter posting frequency, which is about 5 tweets per day on average and 10 to 50 per day for people with many followers¹ (those are more likely to be rank1 IDs of others); the calculations below use 22 updates per day for very active IDs.
There are two possible extremes: Very hierarchic trust structure and egalitarian trust structure. Reality is likely a power-law structure.
For high frequency subscriptions (most recently updated) we can assume 4 updates per hour for 16 hours per day, so 64 updates per day.⁰ For random subscriptions we can assume 5 updates per day (as per ¹).
¹: http://blog.hubspot.com/blog/tabid/6307/bid/4594/Is-22-Tweets-Per-Day-the-Optimum.aspx ← on the first Google results page, not robust, but should be good enough for this use case.
The total number of fetches per day is then bounded by: ((N × F) + (M × F)) × trustee update frequency + 2M × F × high update frequency + M × F × random update frequency.
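Written as a small Python helper (parameter names chosen here for readability; nothing below comes from existing code), the bound becomes:

```python
def daily_fetch_bound(N, M, F, trustee_per_day, high_per_day, random_per_day):
    """Upper bound on fetches per day for the subscription scheme above."""
    primary      = N * F * trustee_per_day   # rank1, triggered by direct trustees
    random_rank2 = M * F * trustee_per_day   # random rank2, roughly as active as trustees
    active       = 2 * M * F * high_per_day  # most recently updated rank2 and rank3+
    random_rank3 = M * F * random_per_day    # random rank3+, average activity
    return primary + random_rank2 + active + random_rank3
```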
For a very hierarchic WoT (primaries are very active) this gives the upper bound:
= (150 × 10 × 22) + (10 × 10 × 22) + (10 × 10 × 64) + (10 × 10 × 5) + (10 × 10 × 64)
= (1500 × 22) + (100 × 22) + (100 × 64) + (100 × 5) + (100 × 64)
= 33000 + 2200 + 6400 + 500 + 6400 # primary triggered + random rank2 + active rank2 + random rank3+ + active rank3+
= 48500 fetches per day
~ 34 fetches per minute.
For an egalitarian trust structure (primaries have average activity) this gives the upper bound:
= (150 × 10 × 5) + (10 × 10 × 5) + (10 × 10 × 64) + (10 × 10 × 5) + (10 × 10 × 64)
= (1500 × 5) + (100 × 5) + (100 × 64) + (100 × 5) + (100 × 64)
= 7500 + 500 + 6400 + 500 + 6400 # primary triggered + random rank2 + active rank2 + random rank3+ + active rank3+
= 21300 fetches per day
~ 15 fetches per minute.
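A quick numeric check of both scenarios (same assumptions: N = 150, M = 10, F = 10, with 22 respectively 5 updates per day for trustees, 64 for the most recently updated IDs, 5 for random IDs):

```python
hierarchic  = 150*10*22 + 10*10*22 + 2*10*10*64 + 10*10*5  # 48500 fetches/day
egalitarian = 150*10*5  + 10*10*5  + 2*10*10*64 + 10*10*5  # 21300 fetches/day
print(hierarchic,  round(hierarchic  / (24 * 60)))  # 48500, ~34 per minute
print(egalitarian, round(egalitarian / (24 * 60)))  # 21300, ~15 per minute
```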
This gives plausible upper bounds for the network load per day from this scheme. The upper bound for a very hierarchic trust structure is dominated by the primary subscriptions. The upper bound for an egalitarian trust structure is dominated by the primary subscriptions and the high frequency subscriptions.
The random rank2 and random rank3+ subscriptions together make up about 5% of the network load. They are needed to guarantee that the WoT always eventually converges to a globally consistent view.
One fetch for an ID transfers about 1KiB data. For a hierarchic WoT (one fetch per two seconds) this results in a maximum bandwidth consumption on a given node of 1KiB/s × hops. This is about 5KiB/s for the average of 5 hops — slightly higher than our minimum bandwidth. For an egalitarian WoT this results in a maximum bandwidth consumption on a given node of 0.5KiB/s × hops. This is about 2.5KiB/s for the average of 5 hops — 60% of our minimum bandwidth. The real bandwidth requirement should be lower, because IDs are cached very well.
The average total number of subscriptions to active IDs should be bounded to N + 4M = 190.
⁰: The cost of active IDs might be overestimated here, because WoT has an upper bound of one update per hour. In this case the cost of this algorithm would be reduced by about 30% for the egalitarian structure and by about 10% for the hierarchic structure.
The process to check IDs with rank >= 2 can be improved from essentially checking them at random (with the real risk of missing IDs — there is no guarantee to ever check them all, not even networkwide), to having each active ID check all IDs in O(N) steps (with N here being the total number of IDs).
When removing a random subscription to an ID with rank2 or higher, add the ID together with its current version to a blocklist with 50% probability, and prune it from the WoT.¹ The blocklist prevents processing this ID again at this or any lower version.
When an edition hint from another ID announces a higher version than the blocked one, the ID is removed from the blocklist.
The total cost in memory is on the order of the number of old IDs already checked, bounded to O(N), the number of Identities.
¹: Pruning the ID from WoT is not strictly necessary in the short term. However, in the long term (a decade and millions of users), we must remove information.
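A sketch of the blocklist bookkeeping, again with invented names (the real logic would live inside WoT's subscription and fetch handling):

```python
import random

# Sketch of the blocklist for stale IDs (illustrative, not WoT's real API).
# blocklist maps identity -> highest edition at which it was blocked.

class StaleIdBlocklist:
    def __init__(self):
        self.blocklist = {}

    def on_unsubscribe(self, identity, current_edition, wot):
        # When a random rank>=2 subscription is dropped, block the ID at its
        # current edition with 50% probability and prune it from the WoT.
        if random.random() < 0.5:
            self.blocklist[identity] = current_edition
            wot.prune(identity)

    def on_edition_hint(self, identity, hinted_edition):
        # A hint for a newer edition unblocks the ID so it gets checked again.
        if identity in self.blocklist and hinted_edition > self.blocklist[identity]:
            del self.blocklist[identity]

    def should_skip(self, identity, edition):
        # Skip fetches for blocked IDs at the blocked edition or below.
        return identity in self.blocklist and edition <= self.blocklist[identity]
```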
Assume that 9k of the 10k IDs in WoT are stale (a reasonable assumption, because only about 300 IDs are inserted from an up to date version of WoT right now).
When replacing one random rank2 and one random rank3+ subscription per hour, that yields about 16k subscription replacements per year, or (in a form which simplifies the math) about two replacements per ID in the WoT.
For the first replacement there is a 90% probability that the ID in question is stale, and a 50% probability that it will be put on the blocklist if it is stale, which yields a combined 45% probability that the number of stale IDs decreases by one. In other words, it takes on average 2.2 steps to remove the first stale ID from the IDs to check.
As a rough estimate, for 10 IDs it would take 15 steps to prune out 5 of the 9 stale IDs. Scaling this up should give an estimation of the time required for 9k IDs. So after about 15k steps (one year) half the stale IDs should be on the blocklist.
For a given stale ID, after one year there is roughly a 50% chance that it is on the blocklist of a given active ID. But the probability that it is on the blocklist of every active ID is just about 0.5^k, with k the number of active IDs. So when there is an update to this previously stale ID, it is almost certain that some ID will see it and remove it from the blocklists of most other IDs within O(N) steps by providing an edition hint (this will accelerate as more stale IDs are blocked).
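A small worked check of these probabilities, using the assumptions from the text (90% of IDs stale, 50% blocking probability on unsubscribe, roughly 50% blocking probability per active ID per year):

```python
# Probability that a replacement step prunes one stale ID:
# 90% chance the replaced subscription points to a stale ID,
# 50% chance it then lands on the blocklist.
p_prune = 0.9 * 0.5                   # 0.45
steps_per_removal = 1 / p_prune       # ~2.2 replacement steps on average

# After one year a given stale ID is blocked by a given active ID with ~50%
# probability; the chance that ALL k active IDs have blocked it is 0.5^k.
k = 10_000
p_blocked_everywhere = 0.5 ** k       # underflows to 0.0; analytically ~10**-3010
print(p_prune, round(steps_per_removal, 1), p_blocked_everywhere)
```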
I am sure there is a beautiful formula to calculate exactly the proportion of subscriptions to stale IDs this algorithm produces once it has reached a steady state, and the average discovery time for a previously stale ID to be seen networkwide again when it starts updating again. To show that this algorithm should work, though, a much simpler question suffices:
How long will it take an ID which was inactive for 10 years to be seen networkwide again (assuming its direct trusters are all inactive; otherwise the primary subscriptions would detect and spread its update within minutes)?
After 10 years, the ID will be on the blocklist of 99.9% of the IDs. In a network with 10k active IDs, that means only about 10 IDs did not block it yet¹. Every year, each of those IDs has a 50% probability of seeing the update.
Therefore detection of the update to an ID which was inactive for 10 years and whose direct trusters are all inactive will take about 10 weeks. Then the update should spread rapidly via edition hints.
¹: There is a 7% probability that 15 or more IDs can still see it and a 1.2% probability that fewer than 5 IDs still see it. The probability that only a single ID did not block it yet is just 0.005%. In other words: if 99% of IDs became inactive and then active again after 10 years, approximately one would need about two years to be seen, and most would be detected again within 10 weeks. Therefore this scheme is robust against long-term inactivity.
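A rough check of these numbers in Python, treating each year as an independent 50% chance both for blocking and for seeing the update (the same simplification the estimate above uses); the resulting mean wait of roughly two months is consistent with the figure of about 10 weeks above:

```python
import math

# Probability that a given active ID has NOT blocked a 10-years-stale ID,
# assuming ~50% blocking probability per year, independently per year.
p_unblocked = 0.5 ** 10                   # ~0.001, i.e. 99.9% have blocked it
watchers = 10_000 * p_unblocked           # ~10 active IDs can still see it

# Each remaining watcher sees the update with ~50% probability per year.
# Modelled as a Poisson process with rate ln(2) per watcher and year, the
# expected wait until the first detection is 1 / (watchers * ln(2)).
wait_years = 1 / (watchers * math.log(2))
print(round(p_unblocked, 4), round(watchers, 1), round(wait_years * 52, 1), "weeks")
```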
This algorithm can give the distributed spam filter in Freenet a constant upper bound on its cost without limiting interaction.