Point-To-Point Multicast Atom Network
From CommerceNet Wiki
(the first version of this page was by Kragen Sitaker); see also Publications
We've been informally calling this proposal "Benzine" because the title of this page is too long and generic.
[edit] Point-to-point multicast event notification for Atom
Suppose neighbors Adam, Beatrice, and Connie all read some Atom feeds (Atom is the current version of RSS), with some overlap, but not exactly the same set. Atom works exclusively by means of polling, which means there's an inverse relationship between observation latency and bandwidth usage --- and half the bandwidth usage is paid for by the publisher of the Atom feed. Standard practice is to poll the feed every hour, but this is starting to cause performance problems.
It's wasteful for each of Adam, Beatrice, and Connie to poll the same feed, each once an hour (so once every twenty minutes, on average), but still each getting the same observation latency as ever. It would work much better if, when Adam polled a feed and found an update, he told Beatrice and Connie about the update --- effectively giving everyone twenty-minute observation latency.
Clearly, this involves some new security and performance risks, which I will discuss below, in the "Threat model" and "Miscellaneous Desiderata" sections.
All three establish TCP connections with each other, and tell the others what feeds they subscribe to. Now, when Beatrice discovers that a feed has changed, either by receiving a message from Connie or by polling it themselves, she sends a message announcing this to Adam (if he's told her he's interested) and Connie (if she's interested and hasn't seen the message.) Note that this requires at least one of the neighbors to remember what messages they've received or sent in the past to avoid a forwarding loop.
If Adam tells Beatrice he's interested in a particular topic, Beatrice doesn't necessarily propagate that expression of interest to Connie, but under certain circumstances described later, she might.
I'll refer to the set of nodes connected in this way as "the crowd" (after Reiter and Rubin's Crowds system) and the nodes that have direct connections to my node as my "neighbors".
[edit] Generic threat model
[edit] Attacks
I don't want anybody to know what feeds I subscribe to; I don't want anybody to be able to make me think a feed has been updated when it hasn't, or hasn't been updated when it has; I don't want the system to make denial-of-service attacks easier, either on participants or nonparticipants, but I don't think I can design a system that makes DoS attacks impossible on top of IP.
I don't care if someone can tell who I'm talking to.
[edit] Adversaries
There are five major classes of adversaries to defend against:
[edit] Bill, the abusive husband
Bill installs software as root on his wife's computer to make sure he knows everything she does with it: what she reads, what she writes, when she uses it.
[edit] Echelon
Echelon can see all traffic in and out of every node on the network. Optionally, it can also selectively shut off and insert traffic. (This is a stronger adversary than the real-world Echelon system, which only monitors a small number of links.)
[edit] The BOFH
The Bastard Operator From Hell can see all traffic in and out of a particular node on the network, and also shut off and insert traffic.
[edit] Rogue Nodes
Rogue nodes are nodes engaged in a conspiracy against you: to watch what you read, to censor what you read, to falsify what you read, and to waste your effort.
[edit] The Paranoid Fantasy
The Paranoid Fantasy is a network in which all the nodes in the crowd, except your own, are rogue nodes, engaged in a single-minded conspiracy against you.
[edit] Impossibilities
There's nothing you can do about Bill, except to leave him.
Echelon and the BOFH can effectively cut you off from the whole crowd, but it is possible to build a system that does not allow them to selectively censor information --- except by putting you in the world of the Paranoid Fantasy. However, they can alter or censor Atom feeds you poll directly.
The Paranoid Fantasy can prevent you from using the network to gain anything useful, but you can still poll the feeds yourself. However, as concerns your communication with the network, you have only the right to remain silent; anything you say to another node may be used against you in a court of law.
Any system that can't deal with some fraction of rogue nodes can't function. It should be possible to provide all the security properties at modest cost in a system with rogue nodes.
[edit] Miscellaneous desiderata
Polling a lot of Atom feeds every hour uses a lot of bandwidth, so it's fine if a better solution still uses a lot of bandwidth, as long as it's not more. However, it needs to support networks including nodes of widely varying bandwidths --- from 10kbps to 10Gbps, at least.
Even feeds with relatively few subscribers ought to have their service improved by this system.
It would be nice if, in addition to providing quicker updates, the system also resulted in less traffic on the origin web site, so as to decrease the cost of publishing an Atom feed. A feed with 600 subscribers, for example, could have each subscriber poll, say, once every 20 hours, achieving observation latency of two minutes while reducing costs to the publisher by a factor of 20.
[edit] Privacy by padding
I subscribe to many email mailing lists I rarely read. But someone watching my network usage wouldn't be able to tell which lists I read avidly, but never post to, and which I never read at all. Now, because I actually subscribed to these lists because I was interested in them, an observer could form a fairly accurate picture of my interests from the list of mailing lists --- but, if I wanted to protect my privacy in this way, I could just as well have subscribed to a lot of randomly selected mailing lists. (If there were a random way to select mailing lists, that is.) A BOFH trying to infer my interests would have a hard time figuring out which 1% of the mail I received was actually of interest to me.
However, A BOFH aware of which mailing lists related to similar interests might still be able to figure it out --- if they knew which lists I was subscribed to, what population they were randomly selected from, and with what random distribution, they could compute the probability that a particular set of closely related mailing lists was subscribed to at random. I think this is a marginal attack, but it's worth mentioning.
In the case of mailing lists and Atom feeds, this technique is very inconsiderate. If everyone subscribes to 100 times as many (random) mailing lists as they're actually interested in, then each mailing list has to send out (on average) 100 times as much mail as they should. But if you're a student at a university, it's perfectly reasonable to subscribe to 100 times as many Usenet newsgroups as you actually read, since there's essentially no cost to this subscription.
[edit] Encryption
Point-to-point connections inside the crowd should be strongly encrypted with MACs to make life harder for the BOFH and Echelon. This prevents them from examining or altering traffic, reducing their capabilities to cutting off connections and performing traffic analysis.
[edit] Prefix matching for privacy
If Adam tells Beatrice he's interested in an Atom URL about the fight against Scientology, and Beatrice happens to be a rogue node, she may report him to the Scientologists, and Adam may be investigated and become the victim of a "fair game" character-assassination campaign. Padding privacy in this case may be helpful: if Adam instead tells Beatrice he's interested in a lot of different Atom feeds, the Scientologists may leave him alone, not knowing whether he's really a supporter of Xenu.net.
However, Adam may not have a good way to come up with a random list of "padding" Atom URLs. Suppose instead he tells Beatrice, "I'm interested in any feed whose URL's SHA-1 checksum begins with the eight bits 01001101." Now Beatrice knows only that he's interested in some random 1/256 of the feeds in the world, and will forward him about 1/256 of her incoming messages, including the ones about Scientology.
If Adam feels he is getting too much traffic from Beatrice, he can tell her, "Instead of sending me everything whose SHA-1 checksum begins with the eight bits 01001101, send me everything whose SHA-1 checksum begins with the nine bits 010010110." Now he will get about half as much traffic from Beatrice.
[edit] Bloom filters for privacy
Another possibility is for Adam to give Beatrice a Bloom filter of all of his interests, which will tend to be more compact than sending her a list of a bunch of bit prefixes.
[edit] Proxying for privacy
If Adam's neighbors Beatrice and Connie aren't neighbors of one another, but they share an interest in a particular Atom feed, the system as described so far will not help them. Adam, however, can decide to express to them an interest in that same feed, making it possible for him to relay messages between them on it. If he does this occasionally, he has a second level of privacy: if he's accused of reading about Scientology, he can claim that he's actually only proxying the Scientology feed for a friend, as in Rubin and Reiter's Crowds system.
[edit] Tit-for-tat equalization
In order to function in the presence of rogue nodes, the system must not allow them to consume unbounded scarce resources, such as bandwidth, on the non-rogue nodes. I propose a mechanism based on pairwise reciprocity, similar to those in BitTorrent and Digital Silk Road.
Each node maintains a goodwill balance for each of its neighbors, which starts at 0. Whenever Connie does Beatrice a favor, such as by sending her something she had requested, or by accepting a request of hers, Connie's goodwill balance goes up. When Beatrice does Connie a favor, Connie's goodwill balance goes down. Your goodwill balance in my eyes and my goodwill balance in your eyes should add up to 0, at least approximately.
I maintain a goodwill credit limit for each of my neighbors. If your goodwill balance becomes more negative than the credit limit, I stop doing favors for you, because you aren't reciprocating; at a sufficiently low level, I will drop the connection and cease to be your neighbor, so that I don't waste any more resources processing requests from you that I don't intend to fulfill.
The goodwill credit limit starts as a small negative number, and becomes a larger negative number over time.
[edit] Misunderstandings
Occasionally, you and I will have different opinions on whether I have just done you a favor or not, perhaps because I sent you something I thought you wanted, but you didn't really want it, or I accidentally sent you a message you'd just sent me, or resent a message I'd already sent you. Usually this will result in a negative total goodwill balance between the two of us --- a number that is supposed to be zero, according to the above. Real people use two strategies to handle these misunderstandings: feedback, such as facial expressions, and forgiveness, which can take place either over time or over further transactions.
Feedback, in this system, consists of further communication after each transaction, where each of us tells the other what we think the direction and magnitude of the favor was. Whichever node can convince the other of its point of view then gains an advantage (assuming the outcome without feedback would have been negative).
Forgiveness over time could be implemented as a small goodwill increment applied to every account every time unit, or perhaps only to negative accounts.
Forgiveness over further transactions could be implemented as a small increase in the net goodwill balance resulting from each transaction, corresponding to the notion that it's less trouble for me to give you a ride than it would be for you to walk home. This allows us to directly represent some of the excess value derived from cooperation directly in the system, and may remove the need to increase karma credit limits; but the excess value that remains as goodwill in this way should only be a small fraction of the total excess value.
If nothing is done to repair misunderstandings, eventually every relationship will be destroyed by the accumulation of negative goodwill. However, all the approaches to resolving misunderstandings described above increase the amount of resources rogue nodes can unproductively consume.
[edit] Requests as offers
Once the fulfillment of requests for future information generates measured goodwill points, it makes sense to represent the requests themselves as offers of payment: "I will pay you 5 goodwill points for each message containing an update to any feed whose URL's SHA-1 checksum begins with the eight bits 01001101, for the next thirty seconds." (Perhaps represented as five bytes: 3f 05 08 8d 1e.) The point value --- probably a single extra byte per request --- makes it possible to calculate the expected goodwill payoff for any particular message I send you, in advance. Without this, feedback messages are necessary; with it, only standardized pricing for accepting request messages is necessary.
[edit] Window sizing
When I send you a request for some information, including an offer to give you goodwill each time you fulfill it, I don't yet know how much data you will send me. Because of privacy padding, you may send me a hundred times as much information as I really want, or perhaps a thousand. If I was hoping for a hundred and you send me a thousand, it may overwhelm my dialup modem line for a while, so perhaps my offer should include a maximum window of bytes it's good for without renewal: "6 goodwill points per message on URLs whose SHA-1 begins with 01001111, for the next 60 seconds, for up to 5000 bytes." (Perhaps this is 3f 06 08 8f 3c a3 04.) This allows me to limit my goodwill and bandwidth losses if you have a lot more stuff to send me than I want, and it's closely analogous to TCP window sizing.
[edit] Delays
Some delay is needed to frustrate traffic analysis in this system; otherwise, Echelon can watch encrypted data propagating through the system from its origin to its destinations to determine who has subscribed to which topics. This is a well-known requirement for anonymous remailers, and it is discussed in Chaum's original Mix paper, in the form of a "batching" requirement.
[edit] Simple pricing
In the system as described so far, individual nodes have a lot of freedom in their pricing strategies. To build the system initially, though, we need a simple pricing and selling strategy that could possibly yield a stable and robust system.
I believe that we lose no generality by assuming that Adam is equally happy to receive 5 goodwill points from Beatrice as from Connie, as long as both goodwill balances are relatively far from their goodwill credit limit.
Pricing should start relatively low and increase over time until the desired level of signal is reached. For resistance to censorship and falsification, this probably means I'm receiving each update message several times from each of several different neighbors, who I hope are independent. Eventually, I might start valuing minority opinions less.
Initially I might be just as happy to receive a message about "01001101" from either Beatrice or Connie, and therefore offer 5 points to each of them, but if I learn over time that Beatrice sends me more noise but the same or less signal, then I might offer Beatrice 5 points per message and Connie 10 (since I have to pay them whether it was a noise 01001101 message or a signal 01001101 message.) This tactic is necessary to prevent rogue nodes from getting paid unlimited amounts for generating random noise without relaying any signal.
(There's a tension here between feedback and privacy-against-paranoid-fantasy, which gets worse when you introduce proxying.)
The total cost of being interested in a feed includes three items: the goodwill cost of receiving update notifications on it, the goodwill cost of receiving update notifications that hash-collide with it, and the cost of sending out occasional expressions of interest. Accordingly the number of neighbors to which expressions of interest are sent should be incrementally increased over time, along with the pricing, until the desired traffic level is reached.
Generally, if Beatrice offers Adam 5 points for information about "7e", and he decides to ask his friends about "7e" on her behalf, he should offer them less than 5 points, perhaps dramatically less. First, if Connie and David send him the same message, he has to pay them each 5 points, but then only gets paid 5 points once by Beatrice --- she won't pay for the duplicate message. Second, he's incurring some costs by sending out offers.
Each node has a bandwidth budget, and it periodically decides how to spend it to generate the most goodwill. Using the greedy strategy for fulfilling requests is probably fine. Deciding which requests to send requires some kind of probability model that the request will be fulfilled. It's not immediately obvious to me how simple the probability model can be while remaining relatively thrifty.
When I'm receiving too much noise from you on some subscription, I have at least two ways to fix it: I can offer you less goodwill, or I can be more specific about what I want, costing me privacy. Probably I should increase my specificity until most of my feeds get down to a manageable noise level, then increase the goodwill offered to those neighbors.
[edit] Security properties of this design
So far I haven't discussed bootstrapping --- how you join the crowd and how you establish links with your neighbors --- or recovery from network failures. If bootstrapping and recovery can be attacked, the whole network can be attacked.
Echelon and the BOFH can only guess which traffic flows along which links by means of traffic analysis. The use of delays and padding limits the amount of information they can gather in this way, and consequently limits what they can censor.
Echelon, however, can still censor feeds as they enter the crowd, and they can still retaliate against nodes that poll forbidden feeds by cutting off their network traffic or sending their owners to Siberia. In the system as described so far, nobody ever polls a feed they're not themselves interested in.
Rogue nodes are far more powerful, but their power is still limited.
They can inject noise into their immediate neighborhood and get paid for it for a while, but gradually their neighbors will stop listening to them, and eventually they will lose all of their links. To some extent, they will damage their neighbors' reputations as well. (You have to be careful about passing along gossip!)
They can decline to forward traffic, but the traffic will find another route. However, by selectively declining to forward traffic, they may be able to cause other nodes to be more specific in order to increase their signal-to-noise ratios.
Leeches, who never forward any traffic and never poll any feeds themselves, will eventually be ostracized.
As far as I can tell, the system does not afford any new denial-of-service attacks. It propagates some very limited probabilistic information about who's interested in which topics, but it actually makes it harder for Echelon or the BOFH to tell who's interested in which topics.
The system as described has no automatic introduction, and therefore no way for anyone to discover where all the nodes are (in order to DoS them), but some kind of advertisement or automated introduction is probably necessary for bootstrapping, and that may be a weak point. In particular, a leech can connect to the system periodically, run up their credit limit with all their new friends, disconnect, and start again; you can limit the effect of this on the existing crowd by limiting the influx of newcomers, but you can't prevent leeches from making it harder for helpful newcomers to join the system, and the less leeching strangers damage the existing crowd, the more they damage helpful newcomers.
[edit] Performance properties of this design
When I'm interested in some feed, I gradually increase my goodwill offer until enough of the system knows about it that I'm receiving several messages for each update (modulo people's reluctance to repeat gossip). Requests for information about unpopular feeds might propagate out two or three hops before I find others interested in them, but I'm paying for all those millions of offer messages with the favors I'm doing for my friends.
[edit] Misc.
Paul Visscher points out that the bandwidth used by a feed is proportional to its update rate, but may not be proportional to its value; this may be a problem, particularly in the presence of malicious feeds.
Paul Visscher also points out that rapidly updating feeds can give rise to multiple simultaneous "most recent updates" floating around the network, which could reduce the credibility of some nodes. (I haven't talked about credibility explicitly, but I think I need to.)
[edit] Related
Zooko mentions in a post to mnet-devel that Peter Marbach of U. Toronto has some theoretical economic results that suggest a simple pricing algorithm for systems somewhat like this.
