[LADIS 2009] Technical Session #5 – Communication
First Talk: Bulletin Board: A Scalable and Robust Eventually Consistent Shared Memory over a Peer-to-Peer Overlay by Gregory Chockler
WebSphere Virtual Enterprise (WVE) is a product for managing resources in a data center. The product is a distributed system whose nodes and controllers need to communicate and share information, and BulletinBoard (BB) is used for that. BB is a platform service for facilitating group-based information sharing in a data center. It is a critical component of WVE; its primary application is monitoring and control, but the designers believe it could be useful for other weakly consistent services.
Motivation & Contribution: The prior implementation of group communication, built internally on Virtual Synchrony, was not designed to grow tenfold: it suffered robustness and stability problems and high runtime overheads as the system grew beyond several hundreds of processes, and its static hierarchy introduced configuration problems. The goal was to provide a new implementation that resolves the scaling and stability issues of the prior one, and to do so in a short time (a constraint that had important implications for the design decisions).
BB supports a write-sub (write-subscribe) service model. It is a cross between pub-sub systems, shared memory systems, and traditional group communication systems. Like pub-sub, communication is asynchronous and organized around topics; like shared memory, writes have overwrite semantics, there is a single writer record per topic and process, and notifications are snapshots of state.
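As a rough sketch of what such a write-sub interface might look like (the class and method names below are invented for illustration, not the product's actual API), a single overwritable slot per (topic, writer) with snapshot notifications could be expressed as:

    from collections import defaultdict
    from typing import Callable, Dict, List

    Snapshot = Dict[str, str]   # writer process id -> that writer's latest value for a topic

    class WriteSubSketch:
        """Hypothetical write-sub service: one overwritable value per (topic, writer)."""

        def __init__(self) -> None:
            self._state: Dict[str, Dict[str, str]] = defaultdict(dict)                  # topic -> {writer: value}
            self._subs: Dict[str, List[Callable[[Snapshot], None]]] = defaultdict(list) # topic -> callbacks

        def write(self, topic: str, writer: str, value: str) -> None:
            # Overwrite semantics: a new write replaces this writer's previous value.
            self._state[topic][writer] = value
            self._notify(topic)

        def subscribe(self, topic: str, callback: Callable[[Snapshot], None]) -> None:
            # Subscribers are notified with snapshots of topic state, not with deltas.
            self._subs[topic].append(callback)
            callback(dict(self._state[topic]))   # initial state transfer

        def _notify(self, topic: str) -> None:
            snapshot = dict(self._state[topic])
            for cb in self._subs[topic]:
                cb(snapshot)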
Consistency Semantics (single topic). PRAM consistency: the snapshots notified to a process are consistent with each writer's own order of writes (a single process's writes are never observed out of order). A note was made that the developers who built services on top of BB turned out to understand these consistency semantics.
Liveness Semantics (single topic). Eventual inclusion: eventually, each write by a correct and connected process is included in the notified snapshots. Eventual exclusion: failed processes are eventually excluded from updates.
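To make the PRAM guarantee concrete, here is a small invented illustration (the process names and values are hypothetical):

    # Writer P issues p1 then p2; writer Q issues q1 then q2 on the same topic.
    # One subscriber might observe the following snapshot sequence:
    allowed = [
        {},                       # before any write is delivered
        {"P": "p1"},              # P's first write arrives
        {"P": "p1", "Q": "q1"},   # Q's first write arrives
        {"P": "p2", "Q": "q1"},   # P's second write overwrites p1
        {"P": "p2", "Q": "q2"},   # eventual inclusion: every writer's latest value appears
    ]
    # Another subscriber may see P's and Q's writes interleaved differently, but no
    # subscriber may observe p2 and then later a snapshot containing p1: a single
    # writer's writes are never reordered, which is exactly the PRAM guarantee.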
Performance and Scalability Goals: adequate latency and low, scalable runtime overhead; throughput is less of an issue because the management load is fixed and low. Robustness and scalability are required in the presence of a large number of processes and topics (2883 topics in a system of 127 processes; the initial target was around 1000 processes).
Approach: the decision was to build BB on an existing overlay network, the Service Overlay Network (SON). SON is a semi-structured P2P overlay already present in the product; it is self-* (it recovers quickly from changes), resilient, and supports peer membership and broadcast. The research question was whether BB can be implemented efficiently on top of a P2P overlay like SON.
Architecture: SON at the bottom, IAM (interest-aware membership) built on top of it, and BB on top of that (though BB can also interact directly with SON).
Reliable Shared State Maintenance in SON for BB: the mechanism is fully decentralized, and update propagation is optimized for the bimodal distribution of topic popularity: iterative unicast over direct TCP connections when the number of subscribers of a topic is below a certain threshold, and overlay broadcast otherwise. For reliability, the latest written value is periodically refreshed (on a long cycle) if it has not been overwritten (a decision the speaker called bad in retrospect), with state transfer to new or reconnected subscribers.
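A minimal sketch of that propagation choice, assuming a hypothetical threshold and transport helpers (UNICAST_THRESHOLD, unicast, and overlay_broadcast are invented names; the talk did not give the actual cutoff):

    UNICAST_THRESHOLD = 8   # hypothetical cutoff between "unpopular" and "popular" topics

    def propagate_update(topic, value, subscribers, unicast, overlay_broadcast):
        """Choose a propagation mode based on topic popularity (bimodal distribution)."""
        if len(subscribers) < UNICAST_THRESHOLD:
            # Unpopular topic: iterative unicast over direct TCP connections.
            for peer in subscribers:
                unicast(peer, topic, value)
        else:
            # Popular topic: a single broadcast over the SON overlay.
            overlay_broadcast(topic, value)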
An experimental study on different topologies showed low CPU overhead and latency, but both grew as the topology increased in size. Analysis revealed that this was because the periodic refreshes were stacking up, which increased CPU and latency overheads. An additional problem was broadcast flooding; once that was removed, CPU and latency overheads stayed flat as the topology grew.
Lessons learned: communication cost is the major factor affecting the scalability of overlay-based implementations, and anti-entropy techniques are the best fit for such services.
Second Talk: Optimizing Information Flow in the Gossip Objects Platform by Ymir Vigfusson
In gossip, nodes exchange information with a random peer periodically, in rounds. Gossip has appealing properties: bounded network traffic, scalability in group size, robustness against failures, and coding simplicity. These properties hold when gossip is considered per application in isolation; in cloud computing, where nodes join many groups, traffic is no longer bounded per node (only per topic).
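A toy sketch of a single-group gossip round under these assumptions (the class below is generic and illustrative, not GO code):

    import random

    class GossipNode:
        """Toy single-group gossiper: once per round, push known rumors to one random peer."""

        def __init__(self, name, peers=None):
            self.name = name
            self.peers = peers or []   # other GossipNode instances in the same group
            self.rumors = set()        # rumors this node currently knows

        def receive(self, rumors):
            self.rumors |= set(rumors)

        def gossip_round(self):
            # Per-node traffic is bounded: one message to one random peer per round.
            if self.peers:
                random.choice(self.peers).receive(self.rumors)

Running one such instance per group on every node is what breaks the per-node bound once nodes join many groups, which is the problem GO addresses.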
The Gossip Objects (GO) platform is a general platform for running gossip for multiple applications on a single node; it bounds the gossip traffic leaving that node. The talk focused on how to select which rumors from the multiple applications on a node to send out, so as to reduce the number of messages. This is possible because rumor messages are small and short-lived, so several can be stacked into a single packet. An observation made is that rumors can also be delivered indirectly: uninterested nodes can forward rumors to interested nodes.
The GO heuristic: recipient selection is biased towards hosts with higher group traffic. Content is selected by computing the utility of a rumor, defined as the probability that the rumor will add information to a host that did not already know it.
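A rough sketch of utility-driven rumor stacking as described (the data layout and the p_unknown field are assumptions standing in for GO's actual probability model):

    def stack_rumors(rumors, mtu=1400):
        """Greedily pack the highest-utility rumors into one MTU-sized gossip message.

        Each rumor is a dict with a 'payload' (bytes) and 'p_unknown', an estimate of
        the probability that the rumor adds information at the chosen destination.
        """
        ranked = sorted(rumors, key=lambda r: r["p_unknown"], reverse=True)
        message, used = [], 0
        for rumor in ranked:
            size = len(rumor["payload"])
            if used + size <= mtu:     # rumor stacking: fill the packet up to the MTU
                message.append(rumor)
                used += size
        return message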
Simulation: a first simulation of an extreme example, with only two nodes joining many groups, showed promising results for the GO heuristic. A real-world evaluation was then conducted on a 55-minute trace of the IBM WebSphere Virtual Enterprise BulletinBoard layer. The trace had 127 nodes and 1364 groups, and the evaluation showed that GO capped traffic compared to the random and random-with-stacking heuristics. Additionally, the GO heuristic delivered rumors faster than the other heuristics, and it reduced the number of messages needed to deliver rumors to interested nodes by multiple orders of magnitude compared to the other heuristics and to traditional rumor spreading.
Conclusion: GO implements novel ideas such as per-node gossip, rumor stacking (packing rumors up to the MTU size), utility-based rumor dissemination, and adaptation to traffic rates. GO gives per-node guarantees even as the number of groups scales up. The experimental results were compelling.
Questions:
Mike Spreitzer, IBM Research: What would happen if the number of groups increases?
Answer: A study of the available real-world traces showed a pattern of overlap. We also ran simulations with other group membership patterns, and the results were similar.
—: What was the normal rumor size? And what would happen if that increased?
Answer: The average rumor size was 100 bytes. If the message size increased we would stack fewer rumors, but the platform can also reject very large rumors.
—: Have you thought about network-level encoding?
Answer: Not yet, but we plan to in the future.
—: Have you thought of leveraging other dissemination techniques to run under GO?
Answer: Actually, we thought about the opposite direction where we would run other communication protocols and map them under the hood to GO. Results are pending.