1.10 A home on the Internet — Ludovic Courtès — Fault-tolerance in Mobile Systems
  • On the consistency problem in mobile distributed computing [gerraoui02:consistency]
    Rachid Guerraoui and Corine Hari, Proceedings of the second ACM international workshop on Principles of mobile computing, 2002
  • Impossibility of Distributed Consensus with one Faulty Process [fischer85:impossibility]
    Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson (Yale University, New Haven, CT, USA), appeared in Journal of the ACM, April 1985
  • Basic Concepts and Taxonomy of Dependable and Secure Computing [avizienis04:concepts]
    Algirdas Avizienis, Jean-Claude Laprie, Brian Randell, Carl Landwehr, IEEE Transactions on Dependable and Secure Computing, 2004

    The ``holy bible'' of the dependability community.

  • The Transaction Concept: Virtues and Limitations [gray88:transaction]
    Jim Gray (Tandem Computers, Inc., CA, USA), appeared in Readings in Database Systems, 1988
  • An Asynchronous Recovery Scheme based on Optimistic Message Logging for Mobile Computing Systems [park00:asynchronous]
    Taesoon Park and Heon Young Yeom, International Conference on Distributed Computing Systems, 2000
  • An Efficient Recovery Scheme for Mobile Computing Environments [park01:recovery]
    Taesoon Park, Namyoon Woo, Heon Y. Yeom (Sejong University, Korea; Seoul National University, Korea), Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), 2001

    This paper describes a checkpointing scheme for mobile nodes in a cellular network. Mobile hosts (MHs) are assumed to be in contact with static mobile support stations (MSSs) from time to time. The region covered by an MSS is called a cell; all communications among MHs are routed through MSSs. The idea is that MHs periodically send checkpoints to their current MSS. In addition, MSSs log all messages exchanged through them among MHs. Upon recovery, an MH can query its current MSS for its latest checkpoint; that query is then somehow satisfied by a set of MSSs (since the checkpoint may be distributed among them). Messages logged since the last checkpoint are then replayed.
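
    The checkpoint-plus-replay idea summarized above can be sketched as follows. This is only an illustration under simplifying assumptions, not the paper's protocol: a single support station holds the whole checkpoint (no distribution across MSSs, no hand-off between cells), and the names SupportStation and MobileHost are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SupportStation:
    """Hypothetical MSS: keeps checkpoints and logs messages routed through it."""
    checkpoints: dict = field(default_factory=dict)  # host id -> saved state
    logs: dict = field(default_factory=dict)         # host id -> messages since checkpoint

    def save_checkpoint(self, host_id, state):
        self.checkpoints[host_id] = dict(state)
        self.logs[host_id] = []  # earlier messages are covered by the checkpoint

    def route(self, host_id, message):
        # Log the message before handing it to the destination host.
        self.logs[host_id].append(message)
        return message

    def recover(self, host_id):
        return dict(self.checkpoints[host_id]), list(self.logs[host_id])


class MobileHost:
    """Hypothetical MH whose state is a running total of received amounts."""

    def __init__(self, host_id, station):
        self.host_id = host_id
        self.station = station
        self.state = {"total": 0}
        self.checkpoint()  # initial checkpoint, so recovery always has a base

    def receive(self, amount):
        self.state["total"] += amount

    def checkpoint(self):
        self.station.save_checkpoint(self.host_id, self.state)

    def recover(self):
        # Restore the latest checkpoint, then replay the logged messages.
        self.state, log = self.station.recover(self.host_id)
        for amount in log:
            self.receive(amount)


mss = SupportStation()
mh = MobileHost("mh1", mss)
mh.receive(mss.route("mh1", 5))
mh.checkpoint()
mh.receive(mss.route("mh1", 7))
mh.receive(mss.route("mh1", 1))
before_crash = dict(mh.state)

mh.state = {}  # simulate a crash that loses volatile state
mh.recover()
```

    Replay only reproduces the pre-crash state here because receive is deterministic; that determinism is an assumption of the sketch.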

  • The Transaction Concept: Virtues and Limitations [gray81:transaction]
    Jim Gray (Tandem Computers, Inc., CA, USA), Proceedings of the International Conference on Very Large Data Bases, September 1981

    First definition of the ACID (or rather, ACD) properties for transactions.

  • Porting NSA Security Enhanced Linux to Hand-held devices [coker03:selinux]
    Russell Coker, Linux Symposium 2003 Proceedings, 2003
  • Reliable Computing over Mobile Networks [rodrigues95:reliable]
    L. Rodrigues and H. Fonseca and P. Ver, Proceedings of the 5th Workshop on Future Trends of Distributed Computing Systems, 1995

    This paper discusses the possibility of extending large-scale network fault-tolerant mechanisms to mobile networks. In particular, it evaluates the feasibility of such extensions for the implementation of a total order protocol (needed to implement a replicated state machine) and a remote invocation protocol called GRIP. For message ordering, the paper advocates a hybrid approach between token-based protocols (where one or more nodes are in charge of message ordering) and symmetric protocols (which are fully decentralized -- FIXME: how does it work?): active processes issue order numbers for their own messages and for those of other, passive, processes. Regarding remote invocation, GRIP lets one plug in an implementation of causal messages, including transparent causal messages with relaxed semantics (i.e. such messages cannot cause other messages to be delayed).
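
    The notion of an active process issuing order numbers on behalf of others can be illustrated with a very small sketch. It assumes a single active process and reliable channels, which is a simplification of the hybrid scheme the paper describes; the class names ActiveProcess and Receiver are hypothetical.

```python
import heapq


class ActiveProcess:
    """Hypothetical active process: stamps messages with consecutive order numbers."""

    def __init__(self):
        self.next_number = 0

    def stamp(self, sender, payload):
        stamped = (self.next_number, sender, payload)
        self.next_number += 1
        return stamped


class Receiver:
    """Delivers stamped messages in order-number order, holding back early arrivals."""

    def __init__(self):
        self.expected = 0
        self.held = []        # min-heap keyed by order number
        self.delivered = []

    def on_message(self, stamped):
        heapq.heappush(self.held, stamped)
        # Deliver as long as the next expected number is available.
        while self.held and self.held[0][0] == self.expected:
            _, sender, payload = heapq.heappop(self.held)
            self.delivered.append((sender, payload))
            self.expected += 1


orderer = ActiveProcess()
messages = [
    orderer.stamp("passive-1", "a"),
    orderer.stamp("active", "b"),
    orderer.stamp("passive-2", "c"),
]

in_order = Receiver()
for m in messages:
    in_order.on_message(m)

shuffled = Receiver()
for m in (messages[2], messages[0], messages[1]):  # arrives out of order
    shuffled.on_message(m)
```

    Both receivers end up with the same delivery sequence regardless of arrival order, which is the property a replicated state machine relies on.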

  • Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems [prakash96:ckpt]
    R. Prakash, M. Singhal, appeared in IEEE Transactions on Parallel and Distributed Systems, October 1996
  • Support for Recovery in Mobile Systems [pedregal-martin02:recovery]
    Cris Pedregal-Martin, Krithi Ramamritham, appeared in IEEE Transactions on Computers, October 2002

    A checkpointing/recovery scheme for cellular networks, similar to park01:recovery.

  • Building Reliable Mobile-Aware Applications using the Rover Toolkit [joseph96:building]
    Anthony D. Joseph, Joshua A. Tauber, M. Frans Kaashoek (MIT Laboratory for Computer Science, Cambridge, MA, USA), Second ACM International Conference on Mobile Computing and Networking (MobiCom'96), 1996

    The Rover toolkit aims to provide a framework for recovering from transient failures of mobile nodes, such as communication link failures and (limited) client software/hardware failures -- nothing really unprecedented. It does not address recovery from hard faults (e.g. power loss, theft, etc.), non-transient faults, or server failures. To this end, Rover mainly provides two services: relocatable dynamic objects (objects that can be transferred from a server to a client so as to reduce client-server communication requirements -- I am doubtful about the usefulness of such a thing, all the more so since it is limited to interpreted languages, I guess) and queued RPCs. QRPCs are split-phase, non-blocking RPCs. An Access Manager (AM) is responsible for logging them so that, upon client recovery, it can resend QRPCs that have not been sent; additionally, the AM automatically retries sending queued RPCs when the network becomes reachable. Extensions for mobile applications include server logging of incoming and outgoing QRPCs as well as checkpointing of intermediate server state. State capture is provided by an extension of their target language, Tcl, that supports stable (i.e. persistent) variables. Rover stores any change to a stable variable's value; however, restoring the state upon recovery is left to the programmer. That's it.
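
    The queued-RPC mechanism can be pictured with the sketch below. It is not Rover's API (Rover targets Tcl); the AccessManager class, its methods, the JSON log file and the send callback are all assumptions made for illustration.

```python
import json
import os
import tempfile


class AccessManager:
    """Hypothetical access manager: logs queued RPCs and sends them when possible."""

    def __init__(self, log_path, send):
        self.log_path = log_path
        self.send = send  # callable(request) -> bool, True if the send succeeded
        self.pending = self._load()

    def _load(self):
        # After a client restart, unsent requests are read back from the log.
        if not os.path.exists(self.log_path):
            return []
        with open(self.log_path) as f:
            return json.load(f)

    def _persist(self):
        with open(self.log_path, "w") as f:
            json.dump(self.pending, f)

    def enqueue(self, request):
        # Non-blocking for the caller: the request is logged, not sent yet.
        self.pending.append(request)
        self._persist()

    def flush(self):
        # Send queued requests in order; stop at the first failure.
        while self.pending:
            if not self.send(self.pending[0]):
                break
            self.pending.pop(0)
            self._persist()


network = {"up": False}
sent = []


def send(request):
    if network["up"]:
        sent.append(request)
    return network["up"]


log_path = os.path.join(tempfile.mkdtemp(), "qrpc.log")
am = AccessManager(log_path, send)
am.enqueue("put x")
am.enqueue("put y")
am.flush()                                  # network is down: nothing is sent

restarted = AccessManager(log_path, send)   # simulate a client restart
network["up"] = True
restarted.flush()                           # network is back: the queue drains
```

    A crash between a successful send and the following log update would cause the request to be resent, so this sketch gives at-least-once delivery; requests would need to be idempotent or deduplicated by the server.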

  • A Survey of Rollback-Recovery Protocols in Message-Passing Systems [elnozahy02:survey]
    E. N. Elnozahy (IBM Research, Austin, TX, USA), appeared in ACM Computing Surveys, September 2002
  • A Simple and Efficient Implementation for Small Databases [birrell87:small-db]
    Andrew Birrell, Michael B. Jones, Ted Wobber (DEC Systems Research Center, Palo Alto, CA, USA), Proceedings of the 11th ACM Symposium on Operating System Principles, November 1987

    Describes a nice, simple approach to implementing small databases. The basic idea is to extend the programmer's high-level programming language with primitives for persisting strongly typed data structures, along with a transactional model that lets programmers specify when changes to those data structures should be committed. In other words, this is similar to what is currently often referred to as the object prevalence paradigm. Interestingly, the authors mention a simple implementation of transactional storage on top of a regular Unix-like file system which is similar to that used by GNU Arch (i.e., using rename, which is atomic, as a building block).
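
    The rename building block mentioned above can be sketched as follows. This only illustrates the write-to-temporary-then-rename idiom on a POSIX file system; it is not the paper's actual design, and the commit and load helpers as well as the JSON encoding are assumptions.

```python
import json
import os
import tempfile


def commit(path, data):
    """Replace the file at `path` with `data` so readers see the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory, hence on the same file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())    # push the new contents to disk before renaming
        os.replace(tmp_path, path)  # atomic on POSIX when both paths share a file system
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load(path):
    with open(path) as f:
        return json.load(f)


workdir = tempfile.mkdtemp()
db_path = os.path.join(workdir, "db.json")
commit(db_path, {"balance": 10})
commit(db_path, {"balance": 20})
```

    A crash before os.replace leaves the previous committed file untouched, which is what makes the rename usable as a commit point.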
