I propose here a business plan for storing other people’s data. An enterprise called a data bank would accept, store and return data under a variety of contracts covering several contingencies.

This Slashdot article and the ensuing comments illustrate the frustrations of the individual attempting to provide this service for himself. It is just not economically justified unless one is indeed savvy on current and past storage devices and formats. Even then the cost of attending to this task two or three times a decade makes it a precarious enterprise. Intergenerational projects may well be doomed.

The purpose of a data bank or archive is to store large bodies of information for long periods of time. I suggest here some protocols and contracts for a data bank and its customers. We then discuss risks, incentives and stratification of the data storage industry. Eric Hughes and jpp@markv.com have pointed out conceptual blunders and vulnerabilities in earlier versions. Perhaps they have been corrected here.

The data bank described here is oblivious of the nature of the data it keeps. A data collection is identified by its secure hash. The bank is motivated mainly by the penalties called for in several sorts of contracts it has signed and whose terms are described below. The bank need not even remember who owns the data. It should keep copies of contracts that it has signed; they are all short.

Here are several transactions that a data bank engages in.

Acquire data:
A client anonymously sends a collection of data along with funds sufficient to warrant the bank’s computing its secure hash and holding the data for a few days. The bank knows the data only by its secure hash.
Publish index:
The bank can publish its list of hashes. (This enables data hunters.)
Sell Data:
Anyone can request a piece of data identified only by its secure hash. The bank is free to sell a copy of the data to anyone with the secure hash. The bank negotiates the price.
Selling a (Hat) Check:
The bank will sell a check to anyone who will pay a negotiated price. The check specifies the secure hash of the data, the cost of redeeming the data, and the penalty (liquidated damages) to be paid by the bank upon failure to produce the data. A client proposes the details of a check as follows: Send (SH(acquisition), redemption price, penalty, SH(Secret)) to the bank along with a proposed price. ‘Secret’ is a secret random number chosen by the client for this negotiation. If the bank agrees, it signs and trades the signed message for the proposed price, or it may propose another price. The signed message is the check and is a bearer instrument. The bank can sell multiple checks for the same data.
Cancel a check:
A holder of a check may sell it back to the bank at a negotiated price thus releasing the bank from the risk of paying a penalty in the future. The check is canceled by the mere fact that the bank learns and will remember the Secret that produced the SH(Secret) in the check. The bank need not otherwise acquire or maintain a signed check revocation. This also allows the bank to retrieve the physical storage where the data is stored if it is sure that it has not sold other checks for the data.
Access Data:
Any holder of a check can present the check and the redemption fee and demand the data. The data bank must then either deliver the data or pay the penalty.
Pay Penalty (liquidated damages):
The bank trades the amount of the penalty for the Secret of SH(Secret). A particular check is canceled whenever the bank pays the penalty, like a spent Chaum DigiCash note. Alternatively the bank refuses, offering to provide the data instead. As a third alternative, the bank proves that the check is already canceled by stating its Secret. A zero knowledge proof might be used here to prove to the bank that the client in fact holds the Secret. Penalty negotiations can thus proceed without revealing the Secret to the bank. Since the penalty amount is probably the largest sum involved in these transactions, this transaction is the most likely to require escrow service. It is also the most difficult transaction to carry out anonymously.
Bank proves it has certain data: (added April 2009)
For some negotiated fee the bank prepends a pattern provided by the client to the stored data and returns the hash of the result. For this to be useful the client must have computed this hash while he still had access to the data. This keeps the bank from pretending it has the data in order to sell more hat checks while hoping the data will never be claimed. This transaction is in line with the dictum that when you contract for a future service, you need to ascertain that the supplier is competent to serve and has the incentive and a business plan to do so.
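The possession challenge just described can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the secure hash (the text names no algorithm), and the name `possession_response` is hypothetical.

```python
import hashlib
import os


def possession_response(pattern: bytes, data: bytes) -> str:
    """Hash of the client's pattern prepended to the stored data."""
    return hashlib.sha256(pattern + data).hexdigest()


# Client side, while it still holds the data: precompute a few
# (pattern, expected hash) pairs and keep only those pairs.
data = b"example acquisition"
challenges = []
for _ in range(3):
    pattern = os.urandom(16)
    challenges.append((pattern, possession_response(pattern, data)))

# Later, the client spends one pair per audit.
pattern, expected = challenges.pop()
honest_answer = possession_response(pattern, data)   # bank still has the data
bluff_answer = possession_response(pattern, b"")     # bank has lost the data

print(honest_answer == expected)
print(bluff_answer == expected)
```

Each pattern is probably best used only once, since a bank could remember its earlier answer instead of recomputing it from the data.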
Checks may specify expiration dates, cancellation terms, etc. The bank is explicitly permitted to disseminate the data and may well do so to lay off and reduce risks. In this sense a data bank is like an insurance company that spreads and shares risks. A check may be viewed as a life insurance policy for the data. Penalties for delayed data delivery might also be specified. This would make it easier for the bank to sub-contract data storage.
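The life of a hat check, from sale to cancellation, might look like the sketch below. Several things here are assumptions and not part of the proposal: SHA-256 plays the role of SH, an HMAC under a bank-held key is a stand-in for the bank's public-key signature (a real bearer instrument would need a signature that anyone can verify), and the names `Check` and `Bank` are hypothetical.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


def sh(data: bytes) -> str:
    """Secure hash SH(); SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class Check:
    data_hash: str         # SH(acquisition)
    redemption_price: int
    penalty: int
    secret_hash: str       # SH(Secret)
    signature: str         # stand-in for the bank's signature


class Bank:
    def __init__(self) -> None:
        self._key = os.urandom(32)   # stand-in signing key
        self.canceled = {}           # SH(Secret) -> Secret

    def _sign(self, data_hash, redemption_price, penalty, secret_hash) -> str:
        msg = f"{data_hash}|{redemption_price}|{penalty}|{secret_hash}".encode()
        return hmac.new(self._key, msg, hashlib.sha256).hexdigest()

    def sell_check(self, data_hash, redemption_price, penalty, secret_hash) -> Check:
        """Sign the client's proposed terms (price negotiation is omitted)."""
        sig = self._sign(data_hash, redemption_price, penalty, secret_hash)
        return Check(data_hash, redemption_price, penalty, secret_hash, sig)

    def is_genuine(self, check: Check) -> bool:
        expected = self._sign(check.data_hash, check.redemption_price,
                              check.penalty, check.secret_hash)
        return hmac.compare_digest(expected, check.signature)

    def cancel(self, check: Check, secret: bytes) -> bool:
        """Learning the Secret is what cancels the check."""
        if not self.is_genuine(check) or sh(secret) != check.secret_hash:
            return False
        self.canceled[check.secret_hash] = secret
        return True

    def is_canceled(self, check: Check) -> bool:
        return check.secret_hash in self.canceled


bank = Bank()
secret = os.urandom(32)   # chosen by the client, never sent at purchase
check = bank.sell_check(sh(b"example acquisition"), 5, 1000, sh(secret))
```

Note that the bank stores nothing when it sells the check; it only needs to remember Secrets it has learned.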

Risks

Trust may be divided by agreeing on an escrow agent. Upon redemption the bank examines the check to see if it has been canceled. If the bank knows the Secret which produced the SH(Secret) of the check, the check is canceled. If the client is playing by the rules, the client and bank can proceed thru the redemption transaction without escrow and with only the redemption amount at risk of bank fraud. Alternatively a mutually trusted escrow agent takes the check, accepts the redemption payment specified therein from the client, and relays the data from the bank to the client, computing the secure hash as the data passes thru. If the secure hash matches that in the check the escrow agent delivers the payment to the bank. If the hash fails to match, the transaction is aborted and a penalty transaction begins. The bank delivers the penalty to the escrow agent and the client delivers the Secret to the escrow agent. If the hash of the Secret matches that in the check then the escrow agent delivers the Secret to the bank (canceling the check) and the penalty to the client. The escrow agent need not have long term financial stability as must the bank.
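The escrow agent's decision in the redemption flow above can be sketched as a single function. This is a simplification under stated assumptions: SHA-256 stands in for the secure hash, every deposit (payment, data, penalty, Secret) is treated as already in the agent's hands, an aborted redemption refunds the client's payment, and the name `escrow_settle` and the dictionary layout are hypothetical.

```python
import hashlib


def sh(data: bytes) -> str:
    """Secure hash; SHA-256 is an assumption, the text names no algorithm."""
    return hashlib.sha256(data).hexdigest()


def escrow_settle(check: dict, data_from_bank: bytes,
                  secret_from_client: bytes) -> dict:
    """Return what the escrow agent hands to each party."""
    price = check["redemption_price"]
    if sh(data_from_bank) == check["data_hash"]:
        # Data is good: client gets the data, bank gets the payment.
        return {"outcome": "redeemed",
                "client": {"data": data_from_bank},
                "bank": {"money": price}}
    # Hash mismatch: redemption is aborted (payment refunded, an
    # assumption) and the penalty transaction begins.
    if sh(secret_from_client) == check["secret_hash"]:
        # Secret is good: client gets the penalty, bank learns the
        # Secret, which cancels the check.
        return {"outcome": "penalty_paid",
                "client": {"money": price + check["penalty"]},
                "bank": {"secret": secret_from_client}}
    # Neither side proved its case; nothing changes hands.
    return {"outcome": "aborted", "client": {"money": price}, "bank": {}}


data = b"example acquisition"
secret = b"client's random secret"
check = {"data_hash": sh(data), "redemption_price": 5,
         "penalty": 1000, "secret_hash": sh(secret)}

good = escrow_settle(check, data, secret)
bad = escrow_settle(check, b"corrupted", secret)
print(good["outcome"], bad["outcome"])
```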

Inflation can damage incentives. Checks might be denominated in gold or currency baskets or whatever.

RSA modulus size is critical for long term contracts. 2K bits of modulus or more may be warranted.

Example

I can imagine the Getty Museum digitizing its Rembrandts and storing the results in a data bank. The data might be insured for $10,000,000. The bank would disseminate the data to increase security and lower its risk. The museum would probably encrypt the data and share the encryption key and hash à la Shamir for safe keeping. The museum would not share the Secret of the check because it wants to be the one paid upon default, and it wants to be sure that no one else cancels the check. It might disseminate the check but not the Secret of the check to others so that they are assured of getting the redemption price for accessing the data.

Incentives

A data bank, or any other player, may find it profitable to keep the data beyond the point of any uncanceled checks. It can make money by selling copies of the data. Data banks thus have an incentive to disseminate their list of holdings in the form of hashes, to support data hunters.

Eric Hughes notes an incentive for a bank to take the money and run at some point. If faced with a $10,000,000 penalty, a data banker may be unable or unwilling to pay. If escrow is used then the client who holds a check for the lost data can only damage the reputation of the bank. The reputation may be worth less than the $10,000,000. If the data bank is one and the same as some institution already required to have long term financial stability, this is correspondingly less of a problem. Sellers of life insurance policies and earthquake insurance are in this category. See “Deep pockets”, below, as a suggestion for a solution to this problem.

Design Considerations

It may seem strange that the data bank is willing to sell data to whoever will pay. I suggest this because it is so easy to encrypt the data and not have to trust the bank. You can distribute the key thru whatever channels you transmit the secure hash of the data.

Note that bank clients are always anonymous. Data is never held for some known person. Data may be held solely for speculation. The purpose of the penalty is to motivate the bank to keep data when there is no reason for the bank to forecast sales revenue. Unlike Chaum bank notes, the issuance of a hat check may be associated with the redemption. The depositing of data and hat check issuance, however, may be anonymous. Data redemption may be anonymous but collecting a substantial penalty may be difficult to arrange anonymously. Managing anonymous transactions is a difficult but orthogonal issue.

One way to manage anonymous data acquisition is to emulate TCP’s ability to assemble packets that are reordered, redundant and missing, into a complete whole. Forward error control (ECC & such) applied across packets can alleviate missing packets.
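To make the reassembly idea concrete, here is a small sketch. It is not TCP, and a single XOR parity packet is only a stand-in for real error-correcting codes: it can recover at most one missing packet. The function names, the parity index `-1`, and the zero padding of the last packet are all assumptions made for illustration; the data is assumed to be non-empty.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_packets(data: bytes, size: int) -> dict:
    """Split data into numbered packets plus one XOR parity packet (index -1)."""
    chunks = [data[i:i + size].ljust(size, b"\0")
              for i in range(0, len(data), size)]
    packets = dict(enumerate(chunks))
    packets[-1] = reduce(xor_bytes, chunks)
    return packets


def reassemble(received: list, count: int, length: int) -> bytes:
    """Rebuild the data from (index, payload) pairs in any order."""
    got = dict(received)  # duplicates collapse; arrival order is irrelevant
    missing = [i for i in range(count) if i not in got]
    if len(missing) == 1 and -1 in got:
        # XOR of the parity packet with every packet that did arrive
        # yields the one that did not.
        others = [got[i] for i in range(count) if i in got]
        got[missing[0]] = reduce(xor_bytes, others, got[-1])
    elif missing:
        raise ValueError(f"cannot recover packets {missing}")
    return b"".join(got[i] for i in range(count))[:length]


data = b"The quick brown fox jumps over the lazy dog"
packets = make_packets(data, 8)
count = len(packets) - 1                                   # data packets only
arrived = [(i, p) for i, p in packets.items() if i != 2]   # packet 2 is lost
arrived = arrived[::-1] + arrived[:1]                      # reordered, one duplicate
restored = reassemble(arrived, count, len(data))
print(restored == data)
```

Because the bank identifies an acquisition only by its secure hash, the sender can confirm successful reassembly by comparing hashes, without any other acknowledgment.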

The Bank’s State

Logically the bank can perform all of these transactions by merely keeping the unordered set of acquisitions. It is practically necessary to index these by their secure hash but this index can be rebuilt from the acquisitions themselves. When it loses data it must keep canceled checks to avoid extra penalties. The bank need not keep records of checks that it has sold unless it wants to know when it can delete acquisitions. The bank will keep a list of Secrets indexed by SH(Secret) in order to detect canceled checks. It may want to keep marketing information to know when acquisitions are worth keeping merely to sell copies. The bank will need to keep records of the checks that it issues for financial auditors (to satisfy the owners of the bank).
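The state described above is small enough to sketch directly. Assumptions: SHA-256 stands in for the secure hash, in-memory dictionaries stand in for whatever storage a real bank would use, and the function names are hypothetical.

```python
import hashlib


def sh(data: bytes) -> str:
    """Secure hash; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


def rebuild_index(acquisitions) -> dict:
    """Derive the hash index from the unordered acquisitions alone."""
    return {sh(blob): blob for blob in acquisitions}


def remember_secret(canceled: dict, secret: bytes) -> None:
    """Record a learned Secret under SH(Secret)."""
    canceled[sh(secret)] = secret


def is_canceled(canceled: dict, secret_hash: str) -> bool:
    """A check is canceled if the bank knows the Secret behind its SH(Secret)."""
    return secret_hash in canceled


# The only essential state: the acquisitions and the learned Secrets.
acquisitions = [b"first collection", b"second collection"]
canceled = {}

index = rebuild_index(acquisitions)
published = sorted(index)             # the list of hashes the bank may publish
remember_secret(canceled, b"secret of a bought-back check")

print(len(published))
print(is_canceled(canceled, sh(b"secret of a bought-back check")))
```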

Bank Strategies

Banks might subcontract with other banks. As in insurance, a bank can reduce its penalty risk this way. There is then a risk that in some cycle of banks, each depends on the next to have the data. A bank has an incentive to occasionally demand portions of the data that it has contracted for. Any such cycles are thus detected early. A bank should recompute the hash occasionally for any data for which it is liable for loss. If the data is duplicated then the bank can buy repairs from another bank.
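The periodic recomputation and repair mentioned above might be sketched like this. Assumptions: SHA-256 stands in for the secure hash, the bank's store is a dictionary from hash to bytes, and `peer_fetch` is a hypothetical callable representing a purchase from another bank (price negotiation is omitted).

```python
import hashlib


def sh(data: bytes) -> str:
    """Secure hash; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


def audit(store: dict) -> list:
    """Return the hashes whose stored bytes no longer match."""
    return [h for h, blob in store.items() if blob is None or sh(blob) != h]


def repair(store: dict, damaged: list, peer_fetch) -> list:
    """Try to buy each damaged item from a peer; return what stays lost."""
    unrepaired = []
    for h in damaged:
        blob = peer_fetch(h)
        # Never trust the peer: accept a copy only if its hash matches.
        if blob is not None and sh(blob) == h:
            store[h] = blob
        else:
            unrepaired.append(h)
    return unrepaired


good = b"intact collection"
lost = b"collection that will rot"
only_here = b"collection with no other copy"

store = {sh(good): good, sh(lost): b"bit rot", sh(only_here): None}
peer = {sh(lost): lost}               # another bank holding a duplicate

damaged = audit(store)
still_lost = repair(store, damaged, peer.get)
print(len(damaged), len(still_lost))
```

The hashes left in `still_lost` are the ones for which the bank is exposed to a penalty.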

Stratification

Perhaps the long term data storage industry can be divided into the following pieces: Data Hunters engage in knowing who has what data. Given a hash they can tell you what banks have the data. This might be the ultimate URL or URI server.

Ted Anderson has made some proposals along the same lines. It would be good to compare them in detail. There are references there to notes with ideas similar to these. I don’t claim priority here.

Tahoe is a promising technology upon which a data bank might well be based. The ideas presented on this page merely presume infrastructures such as Tahoe. This page merely says why a service such as the data bank is needed. I see no information on their site describing obligations they undertake for your money.


It just now occurred to me that the following wrinkle would make this system more robust for one who needs to store data. Contract with more than one institution. This guards against default of a data bank. The cost of insuring thru two companies should be less than the sum of independent insurance, provided those companies know of the tandem contracts, and you can tell them! They can share the costs and take whatever measures they deem prudent. The professional data banks are in a better position to know the circumstances of other banks. I haven’t thought thru the sorts of interbank arrangements likely to occur in this case but I suspect they are complex. If one bank buys the other it would be best to sell one of the hat-checks back and buy another for the same data from another bank. This seems predicated on such a notion.
There is another malfeasance to be aware of. The data bank may become bankrupt and then claim that it cannot perform its obligations without payment in excess of that agreed to in the contracts. Worse, it may feign insolvency to this end. Such an institution would thereby lose its credibility but the operator may conclude that holding current data hostage has greater long term payout. Just now I do not have a solution to this perverse incentive.