It's not practical to think up a new algorithm every time you want to encrypt something. Useful encryption algorithms therefore use an extra piece of data called a key. In order to decrypt data, you need not only the decryption algorithm, but also the key; this means that you can use the same algorithm to encrypt different things and share them with different people. If an encryption algorithm doesn't use a key, then as soon as you know the decryption algorithm, you can decrypt all the things encrypted with the algorithm.
Cryptographic wisdom considers an encryption process secure when the only way to recover a plaintext message from ciphertext is to know or discover the key. Furthermore, this rule should still be true even if every other detail about the encryption process is known. In fact, cryptographers become very concerned about the security of an algorithm when more than a key must be kept secret (for instance, when the algorithm or an important part of it is not disclosed).
There are some things that encryption cannot do. When you encrypt something, you conceal its content, which doesn't necessarily conceal everything useful about it. Attackers may be able to gain important information by looking at the lengths of messages, or the times they're sent, or the senders and receivers; none of that will be changed by encryption.
Furthermore, encrypting something does not protect it from being changed. Somebody who has access to the ciphertext can change it; they won't be able to predict exactly what plaintext you'll get, but it will almost certainly be different. A human reading the decrypted text will usually notice this sort of tampering, but if the plaintext is binary data, the damage is much harder to spot, and a computer cannot detect tampering by inspection the way a human can. In order to protect a message from modification, in a way that a computer can detect, you must use integrity protection, which we discuss later in this appendix. Although some integrity protection systems use encryption, encryption by itself does not provide integrity protection.
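For instance, here is a minimal sketch in Python of how malleable unauthenticated encryption is, using a toy XOR-based stream cipher invented purely for illustration (it is not a real algorithm): flipping one ciphertext bit silently flips the corresponding plaintext bit, and decryption raises no error.

    import hashlib

    def keystream_xor(key, data):
        # Toy stream cipher: XOR the data with a keystream derived
        # from the key by repeated hashing. Illustration only.
        stream = b""
        counter = 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return bytes(d ^ s for d, s in zip(data, stream))

    key = b"shared secret key"
    ciphertext = keystream_xor(key, b"PAY ALICE $100")

    # An attacker who never learns the key flips one ciphertext bit...
    tampered = bytearray(ciphertext)
    tampered[4] ^= 0x01

    # ...and decryption completes without complaint, yielding altered text.
    print(keystream_xor(key, bytes(tampered)))   # b'PAY @LICE $100'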
The primary characteristic of an encryption algorithm is its strength, which is the amount of effort it takes for somebody to turn the ciphertext into plaintext without having the key. There's no absolute measure of the strength of an encryption algorithm. It's possible to say that some algorithms are very weak (for instance, replacing every letter with the letter three characters later in the alphabet, so that "b" becomes "e" and "c" becomes "f", is definitely a very weak algorithm), and it's possible to say that some algorithms have no known weaknesses. When we discuss particular algorithms, we will mention algorithms that are known to be attackable, or weak. A wide range of algorithms are available that don't have known weaknesses, and in most circumstances, you can and should choose one of them.
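As an illustration, the shift-by-three cipher just described takes only a few lines of Python, and breaking it is just as easy, since an attacker need only try the 25 possible shifts:

    def shift3(text):
        # Replace each letter with the letter three places later in the
        # alphabet, wrapping around; leave other characters alone.
        result = []
        for ch in text:
            if ch.isalpha():
                base = ord("a") if ch.islower() else ord("A")
                result.append(chr(base + (ord(ch) - base + 3) % 26))
            else:
                result.append(ch)
        return "".join(result)

    print(shift3("attack at dawn"))   # dwwdfn dw gdzq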
In the real world, strength is not the only important characteristic of an encryption algorithm. Some algorithms are much faster to execute than others, for instance, and the speed may depend on the kind of processor you have available. Some algorithms require more memory than others. Some algorithms have legal restrictions on their use, which are different in different countries. Some algorithms are patented and are expensive or difficult to license.
Public key and symmetric algorithms differ on all of these fronts (most notably, symmetric algorithms are much faster than public key algorithms), but the most important difference between them comes from the extra capabilities of public key algorithms. With a symmetric key algorithm, anybody who can encrypt a message can also decrypt the message, and it is therefore important to keep the one key secure so that it is known only by the parties who are trying to communicate. With a public key algorithm, you can make one key public while keeping the other one completely private, so that only you know it.
Having a public key and a private key opens up a number of new uses of cryptography. For instance, if you encrypt something with your private key, you don't keep the data secret (anybody who has the public key can read it), but you do prove that you encrypted it (because only you have the private key). This is called signing. Digital signatures are discussed in detail later in this appendix. Anybody who wants to send you something secret can encrypt it with your public key, and be sure that only you can read it. In fact, some ways of using public key technology don't actually allow you to conceal data; you can only use them for authentication; some public key algorithms are not encryption algorithms.
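To make this concrete, here is a sketch using textbook RSA with deliberately tiny, insecure numbers; real keys are hundreds of digits long, and real signature schemes add padding and other safeguards. The point is only that a value transformed with the private key can be checked by anyone holding the public key:

    # Textbook RSA signing sketch; all the numbers are toy values.
    p, q = 61, 53
    n = p * q                             # 3233, part of the public key
    e = 17                                # public exponent
    d = pow(e, -1, (p - 1) * (q - 1))     # private exponent (2753)

    message_hash = 1234                   # stand-in for a hash of the message

    signature = pow(message_hash, d, n)   # "encrypt" with the private key
    # Anyone holding the public key (e, n) can verify the signature.
    assert pow(signature, e, n) == message_hash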
Just as there are differences between public key algorithms and symmetric algorithms, there are differences among symmetric algorithms. Some work on fixed-sized chunks of data and are called block ciphers. Others, called stream ciphers, work on an arbitrary sequence of bits or bytes. There are various ways, called modes, to extend a block cipher so that it can encrypt more than just a single block of data. Stream ciphers are naturally designed to handle an arbitrarily sized stream of data.
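The following sketch shows the chaining structure of one common mode, CBC (Cipher Block Chaining). The "block cipher" here is a hash-based stand-in invented for illustration; it is not even reversible, so this sketch cannot decrypt. What matters is how each plaintext block is mixed with the previous ciphertext block before it is encrypted:

    import hashlib

    BLOCK = 16   # block length in bytes

    def toy_block_encrypt(key, block):
        # Stand-in for a real block cipher; illustration only.
        return hashlib.sha256(key + block).digest()[:BLOCK]

    def cbc_encrypt(key, iv, plaintext):
        assert len(plaintext) % BLOCK == 0   # real modes also define padding
        previous = iv
        out = b""
        for i in range(0, len(plaintext), BLOCK):
            block = plaintext[i:i + BLOCK]
            # XOR each block with the previous ciphertext block before
            # encrypting, so identical plaintext blocks encrypt differently.
            mixed = bytes(a ^ b for a, b in zip(block, previous))
            previous = toy_block_encrypt(key, mixed)
            out += previous
        return out

    print(cbc_encrypt(b"key", b"\x00" * BLOCK, b"A" * 32).hex())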
The encryption of variable amounts of data is usually called bulk encryption. In this case, anything bigger than about 64 bits is considered "bulk". Almost all bulk encryption is done with symmetric algorithms because of the speed difference between symmetric and public key encryption. It is frequently desirable to use bulk encryption in situations where the communicating parties don't already have a common symmetric encryption key. An extremely common way to solve this problem is to combine public key cryptography with symmetric key cryptography. For instance, the PGP package, commonly used for bulk encryption of electronic mail, uses a symmetric key algorithm to encrypt the body of a mail message, and then uses public key encryption to encrypt the symmetric key and sends the encrypted symmetric key with the message.
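Here is a sketch of that hybrid pattern, reusing the toy keystream cipher and textbook RSA ideas from the earlier sketches; none of it is secure, but it shows the shape of what packages like PGP do:

    import hashlib, secrets

    def keystream_xor(key, data):
        # Toy symmetric bulk cipher (XOR with a hash-derived keystream).
        stream, ctr = b"", 0
        while len(stream) < len(data):
            stream += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
            ctr += 1
        return bytes(d ^ s for d, s in zip(data, stream))

    # Recipient's textbook RSA key pair (tiny, insecure numbers).
    p, q, e = 61, 53, 17
    n = p * q
    d = pow(e, -1, (p - 1) * (q - 1))

    # Sender: pick a fresh symmetric session key, encrypt the bulk data
    # with it, then encrypt the session key with the recipient's public key.
    session_key = secrets.randbelow(n)
    body = keystream_xor(session_key.to_bytes(2, "big"), b"a long mail message...")
    wrapped_key = pow(session_key, e, n)

    # Recipient: unwrap the session key with the private key, then decrypt.
    recovered = pow(wrapped_key, d, n)
    print(keystream_xor(recovered.to_bytes(2, "big"), body))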
Some ways of figuring out keys are based on how the algorithm works. With a symmetric key algorithm, the easiest way to figure out a key is to try all the possible keys until you find the correct one (knowing you have found the correct key is a separate, and tricky, problem); the more possible keys there are, the harder this is. With a public key algorithm, you can try to calculate the private key from the public key using existing mathematical techniques (given a fast enough computer with enough memory), or you can invent a new mathematical technique for solving the equations used in the public key algorithm. Either approach is easier than trying all possible private keys.
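The following sketch brute-forces a toy cipher with a deliberately tiny 16-bit key space; note that the loop can only flag candidate keys whose output looks plausible, which is exactly the "knowing you have found the correct key" problem just mentioned:

    import hashlib

    def toy_encrypt(key, data):
        # Toy cipher with a deliberately tiny 16-bit key space.
        pad = hashlib.sha256(key.to_bytes(2, "big")).digest()
        return bytes(d ^ p for d, p in zip(data, pad))

    ciphertext = toy_encrypt(0x3A7C, b"meet at noon")

    # Exhaustive search: try every one of the 65,536 possible keys and
    # flag any whose output looks like readable text.
    for k in range(2 ** 16):
        guess = toy_encrypt(k, ciphertext)
        if all(32 <= b < 127 for b in guess):
            print(hex(k), guess)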
For any given algorithm, the longer the key is, the harder it is to find it out. On the other hand, you can't directly compare key lengths between different kinds of algorithms because the ways that they can be attacked are different. A 128-bit key is a pretty strong key for a symmetric algorithm but a pretty weak one for most public key algorithms. This is because 128 bits is a lot of keys to search but not large enough to prevent mathematical attacks on public key algorithms. Since different public key algorithms use different relationships between the private key and the public key, key lengths can't always be compared even between different public key algorithms.
If you know that trying all possible keys is the only way to find a key, you can be reasonably confident about your security; the speed of light imposes a theoretical limit on how fast computations can be performed, so if the key is big enough, nobody can be sure to find it within the estimated lifetime of the universe. (It's always possible to get lucky when you're trying keys, but you're much more likely to win a lottery.) The key length required to be big enough is surprisingly short; it's under 128 bits.
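The arithmetic behind that claim is easy to check. Assuming an attacker who can somehow test 10^18 keys per second (far beyond any real machine), searching half of a 128-bit key space still takes trillions of years:

    guesses_per_second = 10 ** 18         # wildly optimistic for an attacker
    keys_to_try = 2 ** 127                # on average, half the 128-bit space
    seconds_per_year = 60 * 60 * 24 * 365
    years = keys_to_try / (guesses_per_second * seconds_per_year)
    print(f"{years:.1e} years")           # about 5.4e+12 years, hundreds of
                                          # times the age of the universe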
The situation with public keys is not as simple. Currently known techniques for finding private keys are difficult on today's computers, but it has yet to be proven that there are no faster techniques.
It's important to distinguish between key length and block length when discussing block ciphers. These are often but not always the same, and the block length is the length most often given. When somebody refers to a "64-bit block cipher", that is the length of the block, and not necessarily the length of the key. This can be horribly confusing, since for other ciphers the key length is the only thing normally specified in bits like that. Pay close attention to which length is being specified; a "64-bit public key algorithm" has a 64-bit key, while a "64-bit block cipher" might have any key length at all.
A checksum is usually just several bytes long and takes up much less space than the original data. While this makes checksums easier to store, it also means that there must be situations where a checksum gives the same answer for two different pieces of data. This is called a collision, and checksum algorithms are designed to make collisions unlikely for the kinds of differences they are meant to detect. Checksum algorithms for communications are designed to detect random bursts of errors or chunks of missing data, because those are the kinds of differences you often get on a telephone line or a radio transmission (to humans listening, these errors sound like clicks or pops).
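A sketch of a classic non-cryptographic checksum (the sum of all bytes, modulo 2^16) shows both behaviors: it catches a typical one-byte corruption, but two different messages can easily share a checksum:

    def byte_sum_checksum(data):
        # A classic non-cryptographic checksum: sum of bytes mod 2**16.
        return sum(data) % 65536

    # It catches a typical random corruption...
    print(byte_sum_checksum(b"hello world"))   # 1116
    print(byte_sum_checksum(b"hellp world"))   # 1117 -- detected

    # ...but swapping bytes leaves the sum unchanged: a collision,
    # and one that an attacker can engineer deliberately.
    print(byte_sum_checksum(b"dog"))           # 314
    print(byte_sum_checksum(b"god"))           # 314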
What if the error is not random but a deliberate change? Is it possible to make a deliberate change and keep the checksum the same? For many checksums this is certainly possible, because they are not designed to make it difficult. There are ways to design checksum algorithms so that it is computationally infeasible to make a deliberate change and still have the checksum match. Algorithms designed this way are called cryptographic hash functions, cryptographic checksums, or message digest functions.
Note that the terminology used for the different techniques and uses of cryptographic hashes is confusing and overlapping. As a result, terms are not used very consistently; about the best you can say is that any term involving integrity, digest, or the acronym MAC (Message Authentication Code) probably refers to some process that uses some kind of hash. If you care about the details, investigate them, rather than trusting that terms are used consistently from one document to another.
The term hash comes from another situation where it is useful to have a short, fixed-length string that you can generate consistently from a larger string and that changes substantially as a result of small changes to the input. Hash algorithms and checksum algorithms are not always used for the same purposes, but if you extend either concept for cryptographic security, you reach the following set of conditions for a cryptographic hash: it must be computationally infeasible to recover the original data from the hash value; it must be computationally infeasible to find another piece of data that produces the same hash value; and a small change to the input must produce a large, unpredictable change in the output.
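A real cryptographic hash such as SHA-256 (used here via Python's standard library) shows the last property directly; changing one character of the input changes the output completely:

    import hashlib

    print(hashlib.sha256(b"meet at noon").hexdigest())
    print(hashlib.sha256(b"meet at noom").hexdigest())
    # The two digests share no visible pattern, even though the
    # inputs differ by a single character.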
Cryptographic hashes are also frequently used in authentication systems. In most cases, passwords are not encrypted, but hashed. If you use encryption on a password, it is possible to decrypt it and get back the password, which an attacker could then use somewhere else; this makes storing and sending encrypted passwords dangerous. Instead, secure systems store a hash of the password and, when a user wants to authenticate, compute the hash of the password the user provides and compare it with the stored one. If the hashes match, the user almost certainly knows the correct password.
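A bare-bones sketch of this store-and-compare scheme follows; note that production systems additionally use per-user salts and deliberately slow hash functions, which this illustration omits:

    import hashlib

    def hash_password(password):
        return hashlib.sha256(password.encode()).hexdigest()

    stored = hash_password("correct horse")   # only the hash is stored

    def login(attempt):
        # Recompute the hash of the attempt and compare; the plaintext
        # password is never stored and cannot be recovered from the hash.
        return hash_password(attempt) == stored

    print(login("correct horse"))   # True
    print(login("wrong guess"))     # False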
This technique does not help if a user can directly provide the hash, instead of the password (for instance, if the user is logging in over a network and what the system actually receives is the hash). If the hash is sent around, it simply becomes another fixed and reusable password. (This is discussed further in Chapter 21, "Authentication and Auditing Services".) For this reason, good network authentication systems use a random piece of data that changes on every transaction, often called a nonce. Adding this changing value into the information that gets hashed and exchanged prevents eavesdroppers from reading and reusing the hashes.
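Here is a sketch of such a challenge-response exchange, with a plain hash over the nonce and a shared secret standing in for whatever keyed construction a real protocol would use:

    import hashlib, secrets

    shared_secret = b"correct horse"   # known to both client and server

    # Server: issue a fresh, never-reused random challenge (the nonce).
    nonce = secrets.token_bytes(16)

    # Client: prove knowledge of the secret without sending the secret
    # (or a fixed, reusable hash of it) over the network.
    response = hashlib.sha256(nonce + shared_secret).hexdigest()

    # Server: recompute and compare. An eavesdropper who captures this
    # response cannot reuse it; the next login gets a different nonce.
    expected = hashlib.sha256(nonce + shared_secret).hexdigest()
    print(response == expected)   # True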
It is possible to use a block cipher in a special way to calculate a cryptographic hash. When a cipher is used like this, the resulting checksum may be called a Message Authentication Code (MAC), a Message Integrity Code (MIC), or a Message Digest Check (MDC).
The use of block ciphers for calculating cryptographic hashes is often seen in older cryptographic protocols. In particular, they frequently use the Data Encryption Standard (DES) to produce 64-bit values. Most modern cryptographic protocols explicitly use cryptographic hash algorithms. One reason is that cryptographic hash algorithms tend to produce results that are 128 to 160 bits in size. This significantly reduces the chances of being able to come up with a different piece of data that produces the same hash value.
One way to keep an attacker from simply recalculating the checksum after changing the data is to encrypt the cryptographic hash using a public key algorithm; this is essentially a digital signature, as we discuss later. Because of the slowness of public key encryption, it is not always a usable solution. Alternatives have been developed that are based on cryptographic hashes that include a secret key as part of the calculation; without that key, someone wishing to change the data cannot recalculate a new and valid checksum. Many recent Internet protocols use a method called HMAC[188] for combining a key with a cryptographic hash. The HMAC technique, described in RFC 2104, can be used with any cryptographic hash function that produces at least 128 bits of output.
[188]HMAC, although it is capitalized like an acronym, does not have an official expansion in the standard. In the paper originally describing the algorithm, it stood for "Hash-based Message Authentication Code".
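Python's standard hmac module implements RFC 2104, so a sketch of keyed hashing is short:

    import hmac, hashlib

    key = b"shared integrity key"
    message = b"transfer $100 to alice"

    tag = hmac.new(key, message, hashlib.sha256).hexdigest()

    # The receiver, who shares the key, recomputes the tag; compare_digest
    # avoids leaking information through comparison timing.
    check = hmac.new(key, message, hashlib.sha256).hexdigest()
    print(hmac.compare_digest(tag, check))   # True

    # Without the key, an attacker who alters the message cannot
    # produce a matching tag.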
Computers are very bad at being random because they are designed to calculate the same answer repeatedly when given the same data to work with. If you have an algorithm for producing random numbers, you will get the same list of random numbers every time you run it with the same starting information. This is why random number generators on computers are called pseudo-random: anyone who knows the algorithm and the starting information (and has a fast enough computer) can predict what the numbers are going to be. Truly random numbers cannot be predicted.
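Python's standard random module demonstrates this determinism directly:

    import random

    random.seed(42)
    first = [random.randint(0, 99) for _ in range(5)]
    random.seed(42)
    second = [random.randint(0, 99) for _ in range(5)]

    # Same algorithm, same starting information: the "random" numbers
    # are exactly reproducible, which is why they are only pseudo-random.
    print(first == second)   # True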
So where can acceptable random numbers come from? It is easy if the computer has hardware to generate random numbers; if not, some random information can be obtained from peripheral devices attached to the computer. This has to be done very carefully, because it is easy to overestimate how random some sources of information are. Sampling a fast-running clock, for example, is not really a good way to get random numbers, because typically only a small number of bits change each time the clock is sampled.[189]
[189]RFC 1750 is an excellent source of information about random numbers for use in security applications.

Several free Unix operating systems, including Linux, have special code as part of the low-level device drivers, which continuously collects random input events from the keyboard and other hardware devices and allows applications to obtain a source of random data.
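From Python, os.urandom draws on exactly this kind of kernel-collected randomness (on Linux, the pool behind /dev/urandom):

    import os

    # Sixteen bytes of operating-system randomness, suitable for keys
    # and nonces, printed as hex.
    print(os.urandom(16).hex())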