What & Why?
Television is fundamentally a way of transmitting a moving picture and some sound from one place to another. In analog television, a light-sensitive target is scanned, releasing an electrical current proportional to the brightness of the scene at a given point. These electrical currents are transmitted in sequence, and “fired” at an electrically sensitive target in the TV receiver. Each point on this target – the front of the picture tube – emits an amount of light proportional to the intensity of the electrical current that struck it. The TV picture is “painted” onto the screen one spot at a time, starting in the upper left and moving right; once it hits the right edge of the screen, it moves down and “paints” the next line.
Digital TV is very similar. The difference: in analog, a signal directly representing an amount of electrical current is broadcast and received. In digital, a number representing an amount of electrical current is broadcast and received. Why?
In analog, any transmitted signal (between reasonable limits) is permissible. 1 volt, 0.25 volt, 0.75 volt, 0.98 volt, 0.3997238 volt, all are possible, along with any other number you can concoct between 0 and 1. You transmit 0.4 volt; interference causes it to be received as 0.85 volt; the receiver has no way of knowing anything went wrong. It will display the incorrect value.
In digital, only certain values are permissible. 0, 0.125, 0.250, 0.375, 0.500, 0.625, 0.750, 0.875 may be the only signals allowed. If the receiver sees 0.785 volt, it knows something's wrong. It can be programmed to wait for the transmitter to send the signal again; to ignore it; or to make an educated guess of what it should have been.
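Here's a minimal sketch of that idea in Python. The eight levels come straight from the paragraph above; the tolerance value and the function names are my own assumptions, chosen only for illustration.

```python
# The eight permitted signal levels from the text: 0, 0.125, ..., 0.875 volt.
ALLOWED_LEVELS = [i * 0.125 for i in range(8)]

def is_permitted(received, tolerance=0.01):
    """True if the received voltage sits on (or very near) an allowed level.
    The tolerance is an assumed value, not something from a standard."""
    return any(abs(received - level) <= tolerance for level in ALLOWED_LEVELS)

def best_guess(received):
    """Educated guess: the allowed level closest to what was received."""
    return min(ALLOWED_LEVELS, key=lambda level: abs(level - received))

print(is_permitted(0.750))  # sits exactly on an allowed level
print(is_permitted(0.785))  # falls between levels, so something went wrong
print(best_guess(0.785))    # nearest allowed level
```

An analog receiver has no equivalent of `is_permitted`: every voltage between the limits is a legal signal, so a damaged one looks just as valid as a good one.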
It goes well past that. By applying various formulas to the numbers transmitted, it is possible to “skip” some of the signals. In analog TV, if you transmit a picture of the U.S. flag, you transmit “blue, blue, blue, blue, blue..{blue 245 more times}..red, red, red..{red 446 more times}..red{new row}blue, blue, blue...” That's a lot of redundant information.
In digital TV, you can transmit “blue 250 times, then red 450 times, then a new row just like the first one...” And if the flag is blowing in the breeze, you just tell the receiver “this bunch of spots moved this far in this direction” – you don't transmit the entire flag again. Now, most things you'd transmit are a lot more complicated than the American flag, so you won't save nearly as much information in transmission. You will still save a lot, though.
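The “blue 250 times, then red 450 times” trick is known as run-length encoding. Here's a minimal sketch of it in Python. It only illustrates the idea of squeezing out repetition; it is not the actual coding scheme used in digital TV, and the function names are my own.

```python
from itertools import groupby

def run_length_encode(row):
    """Collapse each run of identical values into a (value, count) pair."""
    return [(color, len(list(run))) for color, run in groupby(row)]

def run_length_decode(runs):
    """Expand (value, count) pairs back into the original row."""
    return [color for color, count in runs for _ in range(count)]

# One row of the flag as described in the text: 250 blue spots, then 450 red.
row = ["blue"] * 250 + ["red"] * 450
encoded = run_length_encode(row)
print(encoded)
```

The 700 spots in the row shrink to two pairs, and `run_length_decode` gets all 700 back exactly.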
How? (video)
As I mentioned before, digital television starts the same way as analog. You scan a light-sensitive target, accumulating a series of electrical signals proportional to the amount of light striking each spot on the target. In practice, to get color TV you use three targets. A special mirror and a set of color filters are used to ensure one target receives the red light; one the green light; and one the blue light. The electrical signals from all three targets are added to create a black-and-white signal, called luminance. The red and blue signals are also processed.⁵
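As a rough sketch of that addition: the text above says only that the three signals are added. I'm assuming here the weighted sum commonly used for standard-definition video (0.299 red, 0.587 green, 0.114 blue), and I'm assuming the “processing” of red and blue means subtracting the luminance from each. The function names are made up for illustration.

```python
def luminance(r, g, b):
    """Weighted sum of red, green and blue (each 0.0 to 1.0).
    The weights are the assumed standard-definition values."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def color_difference(r, g, b):
    """Assumed processing of red and blue: subtract the luminance from each."""
    y = luminance(r, g, b)
    return r - y, b - y

print(luminance(1.0, 1.0, 1.0))         # white: full brightness
print(color_difference(1.0, 1.0, 1.0))  # white carries (almost exactly) no color
print(luminance(0.0, 1.0, 0.0))         # green contributes most to brightness
```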
The resulting signals are “sampled” - measured and converted to numbers at regular intervals. You already know that the picture is scanned – broken up into – horizontal lines. In analog TV there are 525 lines, but roughly 45 of them are used for synchronizing and blanking signals not necessary with digital. Standard-definition digital TV breaks the picture up into 480 lines.
In analog TV, the horizontal lines are not broken up into dots.⁶ A single line is sent as a continuous, varying current. In digital, however, the line is sampled as a series of dots, or picture elements (pixels). On a standard TV set, whose picture tube is 33% wider than it is high,⁷ there are (4/3) × 480, or 640, of these pixels on each line. The digital TV system generates a number representing the brightness of the black-and-white luminance signal at each of these 640 pixels; it also generates a number representing the brightness of the red and blue signals at every other pixel.⁸
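The arithmetic is easy to check. This sketch just counts the numbers generated for one frame, using only the figures given above (480 lines, a 4:3 picture, luminance at every pixel, red and blue at every other pixel). The variable names are my own.

```python
LINES = 480
ASPECT_WIDTH, ASPECT_HEIGHT = 4, 3

# 480 lines on a 4:3 screen gives 640 pixels per line.
pixels_per_line = LINES * ASPECT_WIDTH // ASPECT_HEIGHT

luma_per_line = pixels_per_line          # one luminance number per pixel
chroma_per_line = pixels_per_line // 2   # red (and blue) at every other pixel

samples_per_line = luma_per_line + 2 * chroma_per_line
samples_per_frame = samples_per_line * LINES

print(pixels_per_line)    # pixels on one line
print(samples_per_line)   # numbers generated per line
print(samples_per_frame)  # numbers generated per frame, before compression
```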
These numbers (I'll just call them “pixels”) are arranged in blocks eight pixels wide by eight lines tall; each block contains only luminance pixels, or only red pixels, or only blue pixels. A group of four luminance blocks is then combined with one block of red pixels and one block of blue pixels to form a macroblock.
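To make the block arrangement concrete, here's a small sketch that cuts a 16 × 16 area of luminance numbers into the four 8 × 8 blocks of one macroblock. I'm assuming the four luminance blocks form a 2 × 2 square, which the text above doesn't spell out; the names are illustrative.

```python
BLOCK = 8  # blocks are eight pixels wide by eight lines tall

def split_into_blocks(area):
    """Split a 16x16 grid of luminance values into four 8x8 blocks,
    ordered top-left, top-right, bottom-left, bottom-right."""
    blocks = []
    for top in range(0, 16, BLOCK):
        for left in range(0, 16, BLOCK):
            blocks.append([row[left:left + BLOCK] for row in area[top:top + BLOCK]])
    return blocks

# A test pattern in which every pixel holds its own position number.
area = [[16 * y + x for x in range(16)] for y in range(16)]
blocks = split_into_blocks(area)

# Under the same assumption, a 640 x 480 picture holds this many macroblocks.
macroblocks_per_frame = (640 // 16) * (480 // 16)
print(len(blocks), macroblocks_per_frame)
```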
Macroblocks come in handy for compression, that is, for reducing the amount of data that must be transmitted to represent a picture. Macroblocks are predicted to move as a group. Chances are good, for example, that a block of pixels representing an actor's face will move horizontally across the screen; chances are quite a bit less that the actor's face will change shape. (Of course, certain movies will shatter this assumption!)
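One simple way an encoder might find out how far a block moved is to slide it around the previous picture and keep the position with the smallest difference. The sketch below does that on a tiny 2 × 2 block; it's a toy under my own assumptions (exhaustive search, sum of absolute differences), not a description of how any particular encoder works.

```python
def block_at(frame, top, left, size):
    """Cut a size x size block out of a frame (a list of rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def difference(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def find_motion(prev, cur, top, left, size, search):
    """Return (dy, dx): where, relative to its current position, the block
    at (top, left) in cur is best matched in the previous frame."""
    target = block_at(cur, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > len(prev) or l + size > len(prev[0]):
                continue  # candidate would fall off the edge of the picture
            cost = difference(target, block_at(prev, t, l, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2]

def make_frame(top, left):
    """A 6-line by 8-pixel black frame with one small bright patch."""
    frame = [[0] * 8 for _ in range(6)]
    frame[top][left:left + 2] = [1, 2]
    frame[top + 1][left:left + 2] = [3, 4]
    return frame

prev = make_frame(2, 1)
cur = make_frame(2, 4)   # the patch has moved three pixels to the right
print(find_motion(prev, cur, top=2, left=4, size=2, search=3))
```

The result says the block's contents came from three pixels to the left in the previous picture, so only that short “it moved” message needs to be sent rather than the pixels themselves.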
Macroblocks within a given row are arranged in slices. The MPEG-2 standard suggests between four and eight slices per row. The start of each slice is marked with a special code, so the receiver can more easily find it.⁹ Using more slices makes it easier for the receiver to recover from noise and interference, but it also requires more transmission bandwidth.
Finally, slices are arranged in frames. Each frame represents one still picture; as you probably know from analog TV (or the movies), showing a sequence of stills in rapid succession presents the illusion of a moving picture. There are three types of frame used in MPEG-2:
I frames: (I = “Intraframe”)
Each I-frame is a complete picture. MPEG-2 requires an I-frame be sent at least every half-second for reasons that will probably be obvious when you understand what P and B frames are.
P frames: (P = “Predicted”)
The contents of a P-frame are predicted from the previous I or P frame. The P-frame contains only the difference between the picture it represents and that represented by the previous frame. If, for example, you were transmitting a picture of someone adding two stars to the American flag, the first I-frame might display the entire flag while succeeding P-frames might show only the stars.
B frames: (B = “Bidirectionally predicted”)
The contents of a B-frame are predicted from both the previous and the next frame. (Obviously a B-frame cannot be displayed until the subsequent I or P frame is received!)
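The “send only the difference” idea behind P-frames can be sketched in a few lines. This toy version compares two pictures spot by spot and records only the spots that changed. Real MPEG-2 works on macroblocks and is a good deal more elaborate, so treat this as an illustration of the principle only; the function names are mine.

```python
def encode_p_frame(previous, current):
    """Record only the spots that differ from the previous frame,
    as a mapping {(row, column): new_value}."""
    return {
        (r, c): value
        for r, row in enumerate(current)
        for c, value in enumerate(row)
        if value != previous[r][c]
    }

def decode_p_frame(previous, changes):
    """Rebuild the new frame from the previous frame plus the changes."""
    frame = [row[:] for row in previous]
    for (r, c), value in changes.items():
        frame[r][c] = value
    return frame

# The "I-frame": a small all-blue corner of the flag.
i_frame = [["blue"] * 4 for _ in range(3)]

# Someone adds two stars.
next_picture = [row[:] for row in i_frame]
next_picture[0][1] = "star"
next_picture[2][3] = "star"

changes = encode_p_frame(i_frame, next_picture)
print(changes)   # only the two stars need to be sent
```

This also hints at why a complete I-frame has to turn up regularly: `decode_p_frame` is useless to a receiver that doesn't already hold the previous picture.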
All of these frames are assembled in order into a stream of data. The stream is packaged as a “packetized elementary stream”, or PES, which is carried in 188-byte¹⁰ transport packets. It is then multiplexed with audio and other data, and sent to the 8VSB modulator for transmission. More on all of that later...
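Cutting a long stream of data into fixed 188-byte pieces can be sketched as below. This is a simplification under my own assumptions: real packets also carry a header identifying what's inside, which I've left out, and the padding byte used to fill the last packet is an arbitrary choice.

```python
PACKET_SIZE = 188  # bytes per packet, as given in the text

def packetize(stream, pad=b"\xff"):
    """Cut a byte stream into 188-byte packets, padding the last one.
    (No headers here; the padding value is an assumption.)"""
    packets = []
    for start in range(0, len(stream), PACKET_SIZE):
        chunk = stream[start:start + PACKET_SIZE]
        packets.append(chunk.ljust(PACKET_SIZE, pad))
    return packets

stream = bytes(400)            # 400 bytes of stand-in picture data
packets = packetize(stream)
print(len(packets), [len(p) for p in packets])
```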
How? (audio)
In analog television, a pressure wave striking a diaphragm in a microphone generates an electrical current. This current is amplified and sent to the receiver, where it's used to move the paper cone of a loudspeaker. This speaker converts the current back into changes in air pressure, which the viewer hears as sound.
In digital TV sound, as with the picture, the electrical current is measured at specific times, converted to a number, and that number is sent to the receiver. The receiver uses the number to regenerate the original electrical current and feed it to the speaker.
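Here's a minimal sketch of “measure at specific times, convert to a number”, using a pure tone in place of a microphone. The 48,000 measurements per second and the 16-bit number range are values I've assumed for the example, and the function name is made up.

```python
import math

SAMPLE_RATE = 48_000   # measurements per second (assumed for this example)
BITS = 16              # size of each number (assumed for this example)

def sample_tone(freq_hz, count):
    """Measure a full-volume sine wave `count` times and turn each
    measurement into a whole number in the 16-bit signed range."""
    full_scale = 2 ** (BITS - 1) - 1   # largest positive 16-bit value
    return [
        round(full_scale * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(count)
    ]

# A 1,000 Hz tone takes 48 measurements to complete one cycle.
samples = sample_tone(1_000, 48)
print(samples[:13])
```

The receiver's job is the reverse: turn each number back into a voltage at the same steady rate and feed the result to the speaker.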
An ATSC DTV stream may¹¹ contain one or more audio programs. Audio is coded with the Dolby AC-3 standard. Each audio program is generally either “5.1” or stereo. A 5.1 program has six audio “channels”:
Left
Right
Center
Left Surround
Right Surround
Low Frequency Enhancement (“subwoofer”)
Each main channel has a frequency response of 3 – 20,000Hz. (The Low Frequency Enhancement channel's response is 3 – 120Hz. In theory all six channels have a low end going down to DC, but passing frequencies below 3Hz can screw up the data compression.) Each channel is sampled (measured and converted to a number) at a rate of 48,000Hz. The sample resolution – how small a difference in signal can be transmitted – can be as high as 24 bits, allowing 16,777,216 different signal levels. The AC-3 standard recommends use of at least 16 bits (65,536 different levels).
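Those level counts are just powers of two, and the same figures show how much raw data six channels produce before any compression. A quick check, with names of my own choosing:

```python
SAMPLE_RATE = 48_000   # samples per second, per channel
CHANNELS = 6           # the six channels of a 5.1 program

def signal_levels(bits):
    """Number of different values a sample of this many bits can take."""
    return 2 ** bits

def raw_bits_per_second(bits):
    """Data rate of all six channels with no compression at all."""
    return CHANNELS * SAMPLE_RATE * bits

print(signal_levels(24))        # levels available at 24 bits
print(signal_levels(16))        # levels available at 16 bits
print(raw_bits_per_second(16))  # uncompressed bits per second at 16 bits
```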
If a stereo program (with no surround-sound information) is provided, it is “rematrixed” for compatibility with surround-sound receivers. Channels may also be “blended” (mixed, trending towards monophonic sound) if there's not enough bandwidth to transmit the data from all channels.
The numbers representing the sound signals are encoded into “sync frames”. Each such frame includes a “sync word” (which tells the receiver where the beginning of the frame is); an indication of the sample rate (48,000Hz in this case); information about the audio data stream; 0.032 second of encoded audio; and a CRC error check code. The latter code allows the receiver to determine whether a frame was received properly, and to discard damaged frames.¹²
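The sketch below shows how an error check code lets a receiver spot a damaged frame. To be clear about what's assumed: AC-3 defines its own CRC and its own frame layout, and I'm using neither. I've swapped in the CRC-32 from Python's `zlib` module and a made-up layout (audio data followed by four check bytes) just to show the principle. The first line also checks how many samples fit in 0.032 second.

```python
import zlib

# 0.032 second of audio at 48,000 samples per second.
samples_per_frame = round(0.032 * 48_000)

def build_frame(payload):
    """Append a 4-byte check code to the payload (made-up layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def frame_is_intact(frame):
    """Recompute the check code and compare it with the one received."""
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    return zlib.crc32(payload) == stored

good = build_frame(b"0.032 second of encoded audio")

# Simulate interference: flip a single bit in the first byte.
damaged = bytes([good[0] ^ 0x01]) + good[1:]

print(samples_per_frame)
print(frame_is_intact(good))     # frame arrived as sent
print(frame_is_intact(damaged))  # frame should be discarded
```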
AC-3 audio offers eight different audio services. In practice, I've never encountered anything besides “CM”. To be complete, the possible services are:
CM – Complete Main – a complete surround-sound service.
ME – Main music & effects – CM but without the dialog.
VI – Description of action for the visually-impaired. May or may not also include the music and effects elements.
HI – Specially-produced dialog service for the hearing-impaired.
D – Dialog only – intended to be mixed with ME. A single program may have several D services in different languages.
C – Commentary
E – Emergency messages – these will interrupt all other services.
VO – Voice Over – for example, to read promotions for upcoming shows during the credits of a previous show.
There are two other interesting features offered in the AC-3 standard.
Streams are encoded with a subjective “volume of dialog”. Receivers can use this information to automatically set the desired volume level, independent of changes in encoding between stations. As the standard notes, it's not impossible to cheat: an advertiser could tell your receiver his audio was 10 dB lower than program material, while it's actually 10 dB higher, causing your receiver to play his ad 20 dB louder than anything else.
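The cheat is simple arithmetic. In this sketch all levels are in dB relative to normal program material, and I'm assuming the receiver simply applies whatever gain would bring the claimed dialog level to its target. That's a simplified model for illustration, not the actual receiver behavior defined by AC-3.

```python
def playback_gain_db(claimed_db, target_db=0.0):
    """Gain the receiver applies so the *claimed* level lands on target."""
    return target_db - claimed_db

def heard_level_db(actual_db, claimed_db, target_db=0.0):
    """Level the viewer actually hears after the receiver's adjustment."""
    return actual_db + playback_gain_db(claimed_db, target_db)

program = heard_level_db(actual_db=0.0, claimed_db=0.0)       # honest program
cheating_ad = heard_level_db(actual_db=10.0, claimed_db=-10.0)

print(program)                 # program plays at the target level
print(cheating_ad - program)   # how much louder the ad ends up, in dB
```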
Also encoded are audio compression settings. AC-3 anticipates that mild level compression will be used – that loud passages will be turned down somewhat, and quiet ones boosted somewhat. This is generally necessary to ensure viewers in noisy environments (the typical family room!) can hear the program. Viewers with quiet “home theater” environments can use this information to “undo” the level compression and restore the original full dynamic range of the sound.
All of these frames are assembled in order into a stream of data. The stream is packaged as a “packetized elementary stream”, or PES, which is carried in 188-byte transport packets. It is then multiplexed with video and other data, and sent to the 8VSB modulator for transmission. More on all of that later... (I'll bet this paragraph was familiar...)