From Paper to the Internet:

Project Gutenberg

We live in a digital world overflowing with analog information. For every e-mail message a person receives, there's usually a corresponding voice mail item, a fax message, and probably a large heap of "snail mail" or a package delivered courtesy of a friendly postal employee.

Quite an industry has sprung up around the need to convert that analog information into digital form. Optical Character Recognition (OCR) systems transform faxes and paper messages into ASCII text (or -- better yet -- styled text, complete with a font which closely matches the original). Voice recognition systems translate the human voice into a form more capable of being sent over a slow modem link or placed in a searchable database.

Though these tools can seem impressive, they're still not smart enough to take human beings out of the process. Even the best OCR packages still make enough mistakes to force someone to check over the entire result for errors. And when it comes to something as sensitive as converting works of literature to digital form, the time commitment required to make sure the work is rendered faithfully begins to soar.

Enter Michael Hart, who, in 1971, began a project to convert public domain texts -- ones whose copyrights had expired -- to digital form. While early attempts didn't bear very much fruit (only a few small texts were converted back then), in 1991 his project, named after the man who sparked the printing revolution, finally took root.

To date, Hart and his 500 Project Gutenberg volunteers have converted almost 250 texts, ranging from the U.S. Declaration of Independence to Frankenstein to part of a turn-of-the-century version of the Encyclopaedia Britannica, into plain ASCII text, readable by users of just about any computer on the planet.

Though at first converting a book from page to hard drive might seem a simple matter of running it through an OCR package (or typing it in by hand) and editing out typographical errors, Project Gutenberg insists on a rigorous production process. First, source material (chosen by the volunteers themselves; Hart says he has his own favorites, but "I don't want my biases, much as I may love them, to affect things too much.") must be old enough to be out of copyright -- Project Gutenberg runs a copyright check on a work before volunteers even begin work on creating an etext.

Second, Gutenberg volunteers try to make their plain ASCII texts as readable as possible. All Gutenberg texts are unformatted, with carriage returns at the end of every line. While plain text doesn't allow editors very many tricks -- no special characters, no altering the spaces between letters, words, and lines -- Gutenberg's guidelines do encourage editors to break their lines at the ends of complete thoughts or with punctuation marks. For example, take this passage from Frankenstein:

How slowly the time passes here, encompassed as I am by frost and
snow! Yet a second step is taken towards my enterprise. I have
hired a vessel and am occupied in collecting my sailors; those
whom I have already engaged appear to be men on whom I can depend
and are certainly possessed of dauntless courage.

which might read better (and more poetically) as:

How slowly the time passes here,
encompassed as I am by frost and snow!
Yet a second step is taken towards my enterprise.
I have hired a vessel and am occupied in collecting my sailors;
those whom I have already engaged appear to be men on whom
I can depend and are certainly possessed of dauntless courage.

After a while, it seems, one gets in the habit of thinking carefully about how to break ASCII text at the end of lines. Hart himself seems to take this habit to the extreme -- every line of text he writes (except those at the end of a paragraph) is exactly the same length. Some of us choose our words after carefully weighing their meaning; Hart seems to weigh their meaning and their length.
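
The line-breaking guideline can be roughed out in code. What follows is only a sketch in Python, not anything Project Gutenberg is known to use: the function name break_at_thoughts, the 65-character width, and the 30-character minimum are all assumptions made for illustration. It fills each line greedily, but when a line has to end it backs up to the last word that closes with a punctuation mark, provided the line would still be reasonably long.

```python
# Hypothetical sketch: none of these names or limits come from Project Gutenberg.
PUNCTUATION = (".", ",", ";", ":", "!", "?")

PASSAGE = (
    "How slowly the time passes here, encompassed as I am by frost and "
    "snow! Yet a second step is taken towards my enterprise. I have "
    "hired a vessel and am occupied in collecting my sailors; those "
    "whom I have already engaged appear to be men on whom I can depend "
    "and are certainly possessed of dauntless courage."
)


def break_at_thoughts(text, width=65, min_width=30):
    """Re-break text into lines of at most `width` characters,
    preferring to end each line on a punctuation mark.
    A single word longer than `width` still gets a line to itself."""
    words = text.split()
    lines = []
    start = 0
    while start < len(words):
        length = 0    # characters used so far on the current line
        end = start   # index just past the last word that fits
        best = None   # index just past the last usable punctuation break
        while end < len(words):
            extra = len(words[end]) + (1 if end > start else 0)
            if end > start and length + extra > width:
                break
            length += extra
            end += 1
            # Remember a punctuation break only if the line is long enough.
            if words[end - 1].endswith(PUNCTUATION) and length >= min_width:
                best = end
        # If text remains and a punctuation break was seen, back up to it;
        # otherwise keep the widest line that fits.
        if end < len(words) and best is not None:
            end = best
        lines.append(" ".join(words[start:end]))
        start = end
    return lines


lines = break_at_thoughts(PASSAGE)
print("\n".join(lines))
```

Run on the Frankenstein passage with these settings, the sketch happens to reproduce the first four lines of the hand-broken version. It parts ways on the fifth, where no punctuation mark is within reach and a human editor's ear has to take over.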

Finally, editions are reviewed by Hart himself, and then the "Gutenberg etext" is released to the world as version 1.0. As the work is disseminated and errors are discovered, volunteers will release new versions of the texts every so often.

While systems like the World Wide Web's HTML and Ian Feldman's Setext (used by InterText and TidBITS) let creators of electronic texts omit hard line breaks and add attributes like italics and boldface, Gutenberg relies on plain text. Hart's rationale is that while standards may come and go, ASCII is forever.

"Only two authors of hundreds I have spoken with actually say it may make a difference whether their works were emphasized in a particular way, so most of the time it wouldn't make any difference," he says. But Hart indicates that Gutenberg would be willing to post books in some mark-up format, as long as "Plain Vanilla ASCII" editions always remain available.

Of greater concern to Hart and Project Gutenberg are possible changes in copyright laws. Currently, a copyright expires after the creator of a work has been dead for 50 years. The more that length extends, Hart says, the less information will be available to "the Information Poor" -- people who don't have the ability to pay for searching through or reading copyrighted material. Right now any text created before 1920 is in the public domain, and new works will begin coming into the public domain this year. But the United States Congress is considering legislation that would extend the copyright moratorium so that post-1919 works wouldn't begin entering the public domain until 2015, and there's no guarantee that copyright protection won't be extended even further before 2015 comes along -- long after the original creators of a work have profited off it, died, and left their estates to others who have also profited. "Adding another 20 years to the copyright incarceration of information won't help the Information Rich so much as it may move an Information Poor person over twice as far into the Dark Ages, by making them wait an additional 20 years for free access to information," Hart says.

The philosophy of making texts available to the information poor is what drives Hart and Project Gutenberg, and that's why the texts are available in ASCII. Essentially anyone with a computer -- even if the computer is of the 15-year-old, garage-sale variety -- can read Gutenberg etexts. If a computer has even the most rudimentary searching ability, it can be used to search Gutenberg etexts for relevant passages. In the end, an unlimited number of people will be able to choose from a large electronic library of texts while paying very little for the privilege. As CD-ROM technology expands and decreases in price, whole libraries of information will be available on just a few CD-ROMs at low cost.

It's a world-view that Project Gutenberg's volunteers seem to share, along with Hart's enthusiasm for the project. "There's a wonderful feeling I get from seeing a book posted on the Internet and knowing I played a part in its existence," says Christy Phillips, a Gutenberg volunteer from Syracuse, New York. "Once in a while, I also will get e-mail from someone who read one of the books I helped create or edit, and that, too, makes me feel the work is all worthwhile."

For Hart, the birth of every new electronic text is cause for celebration. "I feel as if I have discovered Archimedes' Lever," he says, "and am jacking up a whole world just a little with each book."

-- Jason Snell


For More Information

The easiest ways to find out more about Project Gutenberg are to subscribe to the GUTNBERG mailing list by sending mail to listserv@uiucvmd.bitnet with the message "SUB GUTNBERG [YOUR NAME]" (no quotes) in the body, or to read the mailing list's counterpart, the Usenet newsgroup bit.listserv.gutnberg. On the Internet, Project Gutenberg etexts can be found at mrcnext.cso.uiuc.edu, or on ftp.etext.org. Hypertext lists of Gutenberg etexts are available in official and unofficial versions.