From Ars Technica:
Google went ahead and did it. Books no longer in copyright are now available for download from the Google Book Search site. If you’re looking for something tasty, might we recommend an early English translation of Montaigne’s provocative essay “On Some Verses of Virgil”? (Hint: the naughtiest bits are in the Latin epigrams, the worst of which aren’t even translated).
There’s plenty of precedent for this sort of thing. Project Gutenberg provides access to 19,000 classic books, but in a text-only format. The Christian Classics Ethereal Library offers both text and PDF versions of a massive collection of source material, but only on one particular topic. There’s also the Perseus Project, which offers ancient and Renaissance texts. Google could top all of these projects by providing fully-searchable versions of a much wider selection of books, many of which can also be downloaded as PDFs that are ready to print.
While this only applies to older books, it’s still a great way of democratizing access to the world’s knowledge (in English, at any rate), and it can’t raise any objections from publishers. Books which were before available only on the shelves of large academic libraries are now available to anyone with a Web connection and some curiosity. Scientia vincit omnia!
But not everyone is thrilled with the results so far. From Planet PDF:
There’s no doubt Google needs to be applauded for the idea, but the execution (i.e. the books they’ve produced) could definitely do with some work. The PDF books are difficult to download, large in size, of such low resolution they’re difficult to read, unsearchable, and do not allow the user to copy text from them. It’s left me wondering what Google expects people to do with the books.
And more critique here.
It’s strange, because they’re obviously searchable /by Google/, since the .pdf’s show highlighted items. So they are indeed OCR’d. I wonder what the blue lines through certain parts mean. I hope half the stuff doesn’t have to be rescanned; though, of course, that could be done.
It may not be everything, but it’s a pretty hefty wedge in the door, I think. And, frankly, given what I’ve found so far, who /needs/ anything written in the 20th century? : )
Planet PDF’s criticisms are poorly thought out. For example, the claim that the books “are difficult to download” is broad and generic; I certainly had no difficulty downloading Montaigne’s Essays by simply clicking the obvious “download” button. But perhaps Planet PDF says this in relation to their statement, “Clicking on a web link to a PDF file normally by default opens the document inside a Web browser.” This behavior exists in Internet Explorer on Windows with Adobe Acrobat installed, as it does in Safari on the Mac; Firefox, however, is polite enough to ask what a user wants to do with content that requires it to launch another program.
I do agree that PDF optimization is useful and should be standard practice — pdfopt has been widely and freely available for ages as part of Ghostscript. Optimization can, however, lead to larger file sizes, and Google may have been trying to avoid that.
Text OCR would be nice, but there are projects for that already. Google isn’t into duplicating effort, generally. Their product is highly legible and gives one a sense of the effort that once went into fine typesetting: compare the Montaigne download to Planet PDF’s Fables of Aesop, generated with Microsoft Word, and you’ll see the difference between typesetting and bulk text chundering. The former is an image of a work of skill; the latter is the McDonald’s version. If Google were trying to put Project Gutenberg out of business, perhaps Planet PDF’s criticism would be relevant, but they fail to realize that each project has a separate goal.
Speaking of beauty in typesetting, and having scrubbed my eyes after looking at that Aesop translation generated by Word, I find the comment Planet PDF makes about the resolution of Google’s texts being too low for easy legibility to be pure bunk. Blowing up the text so that “hoc est” fills my screen reveals jagged edges, but at a reasonable zoom level, the text passes for perfect. Perhaps Planet PDF forgot to turn on antialiasing in their PDF renderer? Perhaps they engage in hyperbole to attack a major project in order to gain attention for their parent company’s commercial ventures (NitroPDF)? If so, it has succeeded, as I for one had never heard of their site or their product until this post on Stoa.
Yeah, but how can blind people get text from the PDF? There’s no way to OCR it once you download it; it won’t let you.