Google's Giant Book Scanning Project

Discussion in 'General Distance Learning Discussions' started by BillDayson, Dec 21, 2004.

  1. BillDayson

    BillDayson New Member

    The 12-20-04 'San Francisco Chronicle' has a story about Google's ambitious plans to create a searchable online library of truly massive proportions. Hopefully (this) link will work.

    The plan is to "digitize the collections of five top libraries: Stanford, Harvard, Oxford, the University of Michigan and the New York Public Library. The project eventually will allow any Internet user anywhere in the world to search inside millions of volumes, seeing the pages exactly as they appeared in the originals, complete with illustrations, charts and photos."

    More precisely, here's what it will consist of:

    "Stanford has 8 million volumes, Michigan 7 million. Oxford, Harvard and the New York Public Library initially will participate only with a limited part of their collections."

    Not too shabby. But don't get too excited yet. While this has the potential of putting the mother of all academic libraries a click away, no matter where we are located, in real life it might not turn out exactly like that.

    There's the massive scale of the project. The University of Michigan's librarian estimates that at the current pace, it will take Google 19 years to scan every page of his seven million books. And Stanford's scanning project won't even start till January. So the thing still has to scale up from a pilot project to industrial strength.

    And more ominously, there are serious copyright issues.

    "For books published since 1923, which are still under copyright, Google search results will be limited to excerpts, unless the publishers agree otherwise.

    However, Stanford's ownership of the physical volumes means it can offer full searches to its constituents, Herkovic said. "Our interest is in being able to dispense it ourselves to the Stanford community in full text without whatever limitations Google puts on its users," he said.

    For the past year, Google has been testing searches of excerpts within some copyrighted books under agreements with major publishers such as Houghton Mifflin, McGraw Hill and Harper Collins."


    There are probably gonna be some pretty serious limitations on how much text is downloadable, I guess. So it probably isn't going to be a real free full-text online library, at least for books post-1923.

    But maybe the complete Stanford-style access will be made available on a subscription basis, either to individuals or to university libraries. So DL students could get full-text access through their program, and the copyright holder would get a royalty for every download.

    We all know that the heart and soul of a real graduate program is located in the library. And we also know that many DL programs fail badly in this area. So this project, or something like it, has the potential of providing DL programs with a wonderful resource to pass on to their students, giving geographically remote students access to a fully searchable, full-text virtual version of one of the world's most powerful research libraries.
     
  2. DesElms

    DesElms New Member

    I don't know why more people have not responded to your post, Bill. I, for one, am hugely excited about this. The Internet has long been seen as a medium certainly capable of this; and pre-web WAIS search capability and the meager repositories that at least some early pioneers out there tried to mount on the pre-web Internet gave us just a little bit of a taste. Ever since the advent of the "worldwide web" component of the Internet in 1994 (yes, the web's only not-quite-11 years old... hard to believe, isn't it?), I've hoped that it could become a nearly-complete library right on people's desks -- and for free, if possible. If Google prices the subscription part of it right so that we can all use it without breaking the bank, that would be a truly, truly amazing and useful thing. I already use the web to find tons and tons of things that I would have gone to the library to find not all that long ago. Even without this thing Google is planning, the web has helped me reduce my trips to the library significantly -- especially over the past... oh... I'd say four years or so.

    That having been said, I honestly hope that the Internet and its worldwide web never completely replace libraries. While I generally despise trips to new and modern libraries, I do love finding the time to spend a few hours in a beautiful, high-ceilinged, ornate, old one with big leather chairs in little anterooms where one can just read for a while in a place of beauty.

    I want libraries -- especially old ones, like I just described -- to never go away for much the same sort of reasons that I hope books and newspapers that you can actually hold in your hand (as opposed to being only on futuristic video tablet devices) never go away. I never want society to lose the feeling of holding a newspaper, magazine or book in one's hands. Sadly, though probably not in my lifetime, I believe that that's where we're headed.

    Speaking of page-turning...

    Indeed, but I don't know if you also saw the companion news stories about this subject on the TV stations. At least one of them showed, in action, one of the new scanning machines that will be used -- the kind that uses robotics to hold the book, turn the pages, and do the scanning -- and though I've seen that sort of device in action before, I'd never seen one as fast as the one they showed in that TV news story. I'm sure they're outrageously expensive (although Google, of all entities, can certainly afford a few almost no matter what they cost), but from what I saw of their speed, a mere half dozen or so of them running 24 hours a day could easily knock it out in a helluva lot shorter time than 19 years.

    Nineteen years is a long time -- especially in "technology years," if you know what I mean. Even if it's true that it would take 19 years, that's using today's technology (and/or that which those in the know can envision someone being able to develop during said 19 years). But if being witness to the computer age has taught those of us old enough to have done so nothing else, it is that better and faster technology than most people usually envision (or are even able to envision) tends to sneak up on us when we least expect it; and though 19 years may be the prediction now, sometime during the 19 years a new way of doing it that can shave-off 10 or 15 of those years could very well come along during the first five of them.

    Also, if the book publishers truly cooperate, the full text (as text, and not as graphic images which must be OCRed in order to become text) of most any book published in the last 10 to maybe even as long ago as 20 years will, no doubt, be in some publisher's computer (or backup tape) somewhere and can be brought into the database as indexable, searchable text... which, of course, could happen literally hundreds or even thousands of times faster for a given book than the time it would take to scan said book and then OCR its graphical representations of characters into searchable, indexable text.

    So it may not be as daunting a task as some think. Or maybe it will be. Who knows. I just know that, from what I've observed, the nature of technology is that whatever we think we can't do today, tomorrow we tend to end up shaking our heads in disbelief that we ever thought so.

    And let's not forget that Google typically devotes a whopping 30% or so of its operating cash flow (some $300 million last year, alone) to research and development and R&D-related capitalization. With money like that being tossed at a project like this, technologies for getting these volumes into the database which we can't now even fathom will likely be developed at relatively breakneck speeds.

    Let's also not forget that the kids (and, really, they're mostly all mere children -- at least they are from my vantage point) at Google are truly best-of-breed. I kid you not. I've been in the computer business for nearly three decades, and I dare say I'm about as good at it as most any living person... I'm serious (self-aggrandizing as it most certainly sounds for me to say so); and, that notwithstanding, I was verily humiliated by one of their young managing engineers during a job interview at Google some time back during which I was made to feel inadequate because I could not perform in my head -- and in only seconds -- a complex hex calculation, or a subnet calculation that he had dreamed-up as an interview test. I've got tools right in the quicklinks area of my notebook's desktop that I can pop-up in quite literally an instant to do such calculations for me when I need them -- and damned quickly, too -- but that wasn't good enough for him. "My people," he chastised, "can do that sort of thing in their heads... most of them in their sleep, too." And I didn't get the job -- nor, after that, was I at all sure that I wanted it in any case.

    Those of you not in the Bay Area may not realize that Google is the sort of place that tries to cherry-pick the very best minds from among the very best minds in truly creative ways; and summarily rejects people that most other companies would more or less kill to be able to hire. It's uncanny. From Google's annual programming contests culminating in a weekend wherein international finalists sit in a big room and are given problems and the first ones to write the very best code that best resolves them get a job; to posting complex mathematical problems on outdoor billboards, the answer to which is a web URL where special Google jobs are listed which only those smart enough to solve the math problem can get to, Google really is where some of the best, brightest and sometimes even scariest engineers routinely do their thing. In my opinion, if any company out there can figure out a way to do this monumental thing in a helluva lot less than 19 years, it's Google. Believe it!
     
    Last edited by a moderator: Dec 21, 2004
  3. Han

    Han New Member

    The beta version of Google Scholar is also out, and it is great... only a matter of time.
     
  4. decimon

    decimon Well-Known Member

    I'd rather have all of this on DVD.

    The New York Public Library had (has?) a program of capturing its old books on 35mm film. A project of that size can last a career.
     
