Tuesday, May 14, 2013

Don't Suffer with Scrofulous Scanning

By Kimberly Hitchens, founder and owner of Booknook.biz, an ebook production company that has produced books for over 1,000 authors and imprints, with over 2,000 books up on Amazon.

To Scan or Not to Scan, that is the question...okay, it's not the question. You have a book in print, or you have an old "typed" or printed-out manuscript in a drawer, and the Kindle Gold Rush has motivated you to consider publishing your backlist, or pulling out that old manuscript and dusting it off.  But...then what do you do?

Why Does Quality Scanning Matter?

I’m often asked, in the course of our business, about scanning and OCR from print.  Basically, if you’re an author with a previously-published book (or old manuscripts for which you no longer have disks), and no matching digital file, in order to take the first step on the road to putting your book into eBook form you must first put it into digital format.  This first step is called “Scanning and OCR.”

What the Hell Does "Scanning and OCR" Mean?

Scanning means essentially what it sounds like; someone puts the pages of your book on a scanner, scans the page, creating an image of the page, and does this some 300 or so times, to capture all the pages of your book.  “OCR” means “Optical Character Recognition,” and this is where talent and training start to kick in; this is when the computer “reads” those images of text that are on a page, created from the scan, and “translates” them into actual characters, letters and numbers, that can be used by a computer as text, rather than images of text.  It’s like the difference between “old style” PDFs (just images of text) and newer, searchable PDFs, which are actually the text in a non-reflowing layout.

Why Not Just Use the Cheap Guys Down the Street?

Now, I know better than most that publishing is a business.  Someone with a backlist title wants, of course, to get the most “bang for their buck,” and manage not to go too deep in the hole before they get their book on Amazon, available for sale.  But while it’s tempting to scrimp on scanning—especially with so many cheap scanning companies out there—it’s almost never worth the savings.  If you’re one of those folks with  pretty good expertise in the workings of Word, and more time than money, then using a cheap scanner could be the path for you.  If, however, you have a limited amount of your own time, or aren't that comfortable figuring out things like how to delete section breaks in Word, or to create new paragraph styles (or clean up old ones), then using a high-quality scanner can save you a lot of time and aggravation.

Most people are ill-prepared for the reality of “raw” scan output, before the file’s been proofed.  While many scanning firms, particularly the low-cost ones, will do a book for under $20 or $30 (and some even advertise $1/book scanning), what this means is that most will simply make an imaged PDF of the book, not a file you can edit or turn into a Word file for editing.  The second step, the OCR, doesn't get done with these “super-cheapo” firms.  Even those that do make the OCR’d file tend not to do a very great job of cleaning up the resulting scan.  Raw OCR output will, almost inevitably, be riddled with errors.  Some words, for example, like “fiat,” will almost invariably come out as “hat,” not “fiat.”  Because “hat” is a real word, the program won’t recognize this as a scanning error, so every scan has to be proofread, letter by letter and word by word, for scanning errors, no matter who does the work.  The words that the program does recognize as erroneous, or likely erroneous, will be marked in RED, to help you find them.  I've seen scanned pages that look like the battle of Gettysburg, when it comes to the amount of red on the page.  The better the scanning company, the fewer “bloody” marks you’ll find on the Word output page.  You can download and review a sample of some better-quality “raw” scan output by clicking here (will display in Word).
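If you're comfortable with a little scripting, you can at least narrow down where to look for the errors the software doesn't mark in red.  What follows is only a minimal Python sketch, not something any scanning firm is known to use: the watch list and the function name are made up for illustration, and it assumes you've saved the raw output as plain text.  It simply lists spots to check by eye; it does not replace proofreading.

```python
import re

# Hypothetical watch list: real words that may be misreadings of other words.
# "hat" for "fiat" is the example from this post; add your own as you find them.
SUSPECT_WORDS = {"hat": "fiat"}

def flag_suspects(text, suspects=SUSPECT_WORDS):
    """Return (line_number, word_as_found, possible_intended_word) for each hit."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Pull out runs of letters and apostrophes as candidate words.
        for word in re.findall(r"[A-Za-z']+", line):
            intended = suspects.get(word.lower())
            if intended is not None:
                hits.append((line_number, word, intended))
    return hits
```

Every hit still needs a human decision, since sometimes a hat really is a hat.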

Won't It Be The Same Wherever I Go for Scanning?

A good firm will have “trained” its software to the highest degree, and this will result in far fewer errors in the scanning and OCR.  (Yes, you can actually “train” OCR software to make fewer recognition errors.  It takes a lot of time, and effort, but the better firms do it.)  This means less work for you and a decreased likelihood that you’ll receive one of those “Kindle Quality Notices” pointing out typos that have to be fixed, if your own final proofing isn't as rigorous as it should be.

The next issue to consider is the experience of the scanner.  Some firms don’t really know how to use the OCR software, and commit beginner errors like outputting the OCR to Word all right, but with the text on each page formatted within a “frame.”  When this happens, there’s no quick way around it, no easy fix: all the text inside these frames (think “text boxes”) has to be copy-and-pasted by hand into a new Word document, in order to flow properly.  There is no reason for this to happen.  Only the most amateurish scanning firms make this mistake, so if you see it, you've been warned.  If you click the output text on a page, and a box appears “around” the text…that’s text in a frame, and you’ll have to manually cut-and-paste all the text, as described above, to “fix” it, before you can give it to a conversion house like ours.  (To see output text "inside" a frame, you can download a Word file scanned this way, HERE.  To see the "frames," you'll have to download the file and open it in Word on your own computer; note the "boxes" around the text, and just imagine copy-and-pasting 300 pages of this!)

One of the other things that a good scanner will do, that a bad one won’t, is try to ensure that you’re offered “corrective editing,” (or something with a similar name), in which the scanning company will proof the output page, by eye, line by line, against the page provided to them for scanning.  This is usually somewhat costly—around $1/page—but many clients feel that the time-savings to themselves is worth it.  No scanning company proofreads the work, unless asked, and then, not without additional cost.  Don’t make the mistake of thinking that any scanning and OCR company is proofing the output—that’s not the way it works!

Uh-Oh, More Pilcrow Talk!

Broken paragraphs at page endings are a frequent problem with scanned pages.  When reading a printed page that breaks, in the middle of a paragraph, at page’s end, we don’t give it another moment’s thought.  But when that page is output to Word, from a scan, the scan will almost always contain a “pilcrow” character (¶) at the end of that last line, which tells Word, and any other type of program, that the location where the line broke is the end of a paragraph, and that the next words (which are at the top of the next printed page) are the beginning of a new paragraph.  If this isn't fixed before your output manuscript/book is sent to a conversion house, you’ll have two paragraphs where you should have had one, resulting in more editing effort by you, more work for the conversion house, and a higher end cost, as all conversion houses charge for edits you make to your content post-production.  So:  keep your eyes on those errant pilcrows!  (See my blog articles on pilcrows: http://crimefictioncollective.blogspot.com/2011/10/pilcrow-go-go.html, http://crimefictioncollective.blogspot.com/2011/11/pilcrow-no-nos-part-ii.html and http://crimefictioncollective.blogspot.com/2012/05/dozen-dos-and-donts-on-prepping-your.html [#11 in that article]).  Again, it’s not the job of the scanner to fix these page-ending pilcrows; cleaning them up and prepping the manuscript for submission, either to someplace like Createspace or to a firm like mine for making eBooks, is your job as the publisher.  This is why using a better scanning firm is in your best interests; the fewer messy parts you have to fix, from the scanning job, the easier your responsibilities will be to execute.
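For the script-minded, the page-ending pilcrow problem described above can be partly automated.  This is a rough Python sketch under stated assumptions, not a tool we or any scanning firm use: it assumes you've already pulled the paragraphs out of the file as a list of strings, the function name is hypothetical, and the rule it applies is just a guess (a paragraph that doesn't end in closing punctuation, followed by one that starts with a lowercase letter, was probably split by a page break).

```python
# Characters that plausibly close a complete paragraph (an assumption;
# adjust for your own manuscript).
SENTENCE_ENDINGS = ('.', '!', '?', '"', '\u201d', ':', ')')

def merge_broken_paragraphs(paragraphs):
    """Rejoin paragraphs that look like they were split at a page break."""
    merged = []
    for para in paragraphs:
        text = para.strip()
        if not text:
            continue  # drop empty paragraphs left over from the scan
        previous_is_unfinished = bool(merged) and not merged[-1].endswith(SENTENCE_ENDINGS)
        if previous_is_unfinished and text[0].islower():
            # Likely the continuation of the previous paragraph.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged
```

The guess will miss a break where the next page happens to start with a capitalized word, such as a name, and it does nothing about words hyphenated across the break, so you still need to check the page endings by eye.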

So:  Who Should I Use, Then?

These are just a very few of the reasons that hiring a high-quality scanning firm is in your best interests, if remotely in the reach of your budget.  We at Booknook.biz, when asked, invariably recommend Golden Images Scanning, owned by Stan Drew, http://www.pdfdocument.com, 636-379-9999, for the best quality work.  Stan’s used by the top author clients that we have, and for good reason.  He does give multiple discounts for repeat clients, multiple books, and the like, so you should not hesitate to pick up the phone and give him a call if you have a backlist book that you need to get into digital form, either for re-issuance as a print book, or for making eBooks for sale.  We don’t recommend any of the very inexpensive scanners, because the results we've seen from these places have been so wildly inconsistent that we can’t in good conscience say, “well, this one works great or pretty good, but that one doesn't,” because they all seem to have their bad and very bad days.

Worse, many don’t  have phone numbers where you can pick up the phone when you have a problem and get an answer, or even someone to talk to.  This isn't to say that the least-expensive firms can’t on occasion do a decent job…but for 100% reliability, time and again, we like to recommend companies that we trust; in this case, Stan’s company.  In the last 4, almost 5 years, we haven’t seen scanning work from anyone else that’s even come close to the quality of Stan’s.  Yes, using firms that are actually using offshore labor, and thus cheap, is very seductive, for obvious reasons. Who doesn't want to save money?  But before you commit your valuable manuscript to a lower-quality firm, do give the points discussed in this blog piece some consideration.  Remember how much blood, sweat and tears you put into that book in the first place, and how important it is to you that it stay as you wrote it, not full of introduced typos and errata from a bad scan job.  

And that's this week's tip on Bookmaking from Booknook.Biz.  There's no crime in bargain-hunting; we encourage our clients to go out and seek competitive bids, always.  But when you competitively price scanning, make sure you are pricing apples to apples--and not apples to lemons.  

Thanks!  Until next time...



  1. Good advice, Hitch. I'll be sharing this one.

  2. I haven't compared prices recently, but hiring a fast/accurate typist to recreate a book/manuscript into a Word doc could be cheaper. Twelve years ago, when OCR was still primitive and very expensive, that's the choice I made when I discovered that my files for The Baby Thief had disappeared. (A computer fire, followed by a young son grabbing one of my floppy disks...)

    In addition to getting a clean version of the manuscript, the typist also told me that she loved the story so much she'd started reading for pleasure for the first time in her life. She hadn't read a book since high school.

    Anyway, thanks for an informative post.

  3. Ah, Hitch, I love it when you talk technical...

  4. Hey, gang:

    You betcha, Gayle, I'm all about the tecchie stuff. I know that it's tedious going, but...we get a lot, a LOT, of scanned books here at Booknook.biz, and it makes me cringe to see what people have paid for. Yes, paying for cheap scanning, and then spending 30-40 hours of corrective editing works; but it can sure take a toll, especially if you're planning on proofing it again pre-publication, after it's laid out for print or digital book.

    And, LJ: yes, it's possible. I don't know what transcription is running these days, but the last client I had who had a book transcribed paid out about $1100. OUCH! ;-)

    Until next time, when I come to bore you...


