Search Challenge (9/2/15): How to search in a scanned document?

If your research is like mine…

… you fairly frequently find a document that’s from another era. It doesn’t even have to be that long ago before you find yourself dealing with infernally annoying crufty docs.

For instance, when I’m searching, I fairly often find a document that was scanned as an image. It’s great to have the document in the first place, but as a scan, it’s often less than completely useful.

Here’s an example. A document I found in one of my research studies was this excellent paper that’s available only in a scanned PDF format. (Here’s the LINK to the paper.) When you open it up, you’ll see sections that appear like this:

Of course, our usual Control-F / CMD-F tricks don’t work on this kind of scanned doc, and since this is a long paper, that makes it much harder to read. In particular, what I WANT is something I CAN use Control-F on, like this:

Our SearchResearch Challenge for this week is meant to give you an additional powerful tool for importing scanned documents and making them findable.

1. How can you transform this document (LINK) into something that you can search within?

2. Once you’ve done that, can you determine how many times the authors refer to “multiple documents” in that paper? (This was my original search task: finding interesting papers about how people read multiple documents in the same reading session. That’s how I found this paper.)
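
For the counting half of the Challenge, here is a minimal Python sketch of one way to tally a phrase once you already have the paper's text in searchable form. It assumes the text is available as a plain string; the function name count_phrase and the sample text are made up for illustration and are not taken from the paper. Text recovered from a scan tends to have line breaks and end-of-line hyphens in the middle of phrases, so the sketch smooths those out before counting.

```python
import re


def count_phrase(text: str, phrase: str) -> int:
    """Count case-insensitive occurrences of `phrase`, tolerating line breaks."""
    # Rejoin words hyphenated across a line break, e.g. "docu-\nments".
    # Caveat: a genuine hyphenated compound split at a line end is merged too.
    dehyphenated = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) into one space.
    normalized = re.sub(r"\s+", " ", dehyphenated).lower()
    target = re.sub(r"\s+", " ", phrase.strip()).lower()
    # Word boundaries avoid matching inside a longer word.
    pattern = r"\b" + re.escape(target) + r"\b"
    return len(re.findall(pattern, normalized))


# Invented sample text, standing in for text pulled out of a scan.
sample = (
    "Readers of Multiple\nDocuments often skim. "
    "Comparing multiple docu-\nments takes time."
)
print(count_phrase(sample, "multiple documents"))  # prints 2
```

To run it on a real document, you would read your converted text into a string first (for example from a hypothetical file such as paper.txt) and pass that in place of sample. Treat the number as a starting point: conversion errors in the text can hide a match, so it's worth spot-checking against the page images.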

So this Challenge is really about “tool finding”: can you figure out how to convert a scanned document into a readable / findable / searchable one?

(Big hint: It’s much easier than you think.)

Let us know how you found out how to do the magic process!

Search on!