
I'm fascinated by the CAPTCHA system used on SO... I would like to know more about the "many factors" which make reCAPTCHA work. The developers, understandably given the potential for abuse, keep rather quiet about the exact inner workings of their system... But the behavior is well-documented, and so perhaps my curiosity can still be sated:

If I were to design a clone of reCAPTCHA, how might I go about it?


reCAPTCHA allows:

  1. a typing mistake
  2. at a place where people tend to make them. This suggests to me that you need historical data about errors, and an algorithm built on that data.

The detection of typing mistakes requires extensive use of databases: one for words from books being digitized and the other for words which are known.
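
A minimal sketch of how such a typo tolerance could be checked, assuming plain Levenshtein edit distance against the known word. The function names and the `max_typos` threshold are my own assumptions for illustration; nothing here is confirmed to be what reCAPTCHA actually does.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def accept_answer(typed: str, known_word: str, max_typos: int = 1) -> bool:
    """Hypothetical check: pass if the typed word is within max_typos edits."""
    return levenshtein(typed.strip().lower(), known_word.lower()) <= max_typos
```

With historical error data, the flat substitution cost of 1 could be replaced by a per-letter-pair cost, so that common confusions are penalized less than unlikely ones.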

Known technical details

  1. two databases: one for known words and the other for unknown words
  2. a further database for combinations of words

Unknown technical details

  1. How can the words be separated on the fly so that you see a combination of words from different databases? This is a signal-processing problem.
  2. How is the data from the two databases presented to the user?
  3. What is the initial format of the data in the two separate databases? PDF?
  4. What is the format of the data once the two databases are combined? PDF?
  5. How can the data from two PDF files be combined into one?
  6. How can you rotate images efficiently?
  7. Which algorithms are used to separate the images from the book?

Related topics

  1. signal processing
  2. calculus: Fourier series and Laplace transforms for word-detection algorithms
  3. probability theory: to have a "computer-human" coefficient which is only passed at, for instance, a 95% confidence level
  4. Perhaps number theory: the data needs to be stored and compared efficiently
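
To make the probability item concrete, here is a rough sketch of how human answers for an unknown word might be pooled until one reading is trusted. It uses a simple agreement ratio rather than a true confidence interval, and `promote_if_confident`, `min_votes` and `min_agreement` are hypothetical names and thresholds chosen only for illustration.

```python
from collections import Counter
from typing import List, Optional


def promote_if_confident(
    answers: List[str],
    min_votes: int = 5,
    min_agreement: float = 0.95,
) -> Optional[str]:
    """Return the agreed reading of an unknown word, or None if undecided."""
    # Normalize case and whitespace, and ignore empty submissions.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None  # not enough evidence yet
    word, count = Counter(normalized).most_common(1)[0]
    # Promote only if the most common reading dominates the votes.
    return word if count / len(normalized) >= min_agreement else None
```

A word for which this returns a reading could then be moved from the "unknown" database to the "known" one.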
Léo Léopold Hertz 준영
  • see this question: http://stackoverflow.com/questions/8472/practical-non-image-based-captcha-approaches – z - Jun 02 '09 at 21:01
  • @yx: The post does not answer my question. I want to know how many typing mistakes the captcha allows, and how it knows which letter is correct and which is not. – Léo Léopold Hertz 준영 Jun 02 '09 at 21:08
  • reCAPTCHA works by pulling two word images from scanned books where the default OCR was unable to establish the exact text. One of the words shown is known to the system and the other is known only with a low degree of certainty (possibly even 0). You must enter the known word almost exactly and the lesser-known word within some computed distance of its suspected value. Your input is then used to help establish the value of the unknown word, so that it can eventually move to the 'known' category. – Joel Coehoorn Jun 02 '09 at 21:19
  • So in addition to being the altruistic choice (helping digitize old books), recaptcha is also considered to be very secure, because anything it shows you has already passed a sophisticated and expensive ocr system. – Joel Coehoorn Jun 02 '09 at 21:25
  • The downside is that sometimes you can see some very odd captchas. For example, you might see half of a hyphenated word, a numeric value like a dollar amount or part of a numbered list, or even complete static. – Joel Coehoorn Jun 02 '09 at 21:30
  • Please reopen the question. I am interested in the math and data structures, not in general answers. – Léo Léopold Hertz 준영 Jun 02 '09 at 21:31
  • You won't get the math - the gritty details are of necessity not shared. However, I could tell you how I'd put something like that together, and it's much simpler than what you're proposing. – Joel Coehoorn Jun 02 '09 at 21:46
  • @Masi: I've edited this in hope that it could be turned into something answerable. I understand your curiosity, but asking for details of a specific system on a public site when the developers aren't even putting those details on their own site is setting yourself up for disappointment. – Shog9 Jun 04 '09 at 17:49
  • @Shog9: Yes, it is difficult to get good answers on challenging topics. However, this thread is a long-term project which I aim to solve. I will give more exact details, such as the algorithms, as soon as I get them. – Léo Léopold Hertz 준영 Jun 06 '09 at 15:40

1 Answer


reCaptcha

Ólafur Waage
  • I read the pages. However, they do not answer my question. They do not say how reCAPTCHA really works. How many typing mistakes does reCAPTCHA allow? If reCAPTCHA is unsure about the correct word, how does it decide whether the user's letter is correct or not? Your link mentions that the words are the ones computers cannot read. If computers cannot read the words, how do they know whether the user gives the right answer? – Léo Léopold Hertz 준영 Jun 02 '09 at 21:11
  • It's on their wiki page under FAQ: http://wiki.recaptcha.net/index.php/FAQ#reCAPTCHA_is_accepting_incorrect_words – Ólafur Waage Jun 02 '09 at 21:15
  • @Waage: It seems that they keep the details hidden: "This is tuned dynamically based on many factors." – Léo Léopold Hertz 준영 Jun 02 '09 at 21:18