CaptionBot AI vs. ReCaptcha2

Recently Skype has been pushing its bots. One of them, “CaptionBot”, is an AI that describes what it sees in an image. For the past few months I have been working at a company that does web crawling, and recaptcha is never fun. Sure, there are services that will tell you where to click (for a fee, of course), but they still leave something to be desired. I decided to have a look at CaptionBot and see what results I would get.

First of all, we need a way to get captions from CaptionBot. A little web debugging (watching the requests the browser makes) gives us everything we need for this to work.

Recreating the steps:

> curl 'https://www.captionbot.ai/api/init'
"BVPVg1MWjEM"
> curl 'https://www.captionbot.ai/api/upload' -H 'Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryNAK95DhgT8WjR8' --data-binary $'------WebKitFormBoundaryNAK95DhgT8WjR8\r\nContent-Disposition: form-data; name="file"; filename="Dell-inspiron-3542-ubuntu-os-1.jpg"\r\nContent-Type: image/jpeg\r\n\r\n\r\n------WebKitFormBoundaryNAK95DhgT8WjR8--\r\n'
"https://captionbot.blob.core.windows.net/images-container/trejmded.jpg"
And lastly:
> curl 'https://www.captionbot.ai/api/message' -H 'Content-Type: application/json; charset=UTF-8' --data-binary '{"conversationId":"BVPVg1MWjEM","waterMark":"","userMessage":"https://captionbot.blob.core.windows.net/images-container/trejmded.jpg"}'
"{\"ConversationId\":null,\"WaterMark\":\"131113912716453514\",\"UserMessage\":\"I think it's a flat screen tv. \",\"Status\":null}"

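One detail worth noting in the last reply: the body is a JSON string that itself contains JSON, so it has to be decoded twice. Below is a minimal Python sketch of that step, assuming the reply always has the shape shown above. The helper name extract_caption is my own for illustration and is not part of any API.

```python
import json


def extract_caption(raw_body):
    """Pull the caption out of a raw /api/message reply.

    Assumption: the body is a JSON string whose content is another JSON
    document with a "UserMessage" field, as in the sample reply above.
    """
    inner = json.loads(raw_body)   # first pass: JSON string -> str
    message = json.loads(inner)    # second pass: str -> dict
    return message["UserMessage"].strip()


if __name__ == "__main__":
    # The sample reply from the curl session, kept byte for byte.
    sample = r'''"{\"ConversationId\":null,\"WaterMark\":\"131113912716453514\",\"UserMessage\":\"I think it's a flat screen tv. \",\"Status\":null}"'''
    print(extract_caption(sample))
```

The strip() call only removes the trailing space the bot leaves after the sentence.
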
So I wrote a small Python utility to help me with this: you can find it here. (It is also available on PyPI.)

Next, we need the images and challenges from recaptcha. Easy. All we need to do is to find a test page and open a browser to it, click the checkbox, and then save all the images that come up:

import os
from os.path import join as pathjoin
from io import BytesIO
from hashlib import md5

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from PIL import Image

here = os.path.dirname(os.path.abspath(__file__))
dataset_dir = pathjoin(here, 'dataset')


def crop(image, vertical, horizontal):
    """Yield the tiles of a grid image, row by row."""
    imgwidth, imgheight = image.size
    step_vertical = imgheight / vertical
    step_horizontal = imgwidth / horizontal
    for j in range(vertical):
        for i in range(horizontal):
            box = (i * step_horizontal, j * step_vertical,
                   (i + 1) * step_horizontal, (j + 1) * step_vertical)
            yield image.crop(box)


driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get('https://www.google.com/recaptcha/api2/demo')

while True:
    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(
        (By.XPATH, './/iframe[@title="recaptcha widget"]')
    ))
    # Click the checkbox
    driver.find_element(By.ID, 'recaptcha-anchor').click()
    driver.switch_to.parent_frame()
    WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(
        (By.XPATH, './/iframe[@title="recaptcha challenge"]')
    ))
    # The challenge text, e.g. what kind of object to select
    challenge = driver.find_element(
        By.XPATH,
        './/div[@class="rc-imageselect-desc-no-canonical"]/strong').text
    image_grid = driver.find_elements(
        By.XPATH, './/div[@class="rc-image-tile-target"]//img')

    # Get the dimensions of the grid: the table's class name ends in
    # one digit per dimension
    table_class = driver.find_element(
        By.XPATH,
        './/table[starts-with(@class, "rc-imageselect-table-")]'
    ).get_attribute('class')
    grid_size = tuple(map(int, table_class.split('-')[-1]))

    current_dir = pathjoin(dataset_dir, '%d-%s' % (len(image_grid), challenge))
    os.makedirs(current_dir, exist_ok=True)

    # Get the captcha image so we can crop it
    main_image = Image.open(BytesIO(
        requests.get(image_grid[0].get_attribute('src')).content))
    for im in crop(main_image, *grid_size):
        # Name each tile by the MD5 of its PNG bytes, so a tile that
        # shows up again overwrites itself instead of being duplicated
        out_stream = BytesIO()
        im.save(out_stream, format='PNG')
        hashval = md5(out_stream.getvalue()).hexdigest()
        im.save(pathjoin(current_dir, '%s.png' % hashval))

    driver.refresh()
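
The two densest steps in the script are reading the grid size from the table's class name and cutting the image into tiles. Here is a minimal stand-alone sketch of just that arithmetic, with no browser or image library involved. The helper names parse_grid_size and tile_boxes are hypothetical, and it assumes, like the script, that the class name ends in one digit per dimension (rc-imageselect-table-33 is used here purely as an example value).

```python
def parse_grid_size(table_class):
    """Turn e.g. 'rc-imageselect-table-33' into (3, 3).

    Assumption: the last dash-separated part holds one digit per dimension.
    """
    digits = table_class.split("-")[-1]
    return tuple(int(d) for d in digits)


def tile_boxes(width, height, rows, cols):
    """Return (left, upper, right, lower) pixel boxes, row by row."""
    step_x = width / cols
    step_y = height / rows
    return [
        (round(c * step_x), round(r * step_y),
         round((c + 1) * step_x), round((r + 1) * step_y))
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    rows, cols = parse_grid_size("rc-imageselect-table-33")
    for box in tile_boxes(300, 300, rows, cols):
        print(box)
```

Each box has the same layout as the tuple passed to image.crop in the script, so the list can be fed to it directly.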

I left this running for about a day, though it seems that to get a more varied dataset one should run it once every few days. I only got 4 different challenges: 4x4 street signs, 3x3 rivers, 2x4 store fronts and 3x3 street numbers. Since then I have also started getting captchas for trees and mountains. Because the street numbers and street signs didn’t have enough data, I decided to test only store fronts and rivers.

Naturally, Microsoft’s AI describes the image and doesn’t tag it, so I had to analyze some of the data to find the relevant keywords for each category:

mapping = {
    'rivers': ['river', 'water', 'lake', 'harbor', 'ocean', 'boat', 'moat', 'pond'],
    'store front': ['store', 'in front', 'restaurant', 'side of', 'sign', 'city', 'shop', 'door', 'screen', 'street'],
}
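
To make the idea concrete, here is a minimal sketch of how a caption could be turned into categories with this mapping. It assumes plain case-insensitive substring matching, which is my guess at the simplest approach rather than a description of the exact matching used for the results; categories_for is a hypothetical helper name.

```python
MAPPING = {
    'rivers': ['river', 'water', 'lake', 'harbor', 'ocean', 'boat',
               'moat', 'pond'],
    'store front': ['store', 'in front', 'restaurant', 'side of', 'sign',
                    'city', 'shop', 'door', 'screen', 'street'],
}


def categories_for(caption, mapping=MAPPING):
    """Return the set of categories whose keywords occur in the caption.

    Assumption: a keyword counts if it appears anywhere in the lowercased
    caption, even inside a longer word.
    """
    text = caption.lower()
    return {
        category
        for category, keywords in mapping.items()
        if any(keyword in text for keyword in keywords)
    }


if __name__ == "__main__":
    print(categories_for("I think it's a boat on the water."))
    print(categories_for("I think it's a dog."))
```

Substring matching is crude: 'sign' would also match a caption containing "design", so a word-boundary match may be worth trying if false positives become a problem.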

The results I got for this are:

store front
Pd/Recall: 0.6564885496183206
Precision: 0.593103448275862
rivers
Pd/Recall: 0.9041095890410958
Precision: 0.9295774647887324

Amazing! (For rivers, this means that if there is a river image, there is a 90% chance you will detect it, and that about 7% of the images you select will not actually be rivers.) I have also started testing this on “trees” and “mountains”; hopefully the results will be similar to “rivers”.
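
For reference, here is how the two numbers are defined, as a short Python sketch. The counts in it are hypothetical: I chose them only because they reproduce the ratios reported for rivers, and they are not taken from the actual dataset.

```python
def recall(true_pos, false_neg):
    """Share of the real positives that were detected (Pd)."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos, false_pos):
    """Share of the selected images that really are positives."""
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical counts, picked to be consistent with the rivers figures:
    # 66 rivers found, 7 rivers missed, 5 non-rivers selected by mistake.
    tp, fn, fp = 66, 7, 5
    print("Pd/Recall:", recall(tp, fn))
    print("Precision:", precision(tp, fp))
    print("Share of selections that are wrong:", 1 - precision(tp, fp))
```

Note that one minus the precision is a statement about the images you selected, not about all non-river images, which is why it is phrased as a share of selections.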

AI is progressing fast, but it’s not there yet. It still has some trouble describing images where there is too much going on, but does very well with landscapes.

Warning: If you try this yourself, be prepared for Google to treat you as a bot for a while (requiring captchas for searches, etc.).

Enjoy!
