
Using Google Vision API for Detecting Text in an Image

Getting started with Google Vision API to detect text in a dense document. The post comes with code examples to get you started quickly with the Google Vision API.

Google Vision API offers a relatively inexpensive way to extract text from a variety of documents, including images. This can be a great way to augment an existing application that relies on such functionality without building an in-house text detector.

The API is straightforward to use once you have set up a Google Cloud account and created a service account. This brief tutorial shows how to do exactly that. Hopefully, it gets you started on using Google's Vision API in your own projects.

But first, let's look at how much it will cost us. Google offers a $300 introductory credit for new users, and you might be fully covered by that. Otherwise, the first 1000 requests are free, and after that it is $1.50 per 1000 requests. You can also check the full pricing breakdown here or check out the pricing chart below.


Google Vision API pricing
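To make the pricing concrete, here is a minimal sketch of the arithmetic. It assumes the flat rate quoted above applies to every request past the free ones and that partial blocks of 1000 are charged proportionally. Both are simplifying assumptions, and the function name estimate_cost is only illustrative, so treat the result as a rough estimate and check the pricing page for the current numbers.

```python
def estimate_cost(requests, free_requests=1000, price_per_1000=1.50):
    """Rough cost in USD for a number of text detection requests.

    Assumption: every request beyond the free ones is billed at the same
    flat rate, proportionally to the number of requests.
    """
    billable = max(0, requests - free_requests)
    return billable / 1000 * price_per_1000


for n in (500, 1000, 5000, 20000):
    print(f"{n} requests -> ${estimate_cost(n):.2f}")
```

Under these assumptions, 5000 requests come to 4000 billable requests, or $6.00.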

Creating Google Service Account

I am assuming you already have a Google Cloud account and are logged in through console.cloud.google.com. We need to create a service account that can call the Vision API.

A Google Service Account is a special type of account that allows applications to talk to Google's services. Behind the scenes, it uses a public/private key pair for authentication. Before using the code in this tutorial, you need to download the credentials for a service account in JSON format. Follow the steps in the video below to create a service account and download the keys in JSON format.

Setting up

Once you have downloaded the keys, set the path to the downloaded JSON file as an environment variable.

export GOOGLE_APPLICATION_CREDENTIALS=downloaded_file.json
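Before calling the API, it can save some debugging time to confirm that the variable is set and points at a usable key file. The sketch below is only a sanity check, not part of the Vision API: the helper name missing_key_fields is made up for this post, and the list of fields is what a downloaded service account key normally contains.

```python
import json
import os


def missing_key_fields(key_path):
    """Return the expected service account fields that are absent from the file."""
    expected = ("type", "project_id", "private_key", "client_email")
    with open(key_path, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    return [field for field in expected if field not in data]


key_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if not key_path:
    print("GOOGLE_APPLICATION_CREDENTIALS is not set")
elif not os.path.isfile(key_path):
    print(f"No file found at {key_path}")
else:
    missing = missing_key_fields(key_path)
    print("Key file looks complete" if not missing else f"Missing fields: {missing}")
```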

Example Images

Google OCR can work with a variety of documents and images. Here, I am going to use a screenshot of a product listing from Amazon.com. These listings make an interesting use case for OCR, as they are often text-rich and have enough structure to demonstrate the capabilities of the API. In a future post, I might build on this one to automatically scrape a product listing using crawling and bounding box detection. Feel free to subscribe to get notified when that post is available.

Text Detection

If you are only interested in extracting text and its position, you can use the text_detection function of the Vision API. Let's start by loading the desired image. We will use OpenCV to load the image, but you can also use the Pillow library.

import cv2
from google.cloud import vision
import numpy as np

path = "path/to/file.jpg"

image = cv2.imread(path)

OpenCV reads the image as a NumPy array, which needs to be encoded in an image format before it can be sent to the API.

_, encoded_image = cv2.imencode('.png', image)
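If the image is only needed for the API call and not for drawing on, decoding and re-encoding can be skipped altogether. A minimal sketch, assuming the file on disk is already in a format the API accepts (such as JPEG or PNG); the helper name read_image_bytes is illustrative.

```python
from pathlib import Path


def read_image_bytes(path):
    """Return the raw, still-encoded bytes of an image file."""
    return Path(path).read_bytes()


# Assumed usage with the client library (not executed here):
# api_image = vision.Image(content=read_image_bytes("path/to/file.jpg"))
```

For the rest of this post we keep the OpenCV version, because we want the decoded image to draw bounding boxes on.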

Next, we create a Vision API client and call its text_detection function to get the detected text.

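# The client picks up the credentials from GOOGLE_APPLICATION_CREDENTIALS
client = vision.ImageAnnotatorClient()
api_image = vision.Image(content=encoded_image.tobytes())
response = client.text_detection(image=api_image)
texts = response.text_annotations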

The objects returned in text_annotations are documented here. We can use the bounding_poly property of each returned object to draw bounding boxes around the detected text.

import matplotlib.pyplot as plt

# Stop early if the API reported a problem
if response.error.message:
    raise RuntimeError(response.error.message)

for text in texts:
    # print(text.description)
    vertices = np.array(
        [(vertex.x, vertex.y) for vertex in text.bounding_poly.vertices]
    )
    # We are using the cv2.rectangle method to draw bounding boxes.
    # It needs 2 points on the image to draw the box: the top-left
    # and the bottom-right coordinates, which we can get as below.
    xmin, xmax = int(vertices[:, 0].min()), int(vertices[:, 0].max())
    ymin, ymax = int(vertices[:, 1].min()), int(vertices[:, 1].max())
    cv2.rectangle(image, (xmin, ymin), (xmax, ymax), (0, 255, 0), 1)

plt.figure(figsize=(20, 40))
# OpenCV stores pixels as BGR, while matplotlib expects RGB
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()

... and here is the output from running the above full code.
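One detail worth knowing about text_annotations: the first entry normally holds the entire detected text with a box around all of it, and the remaining entries are the individual pieces. The sketch below pairs each piece with a simple rectangle and skips that first entry. It uses stand-in objects shaped like the API response so it runs without credentials; the helper names to_rect, word_boxes and fake_annotation are made up for this example.

```python
from types import SimpleNamespace


def to_rect(bounding_poly):
    """Collapse a polygon into (xmin, ymin, xmax, ymax)."""
    xs = [vertex.x for vertex in bounding_poly.vertices]
    ys = [vertex.y for vertex in bounding_poly.vertices]
    return min(xs), min(ys), max(xs), max(ys)


def word_boxes(texts):
    """Pair each detected piece of text with its rectangle, skipping the
    first annotation, which normally covers the whole image."""
    return [(t.description, to_rect(t.bounding_poly)) for t in list(texts)[1:]]


# Stand-in for an API annotation, for illustration only.
def fake_annotation(description, points):
    vertices = [SimpleNamespace(x=x, y=y) for x, y in points]
    return SimpleNamespace(
        description=description,
        bounding_poly=SimpleNamespace(vertices=vertices),
    )


texts = [
    fake_annotation("Hello world", [(10, 10), (120, 10), (120, 30), (10, 30)]),
    fake_annotation("Hello", [(10, 10), (60, 10), (60, 30), (10, 30)]),
    fake_annotation("world", [(70, 10), (120, 10), (120, 30), (70, 30)]),
]
print(word_boxes(texts))
```

With a real response, passing response.text_annotations to word_boxes should work the same way, since the function only relies on the description and bounding_poly attributes.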

Structured Output from the OCR API

If you want a more structured output, or you have a multi-page dense document, you can call the document_text_detection method and read the full_text_annotation property of its response. It is a nested object that breaks the document down into pages -> blocks -> paragraphs -> words -> symbols. The hierarchical response has the following structure.

Under each word, every character is also detected as a symbol (not shown here). The full documentation of this nested object can be found here.
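To get a feel for the hierarchy, here is a small sketch that walks it from pages down to symbols and rebuilds each paragraph as a plain string. It runs on stand-in objects shaped like full_text_annotation, so no API call is needed. Joining words with a single space is a simplification of mine: it ignores any break information the API attaches to symbols. The helper names are illustrative.

```python
from types import SimpleNamespace


def paragraph_texts(document):
    """Walk pages -> blocks -> paragraphs -> words -> symbols and rebuild
    each paragraph as a plain string (words joined by a single space)."""
    paragraphs = []
    for page in document.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                words = [
                    "".join(symbol.text for symbol in word.symbols)
                    for word in paragraph.words
                ]
                paragraphs.append(" ".join(words))
    return paragraphs


# Stand-ins shaped like the API objects, for illustration only.
def fake_word(text):
    return SimpleNamespace(symbols=[SimpleNamespace(text=ch) for ch in text])


def fake_paragraph(sentence):
    return SimpleNamespace(words=[fake_word(w) for w in sentence.split()])


document = SimpleNamespace(pages=[SimpleNamespace(blocks=[
    SimpleNamespace(paragraphs=[
        fake_paragraph("Wireless mouse"),
        fake_paragraph("In stock"),
    ]),
])])
print(paragraph_texts(document))
```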

The example below shows how to extract these objects from the response and draw bounding boxes on detected text using their position in the hierarchy.

from google.cloud import vision
import cv2
import numpy as np
import matplotlib.pyplot as plt

def get_bounds(document):
    # Collect the bounding box of every block, paragraph and word
    bounds = {}
    bounds['blocks'] = []
    bounds['paragraphs'] = []
    bounds['words'] = []
    for page in document.pages:
        for block in page.blocks:
            bounds['blocks'].append(block.bounding_box)
            for paragraph in block.paragraphs:
                bounds['paragraphs'].append(paragraph.bounding_box)
                for word in paragraph.words:
                    bounds['words'].append(word.bounding_box)
    return bounds

def draw_bounds(image, bounds, color, thickness):
    for bound in bounds:
        pts = np.array([(vertex.x, vertex.y) for vertex in bound.vertices])
        xmin, xmax = int(pts[:, 0].min()), int(pts[:, 0].max())
        ymin, ymax = int(pts[:, 1].min()), int(pts[:, 1].max())
        cv2.rectangle(image, (xmin, ymin), (xmax, ymax), color, thickness)

# Colors are RGB because the image is converted to RGB before drawing
red =  (255, 0, 0)
green = (0, 255, 0)
blue =  (0, 0, 255)

path = "path/to/file.jpg"
image = cv2.imread(path)
_, encoded_image = cv2.imencode('.png', image)

client = vision.ImageAnnotatorClient()
api_image = vision.Image(content=encoded_image.tobytes())

response = client.document_text_detection(image=api_image)
if response.error.message:
    raise RuntimeError(response.error.message)
document = response.full_text_annotation

# OpenCV loads images as BGR; convert so matplotlib shows the right colors
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

bounds = get_bounds(document)
draw_bounds(image, bounds['blocks'], red, 2)
draw_bounds(image, bounds['paragraphs'], green, 1)
draw_bounds(image, bounds['words'], blue, 1)

plt.figure(figsize=(30, 50))
plt.imshow(image)
plt.show()

We are drawing bounding boxes around blocks, paragraphs and words. The red boxes mark blocks, the green boxes mark detected paragraphs, and the blue boxes mark individual words.