Scene Text Detection

Chucai Yi1,2 and Yingli Tian1,2

1 - Media Lab, Dept. of Electrical Engineering, City College of New York, New York, USA

2 - Dept. of Computer Science, The Graduate Center, City University of New York, New York, USA.


Text in natural scenes is an important cue for understanding the surrounding environment. With the spread of smart mobile devices, text information retrieval will play a significant role in many image-based and video-based applications, such as assistive navigation, assistive reading, scene understanding, and geographical localization. However, scene text extraction remains a challenging problem because of cluttered backgrounds and the wide variety of text patterns in natural scenes. Text extraction from a scene image is divided into two tasks: text detection, which finds the image regions containing text, and text recognition, which transforms the image-based text characters into readable codes. This page focuses on scene text detection.



Cluttered backgrounds in natural scenes.

Multiple text patterns with large intra-class variations in size, color, distortion, and string orientation.



Text detection is essentially a process of step-by-step background removal.

Fig 1. A flowchart of the scene text detection process, including layout analysis by color uniformity and character alignment, and structural analysis by feature maps and a learning model.



Layout Analysis and Text Structure Feature

Color uniformity: We observe that text is generally attached to a planar carrier, which we call the attachment surface. The attachment surface consists of pixels with uniform color that lie near the character boundaries but outside the character strokes. Color uniformity can therefore be defined jointly for text and attachment surface as bigram color uniformity, which we represent by a color-pair composed of the text color and the surface color. The color-pair reflects the color uniformity of each as well as the color difference between them. In addition, text boundaries at different positions should be assigned to different boundary layers as far as possible, even when they have uniform color values and similar color differences. We therefore add spatial position information to the boundary clustering process.

Fig 2. Examples of boundary layers from scene images; edge pixels with similar color-pairs and spatial positions are grouped into the same layer. Boundaries at different vertical positions are assigned to different boundary layers because the clustering process includes y-coordinate spatial information.
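The grouping of edge pixels into boundary layers can be illustrated with a small sketch. It assumes each edge pixel is described by its y-coordinate, a text color, and an attachment-surface color, and it uses plain k-means as a stand-in, since this page does not name the clustering algorithm. The function names, the `y_weight` parameter, and the normalization are hypothetical choices made for illustration only.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization.

    Assumption: k-means is only a stand-in for the clustering step.
    """
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def boundary_layers(edge_pixels, k, image_height, y_weight=1.0):
    """Cluster edge pixels on (color-pair, weighted y) features.

    edge_pixels: list of (y, text_rgb, surface_rgb), RGB values in 0..255.
    y_weight (hypothetical) controls how strongly vertical position
    separates boundaries that share the same color-pair.
    """
    feats = [
        tuple(ch / 255.0 for ch in text_rgb + surface_rgb)
        + (y_weight * y / image_height,)
        for y, text_rgb, surface_rgb in edge_pixels
    ]
    return kmeans(feats, k)


# Two white-on-red text lines at different heights, one black-on-yellow line.
WHITE, RED, BLACK, YELLOW = (255, 255, 255), (255, 0, 0), (0, 0, 0), (255, 255, 0)
pixels = (
    [(y, WHITE, RED) for y in (10, 11, 12)]
    + [(y, WHITE, RED) for y in (80, 81, 82)]
    + [(y, BLACK, YELLOW) for y in (10, 11, 12)]
)
layers = boundary_layers(pixels, k=3, image_height=100)
```

With the y-coordinate included, the two white-on-red lines fall into different layers even though their color-pairs are identical; setting `y_weight=0` in this sketch merges them into one layer.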


Horizontal alignment: Based on our observations, text in natural scenes usually appears as text strings, each composed of three or more collinear characters. We therefore perform heuristic grouping and structural analysis on the character candidates to fit aligned text lines, which can then be extended into image regions of text strings.

Fig 3. (a) Sibling group of the connected component ‘r’ where ‘B’ comes from the left sibling set and ‘o’ comes from the right sibling set; (b) Merge the sibling groups into an adjacent character group corresponding to the text string “Brolly?”; (c) Two detected adjacent character groups marked in red and green respectively.
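As a rough illustration of the grouping idea, the sketch below treats each character candidate as a bounding box `(x, y, w, h)`, links two boxes as siblings when they have similar height, aligned vertical centers, and a small horizontal gap, and keeps only groups with at least three members. The thresholds (`max_height_ratio`, `max_gap_factor`, the half-height alignment tolerance) and all function names are assumptions for this example, not values taken from this page.

```python
def are_siblings(a, b, max_height_ratio=2.0, max_gap_factor=3.0):
    """Heuristic sibling test between two boxes (x, y, w, h); thresholds are assumed."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    similar_height = max(ah, bh) / min(ah, bh) <= max_height_ratio
    # Vertical centers must lie within half of the smaller height.
    aligned = abs((ay + ah / 2) - (by + bh / 2)) <= 0.5 * min(ah, bh)
    # Horizontal gap between the two boxes (negative if they overlap).
    gap = max(ax, bx) - min(ax + aw, bx + bw)
    close = gap <= max_gap_factor * max(aw, bw)
    return similar_height and aligned and close


def group_characters(boxes, min_members=3):
    """Merge sibling pairs with union-find; keep groups of >= min_members boxes."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if are_siblings(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    # Order each group left to right, as characters in a string.
    return [sorted(g, key=lambda i: boxes[i][0])
            for g in groups.values() if len(g) >= min_members]


def string_region(boxes, group):
    """Bounding box (x, y, w, h) that encloses every box in the group."""
    x0 = min(boxes[i][0] for i in group)
    y0 = min(boxes[i][1] for i in group)
    x1 = max(boxes[i][0] + boxes[i][2] for i in group)
    y1 = max(boxes[i][1] + boxes[i][3] for i in group)
    return (x0, y0, x1 - x0, y1 - y0)


# Five aligned candidates, one isolated blob, and a pair that is too short.
candidates = [(10, 50, 12, 20), (25, 50, 12, 20), (40, 50, 12, 20),
              (55, 50, 12, 20), (70, 50, 12, 20),
              (200, 10, 30, 30),
              (10, 100, 12, 20), (25, 100, 12, 20)]
strings = group_characters(candidates)
regions = [string_region(candidates, g) for g in strings]
```

In this toy input only the row of five boxes survives: the isolated blob has no siblings, and the two-box pair is dropped by the three-member rule.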


Stroke width and orientation: Strokes in different orientations form the basic structure of text characters. We propose a new type of feature, stroke orientation, to describe the local structure of text characters. At the pixel level, stroke orientation is perpendicular to the gradient orientation at pixels on the stroke boundary. To model text structure by stroke orientations, we propose an operator that maps the gradient feature of strokes to each pixel: it extends the local structure of a stroke boundary into its neighborhood by propagating gradient orientations. The result is a feature map for analyzing the global structure of text characters.

Fig 4. Left: an example of stroke orientation labeling. The pixels denoted by blue points are assigned the gradient orientations (red arrows) at their nearest edge pixels, denoted by red points. Right: a 210 × 54 text patch and its 16-bin histogram of quantized stroke orientations.
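The labeling described in the caption can be sketched in a few lines. The example assumes the edge pixels and their gradient angles (in radians) are already available, uses a brute-force nearest-edge search that is only practical for small patches, and quantizes angles over [0, 2π) into 16 bins. The function names and the angle range are assumptions for illustration.

```python
import math


def stroke_orientation_map(height, width, edges):
    """Give every pixel the gradient angle of its nearest edge pixel.

    edges: dict {(row, col): gradient_angle_in_radians}.
    Brute-force search; a distance transform would be used on real images.
    """
    edge_items = list(edges.items())
    label_map = []
    for r in range(height):
        row = []
        for c in range(width):
            _, angle = min(
                edge_items,
                key=lambda e: (e[0][0] - r) ** 2 + (e[0][1] - c) ** 2,
            )
            row.append(angle)
        label_map.append(row)
    return label_map


def orientation_histogram(label_map, bins=16):
    """Normalized histogram of angles quantized over [0, 2*pi)."""
    hist = [0] * bins
    for row in label_map:
        for angle in row:
            b = int((angle % (2 * math.pi)) / (2 * math.pi) * bins) % bins
            hist[b] += 1
    total = sum(hist)
    return [h / total for h in hist]


# Toy 4 x 6 patch with one vertical stroke: its left boundary (column 1)
# has gradient angle 0 and its right boundary (column 4) has angle pi.
edges = {(r, 1): 0.0 for r in range(4)}
edges.update({(r, 4): math.pi for r in range(4)})
labels = stroke_orientation_map(4, 6, edges)
hist = orientation_histogram(labels)
```

For this patch the histogram mass splits evenly between bin 0 and bin 8, the two opposing gradient directions of a vertical stroke. The stroke orientation itself is perpendicular to these labels.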




Scene Text Classifier from Cascade Adaboost Learning

Cascade Adaboost classifiers have proved effective for handling imbalanced training data. The training process is divided into several stages. In each stage, the Adaboost learning model iteratively selects weak classifiers, training on all positive samples together with the negative samples that were incorrectly classified in previous stages. The selected weak classifiers are then integrated into a strong classifier by weighted combination.
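A minimal sketch of this staged training follows. It assumes decision stumps as the weak classifiers and labels in {+1, -1}, and it omits the per-stage threshold tuning that practical cascades use to keep the detection rate high. All function names and the toy data are hypothetical; this is an illustration of the general scheme, not the classifier used in the software on this page.

```python
import math


def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            for pol in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (pol if xi[f] >= thr else -pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def stump_predict(stump, x):
    _, f, thr, pol = stump
    return pol if x[f] >= thr else -pol


def adaboost(X, y, rounds):
    """Select `rounds` stumps; return a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def strong_predict(model, x):
    """Weighted combination of the selected weak classifiers."""
    score = sum(alpha * stump_predict(s, x) for alpha, s in model)
    return 1 if score >= 0 else -1


def train_cascade(pos, neg, stages, rounds):
    """Each stage sees all positives plus the negatives earlier stages let through."""
    cascade, remaining_neg = [], list(neg)
    for _ in range(stages):
        if not remaining_neg:
            break
        X = pos + remaining_neg
        y = [1] * len(pos) + [-1] * len(remaining_neg)
        model = adaboost(X, y, rounds)
        cascade.append(model)
        # Keep only the false positives of this stage for the next one.
        remaining_neg = [x for x in remaining_neg if strong_predict(model, x) == 1]
    return cascade


def cascade_predict(cascade, x):
    """A sample is accepted as text only if every stage accepts it."""
    return all(strong_predict(m, x) == 1 for m in cascade)


# Toy 2-D features: a single stump cannot separate them, two stages can.
pos = [(0.8, 0.8), (0.9, 0.9), (1.0, 0.7)]
neg = [(0.1, 0.1), (0.2, 0.9), (0.9, 0.1), (0.1, 0.5)]
cascade = train_cascade(pos, neg, stages=3, rounds=1)
```

On the toy data the first stage rejects three of the four negatives, and the second stage is trained only against the one negative that slipped through, which is how the cascade copes with a large surplus of background samples.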


Scene Text Character Recognition

Link to ..


1) A small software package for scene text detection, which employs adjacent character grouping and stroke-based text feature extraction. [ZIP]

2) Dataset of oriented scene text. [ZIP]


C. Yi and Y. Tian. Text Extraction from Scene Images by Character Appearance and Structure Modeling. In Computer Vision and Image Understanding, Vol. 117, No. 2, pp. 182-194, 2013. PDF

C. Yi and Y. Tian. Localizing Text in Scene Images by Boundary Clustering, Stroke Segmentation, and String Fragment Classification. In IEEE Transactions on Image Processing, Vol. 21, No. 9, 2012. PDF

C. Yi and Y. Tian. Assistive Text Reading from Complex Background for Blind Persons. In ICDAR Workshop on Camera-based Document Analysis and Recognition (CBDAR), Springer LNCS-7139, pp. 15-28, 2011. PDF

C. Yi and Y. Tian. Text String Detection from Natural Scenes by Structure-based Partition and Grouping. In IEEE Transactions on Image Processing (TIP), Vol. 20, No. 9, pp. 2594-2605, 2011. PDF