Mask TextSpotter
1 Introduction
In recent years, scene text detection and recognition have attracted growing research interest from the computer vision community, especially after the revival of neural networks and growth of image datasets. Scene text detection and recognition provide an automatic, rapid approach to access the textual information embodied in natural scenes, benefiting a variety of real-world applications, such as geo-location [58], instant translation, and assistance for the blind.
Scene text spotting, which aims at concurrently localizing and recognizing text from natural scenes, has been previously studied in numerous works [49, 21]. However, in most works, except [27] and [3], text detection and subsequent recognition are handled separately. Text regions are first hunted from the original image by a trained detector and then fed into a recognition module. This procedure seems simple and natural, but might lead to sub-optimal performance for both detection and recognition, since these two tasks are highly correlated and complementary. On one hand, the quality of detections largely determines the accuracy of recognition; on the other hand, the results of recognition can provide feedback to help reject false positives in the phase of detection.
Recently, two methods [27, 3] that devise end-to-end trainable frameworks for scene text spotting have been proposed. Benefiting from the complementarity between detection and recognition, these unified models significantly outperform previous competitors. However, there are two major drawbacks in [27] and [3]. First, neither of them can be trained in a completely end-to-end manner. [27] applied a curriculum learning paradigm [1] in the training period, where the sub-network for text recognition is locked at the early iterations and the training data for each period is carefully selected. Busta et al. [3] first pre-train the networks for detection and recognition separately and then jointly train them until convergence. There are mainly two reasons that stop [27] and [3] from training the models in a smooth, end-to-end fashion. One is that the text recognition part requires accurate locations for training, while the locations in the early iterations are usually inaccurate. The other is that the adopted LSTM [17] or CTC loss [11] is more difficult to optimize than general CNNs. The second limitation of [27] and [3] lies in that these methods only focus on reading horizontal or oriented text. However, the shapes of text instances in real-world scenarios may vary significantly, from horizontal or oriented to curved forms.
In this paper, we propose a text spotter named Mask TextSpotter, which can detect and recognize text instances of arbitrary shapes. Here, arbitrary shapes refer to the various forms of text instances in the real world. Inspired by Mask R-CNN [13], which can generate shape masks of objects, we detect text by segmenting the text instance regions. Thus our detector is able to detect text of arbitrary shapes. Besides, different from previous sequence-based recognition methods [45, 44, 26], which are designed for 1-D sequences, we recognize text via semantic segmentation in 2-D space, to solve the issues in reading irregular text instances. Another advantage is that it does not require accurate locations for recognition. Therefore, the detection and recognition tasks can be trained completely end-to-end, and benefit from feature sharing and joint optimization.
We validate the effectiveness of our model on datasets that include horizontal, oriented, and curved text. The results demonstrate the advantages of the proposed algorithm in both text detection and end-to-end text recognition tasks. Specifically, on ICDAR2015, evaluated at a single scale, our method achieves an F-Measure of 0.86 on the detection task and outperforms the previous top performers by 13.2%-25.3% on the end-to-end recognition task.
The main contributions of this paper are four-fold. (1) We propose an end-to-end trainable model for text spotting, which enjoys a simple, smooth training scheme. (2) The proposed method can detect and recognize text of various shapes, including horizontal, oriented, and curved text. (3) In contrast to previous methods, precise text detection and recognition in our method are accomplished via semantic segmentation. (4) Our method achieves state-of-the-art performances in both text detection and text spotting on various benchmarks.
2 Related Work
3 Methodology
The proposed method is an end-to-end trainable text spotter, which can handle various shapes of text. It consists of an instance-segmentation based text detector and a character-segmentation based text recognizer.
3.1 Framework
The overall architecture of our proposed method is presented in Fig. 2. Functionally, the framework consists of four components: a feature pyramid network (FPN) [32] as the backbone, a region proposal network (RPN) [40] for generating text proposals, a Fast R-CNN [40] for bounding box regression, and a mask branch for text instance segmentation and character segmentation. In the training phase, a number of text proposals are first generated by the RPN, and then the RoI features of the proposals are fed into the Fast R-CNN branch and the mask branch to generate the accurate text candidate boxes, the text instance segmentation maps, and the character segmentation maps.
Backbone Text in natural images varies in size. In order to build high-level semantic feature maps at all scales, we apply a feature pyramid structure [32] backbone with a ResNet [14] of depth 50. The FPN uses a top-down architecture to fuse features of different resolutions from a single-scale input, which improves accuracy with marginal cost.
RPN The RPN is used to generate text proposals for the subsequent Fast R-CNN and mask branch. Following [32], we assign anchors to different stages depending on the anchor size. Specifically, the areas of the anchors are set to {$32^2, 64^2, 128^2, 256^2, 512^2$} pixels on the five stages {$P_2, P_3, P_4, P_5, P_6$}, respectively. Different aspect ratios {0.5, 1, 2} are also adopted at each stage, as in [40]. In this way, the RPN can handle text of various sizes and aspect ratios. RoI Align [13] is adopted to extract the region features of the proposals. Compared to RoI Pooling [8], RoI Align preserves more accurate location information, which is quite beneficial to the segmentation task in the mask branch. Note that no special design for text is adopted, such as special aspect ratios or orientations of anchors for text, as in previous works [30, 15, 34].
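For concreteness, the snippet below enumerates the base anchor shapes implied by the areas and aspect ratios above. It is only an illustrative sketch; the function name and the rounding are our own choices, not part of the detector implementation.

```python
# Sketch of the anchor configuration described above; names are illustrative.
def base_anchor_shapes():
    areas = {"P2": 32**2, "P3": 64**2, "P4": 128**2, "P5": 256**2, "P6": 512**2}
    aspect_ratios = (0.5, 1.0, 2.0)  # height / width
    shapes = {}
    for stage, area in areas.items():
        stage_shapes = []
        for ratio in aspect_ratios:
            # Solve w * h = area subject to h / w = ratio.
            w = (area / ratio) ** 0.5
            h = w * ratio
            stage_shapes.append((round(h), round(w)))
        shapes[stage] = stage_shapes
    return shapes

print(base_anchor_shapes()["P4"])  # [(91, 181), (128, 128), (181, 91)]
```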
Fast R-CNN The Fast R-CNN branch includes a classification task and a regression task. The main function of this branch is to provide more accurate bounding boxes for detection. The inputs of Fast R-CNN are in 7 × 7 resolution, which are generated by RoI Align from the proposals produced by RPN.
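As an illustration of this pooling step, the following sketch uses the standard RoI Align operator from torchvision to extract a 7 × 7 feature for one proposal; the feature-map size and the proposal coordinates are toy values, not taken from the paper.

```python
import torch
from torchvision.ops import roi_align

# Toy feature map: batch of 1, 256 channels, 50 x 50 spatial size.
features = torch.randn(1, 256, 50, 50)
# One proposal in (batch_index, x1, y1, x2, y2) format, in feature-map coordinates.
proposals = torch.tensor([[0.0, 4.0, 10.0, 44.0, 30.0]])

# Extract a 7 x 7 region feature for the Fast R-CNN branch.
roi_feats = roi_align(features, proposals, output_size=(7, 7), sampling_ratio=2)
print(roi_feats.shape)  # torch.Size([1, 256, 7, 7])
```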
Mask Branch There are two tasks in the mask branch: a global text instance segmentation task and a character segmentation task. As shown in Fig. 3, given an input RoI, whose size is fixed to 16 × 64, through four convolutional layers and a de-convolutional layer, the mask branch predicts 38 maps (of size 32 × 128), including a global text instance map, 36 character maps, and a background map of characters. The global text instance map gives an accurate localization of a text region, regardless of the shape of the text instance. The character maps are maps of 36 characters, including 26 letters and 10 Arabic numerals. The background map of characters, which excludes the character regions, is also needed for post-processing.
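A minimal PyTorch sketch of such a branch is given below. The paper does not specify the channel widths or where the 38-way prediction is attached, so the 256-channel convolutions and the final 1 × 1 prediction layer are our assumptions.

```python
import torch
from torch import nn

class MaskBranch(nn.Module):
    """Sketch of the mask branch: 4 conv layers + 1 deconv, predicting 38 maps.
    Channel widths (256) and the 1x1 prediction layer are assumptions."""
    def __init__(self, in_channels=256, num_maps=38):
        super().__init__()
        layers = []
        channels = in_channels
        for _ in range(4):
            layers += [nn.Conv2d(channels, 256, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            channels = 256
        self.convs = nn.Sequential(*layers)
        # 2x upsampling: 16 x 64 RoI features -> 32 x 128 prediction maps.
        self.deconv = nn.ConvTranspose2d(256, 256, kernel_size=2, stride=2)
        self.predict = nn.Conv2d(256, num_maps, kernel_size=1)

    def forward(self, roi_features):   # (N, C, 16, 64)
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))
        return self.predict(x)         # (N, 38, 32, 128)

maps = MaskBranch()(torch.randn(2, 256, 16, 64))
print(maps.shape)  # torch.Size([2, 38, 32, 128])
```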
3.2 Label Generation
For a training sample with the input image I and the corresponding ground truth, we generate targets for the RPN, Fast R-CNN, and mask branch. Generally, the ground truth contains P = {$p_1, p_2, ..., p_m$} and C = {$c_1 = (cc_1, cl_1), c_2 = (cc_2, cl_2), ..., c_n = (cc_n, cl_n)$}, where $p_i$ is a polygon which represents the localization of a text region, and $cc_j$ and $cl_j$ are the category and location of a character, respectively. Note that, in our method, $C$ is not necessary for all training samples. We first transform the polygons into horizontal rectangles which cover the polygons with minimal area. We then generate targets for RPN and Fast R-CNN following [8, 40, 32]. There are two types of target maps to be generated for the mask branch with the ground truth P, C (may not exist) as well as the proposals yielded by RPN: a global map for text instance segmentation and a character map for character semantic segmentation. Given a positive proposal r, we first use the matching mechanism of [8, 40, 32] to obtain the best matched horizontal rectangle. The corresponding polygon as well as characters (if any) can be obtained further. Next, the matched polygon and character boxes are shifted and resized to align with the proposal and the target map of size H × W according to the following formulas: $$B_x = (B_{x_0} − \min(r_x)) × W/(\max(r_x) − \min(r_x)) \qquad (1) \\ B_y =(B_{y_0} −\min(r_y))×H/(\max(r_y)−\min(r_y)) \qquad (2) $$ where $(B_x,B_y)$ and $(B_{x_0},B_{y_0})$ are the updated and original vertices of the polygon and all character boxes; $(r_x,r_y)$ are the vertices of the proposal $r$.
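The following sketch applies Eqs. (1) and (2) to a list of vertices; the function name and the toy proposal are illustrative only.

```python
def normalize_vertices(polygon, proposal, H, W):
    """Shift and resize polygon / character-box vertices into the H x W target map
    (Eqs. 1 and 2). `polygon` is a list of (x, y) vertices; `proposal` is a list of
    (x, y) vertices of the matched proposal r. Names here are illustrative."""
    xs = [x for x, _ in proposal]
    ys = [y for _, y in proposal]
    min_x, max_x = min(xs), max(xs)
    min_y, max_y = min(ys), max(ys)
    normalized = []
    for x0, y0 in polygon:
        x = (x0 - min_x) * W / (max_x - min_x)
        y = (y0 - min_y) * H / (max_y - min_y)
        normalized.append((x, y))
    return normalized

# A proposal spanning x in [100, 300] and y in [50, 150], mapped to a 32 x 128 target.
print(normalize_vertices([(150, 75), (250, 125)], [(100, 50), (300, 150)], H=32, W=128))
# [(32.0, 8.0), (96.0, 24.0)]
```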
After that, the target global map can be generated by simply drawing the normalized polygon on a zero-initialized mask and filling the polygon region with the value 1. The character map generation is visualized in Fig. 4a. We first shrink all character bounding boxes by fixing their center points and shortening the sides to a fourth of the original side lengths. Then, the values of the pixels inside the shrunk character bounding boxes are set to their corresponding category indices, and those outside the shrunk character bounding boxes are set to 0. If there are no character bounding box annotations, all values are set to −1.
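A possible implementation of this character-map target, assuming axis-aligned character boxes and integer pixel rounding (both our choices), is sketched below.

```python
import numpy as np

def make_character_map(char_boxes, char_labels, H, W):
    """Sketch of the character target map: each box is shrunk to 1/4 of its side
    lengths around its center, and the shrunk region is filled with the category
    index (1..36). Pixels outside any shrunk box stay 0; if no character
    annotations exist, the whole map is -1."""
    if len(char_boxes) == 0:
        return np.full((H, W), -1, dtype=np.int32)
    target = np.zeros((H, W), dtype=np.int32)
    for (x1, y1, x2, y2), label in zip(char_boxes, char_labels):
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        w, h = (x2 - x1) / 4.0, (y2 - y1) / 4.0
        xa, xb = int(round(cx - w / 2)), int(round(cx + w / 2))
        ya, yb = int(round(cy - h / 2)), int(round(cy + h / 2))
        target[max(ya, 0):min(yb, H), max(xa, 0):min(xb, W)] = label
    return target

char_map = make_character_map([(10, 4, 30, 28)], [3], H=32, W=128)
print(np.unique(char_map))  # [0 3]
```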
3.3 Optimization
3.4 Inference
Different from the training process, where the input RoIs of the mask branch come from the RPN, in the inference phase we use the outputs of Fast R-CNN as proposals to generate the predicted global maps and character maps, since the Fast R-CNN outputs are more accurate.
Specifically, the inference process is as follows: first, given a test image, we obtain the outputs of Fast R-CNN as in [40] and filter out the redundant candidate boxes by NMS; then, the kept proposals are fed into the mask branch to generate the global maps and the character maps; finally, the predicted polygons are obtained directly by computing the contours of the text regions on the global maps, and the character sequences are generated by our proposed pixel voting algorithm on the character maps.
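As a rough sketch of the last step, the snippet below thresholds a predicted global map and extracts a polygon with OpenCV. The 0.5 threshold and the choice of the largest contour are our assumptions, and the resulting vertices still live in the 32 × 128 map frame, so they must be mapped back to image coordinates by inverting Eqs. (1) and (2).

```python
import cv2
import numpy as np

def global_map_to_polygon(global_map, score_thresh=0.5):
    """Sketch: turn a predicted global text instance map (values in [0, 1]) into a
    polygon by thresholding and taking the largest contour."""
    binary = (global_map >= score_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)
    return largest.reshape(-1, 2)  # (num_points, 2) polygon vertices in map coordinates
```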
Pixel Voting We decode the predicted character maps into character sequences with our proposed pixel voting algorithm. We first binarize the background map, whose values range from 0 to 255, with a threshold of 192. Then we obtain all character regions according to the connected regions in the binarized map. For each region, we calculate its mean value on every character map. These values can be regarded as the character class probabilities of the region, and the character class with the largest mean value is assigned to the region. After that, we order all the characters from left to right, following the writing direction of English.
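A compact sketch of pixel voting is shown below. Since the background map excludes character regions (Sec. 3.1), we assume character pixels are those whose background response falls below the threshold; this polarity, the character ordering in CHARSET, and the use of the region's mean column index for left-to-right sorting are our assumptions.

```python
import cv2
import numpy as np

CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 36 classes; ordering is an assumption

def pixel_voting(background_map, char_maps, thresh=192):
    """Sketch of pixel voting.
    background_map: (H, W) uint8 map in [0, 255]; char_maps: (36, H, W) score maps."""
    # Assumed polarity: character pixels have low background response.
    char_mask = (background_map < thresh).astype(np.uint8)
    num_regions, labels = cv2.connectedComponents(char_mask)
    votes = []
    for region_id in range(1, num_regions):            # label 0 is the non-character area
        region = labels == region_id
        mean_scores = [char_maps[c][region].mean() for c in range(len(CHARSET))]
        best = int(np.argmax(mean_scores))
        x_center = np.nonzero(region)[1].mean()        # mean column index, for ordering
        votes.append((x_center, CHARSET[best]))
    # Read the regions from left to right.
    return "".join(ch for _, ch in sorted(votes))
```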
Weighted Edit Distance Edit distance can be used to find the best-matched word for a predicted sequence given a lexicon. However, there may be multiple words matching with the minimal edit distance at the same time, and the algorithm cannot decide which one is the best. The main reason for this issue is that all operations (deletion, insertion, replacement) in the original edit distance algorithm have the same cost, which is not reasonable in practice.
Inspired by [51], we propose a weighted edit distance algorithm. As shown in Fig. 5, different from the standard edit distance, which assigns the same cost to different operations, the costs of our proposed weighted edit distance depend on the character probability $p^c_{index}$ yielded by the pixel voting. Mathematically, the weighted edit distance between two strings a and b, whose lengths are |a| and |b| respectively, can be described as $D_{a,b}(|a|,|b|)$, where $$ D_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j)=0,\\ \min \begin{cases} D_{a,b}(i-1, j) + C_d \\ D_{a,b}(i, j-1) + C_i \\ D_{a,b}(i-1, j-1) + C_r \times 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise}, \end{cases} \qquad (8)$$ where $1_{(a_i \neq b_j)}$ is the indicator function, equal to 0 when $a_i = b_j$ and to 1 otherwise; $D_{a,b}(i,j)$ is the distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; $C_d$, $C_i$, and $C_r$ are the deletion, insertion, and replacement costs, respectively. In contrast, these costs are all set to 1 in the standard edit distance.
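The dynamic program of Eq. (8) can be sketched as follows. How the costs $C_d$, $C_i$, and $C_r$ are derived from $p^c_{index}$ is not reproduced here, so the cost-function signatures are placeholders; with unit costs the routine reduces to the standard edit distance.

```python
def weighted_edit_distance(a, b, cost_delete, cost_insert, cost_replace):
    """Sketch of the weighted edit distance (Eq. 8). The cost functions receive the
    position in the predicted string `a` and the characters involved, and should
    return costs derived from the pixel-voting probabilities; their exact form is
    an assumption, not taken from the paper."""
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # base case: max(i, j) when min(i, j) = 0
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace = 0.0 if a[i - 1] == b[j - 1] else cost_replace(i - 1, a[i - 1], b[j - 1])
            D[i][j] = min(
                D[i - 1][j] + cost_delete(i - 1, a[i - 1]),   # delete a_i
                D[i][j - 1] + cost_insert(i - 1, b[j - 1]),   # insert b_j
                D[i - 1][j - 1] + replace,                    # replace a_i with b_j
            )
    return D[n][m]

# With unit costs this reduces to the standard edit distance:
unit = lambda *args: 1.0
print(weighted_edit_distance("kitten", "sitting", unit, unit, unit))  # 3.0
```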