Document Visual Question Answering System — A Serviceable Case Study

Yash Dixit
Feb 16, 2021

This case study builds a system that answers questions asked about documents. Such a system has applications in banking, finance, supply chain and practically every other domain.

1. Problem Statement —

There has been a lot of research to date on extracting useful information from document images. These reading systems not only extract and interpret the textual content of document images (handwritten, typewritten or printed), but also exploit numerous other visual cues, including layout (page structure, forms, tables), non-textual elements (marks, tick boxes, separators, diagrams) and style (font, colours, highlighting), to mention just a few.

There are also many Question Answering systems that work on plain text using NLP techniques, as well as Scene Text Question Answering systems. These approaches have either focused on specific document elements or on specific collections such as book covers or number plates.

However, there hasn’t been much innovation with respect to “Document Visual Question Answering” which focuses on a specific type of Visual Question Answering task, where visually understanding the information on a document image is necessary in order to provide an answer. For example,

Sample document from DocVQA.

The questions asked about the above document could be:

What is the issue at the top of the pyramid? — Retailer calls/other issues.

Which is the least critical issue for live rep support? — Retailer calls/other issues.

Which is the most critical issue for live rep support? — Product quality/liability issues.

Thus we, the participants, are asked to create a robust model that returns the right answer to a question asked about a given document. The model should be able to understand the layout of the document in order to extract the right answer from the requested document.

2. Real World Objectives and Constraints -

  1. Predict the answers to the questions based only on the document images and/or the OCR outputs of those images.
  2. The answers should be returned in as little time as possible.
  3. Interpretability is important, since a question asked about a specific document should return the right answer from that document, and not an answer present in another document.
  4. We aim for the best possible ANLS score.
  5. Wrong answers predicted for the right document should be penalised.

3. Data Collection and Details -

  1. The data-set is publicly available on the RRC website.
  2. The data-set consists of 12,767 document images of varied types and content, over which 50,000 questions and answers are defined. The questions are categorised based on their reasoning requirements, such as ‘What are’, ‘Who is’, etc.
  3. The data-set has been split randomly by the organisers in an 80–10–10 ratio to train, validation and test splits. The train split therefore has 39,463 questions and 10,194 images, the validation split has 5,349 questions and 1,286 images and the test split has 5,188 questions and 1,287 images.
  4. Each split comes with 2 folders and a .json file.
  5. The ‘documents’ folder contains all the document images. The ‘ocr_results’ folder contains the ground-truth annotations of those document images in .json format (the OCR results, to be precise).

The .json file which comes with the data-set has the following format (each field is followed by its explanation):

{
  "dataset_name": "docvqa",      The name of the data-set; always "docvqa"
  "dataset_split": "train",      The subset (either "train" or "test")
  "dataset_version": "0.1",      The version of the data-set, a string in major.minor format
  "data": [{…}]
}

The ‘data’ element is a list of document entries with the following structure:

{
  "questionId": 52212,                        A unique ID number for the question
  "question": "Whose signature is given?",    The natural-language question string
  "image": "documents/txpn0095_1.png",        The image file of the document page the question is defined on (provided in the documents/ folder)
  "docId": 1968,                              A unique ID number for the document
  "ucsf_document_id": "txpn0095",             The UCSF document id number
  "ucsf_document_page_no": "1",               The page number within the UCSF document that is used here
  "answers": ["Edward R. Shannon", "Edward Shannon"],    A list of correct answers provided by annotators
  "data_split": "train"                       The data-set split this question pertains to
}

4. Machine Learning problem -

Objective : Predict the answers of the questions based on the OCR outputs and the document images.

Metric : The evaluation metric used for this case study is Average Normalised Levenshtein Similarity (ANLS). ANLS smoothly captures OCR mistakes, applying only a slight penalisation when the intended response is correct but badly recognised. It also makes use of a threshold ‘τ’: the metric outputs the similarity score if that score is equal to or bigger than ‘τ’, and 0 otherwise. The point of this threshold is to distinguish between an answer that has been correctly selected but not properly recognised, and an answer that is simply the wrong text selected from the document. More formally, the ANLS between the network's output and the ground truth answers is given by the equation below, where N is the total number of questions, M is the total number of ground truth answers per question, aᵢⱼ are the ground truth answers with i = {0, …, N} and j = {0, …, M}, and oᵩᵢ is the network's answer for the iᵗʰ question qᵢ. The metric is not case sensitive, but it is space sensitive.

Formula of ANLS score
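Since the formula image is not reproduced here, the ANLS definition can be reconstructed from the description above (in LaTeX notation):

```latex
\mathrm{ANLS} = \frac{1}{N} \sum_{i=0}^{N} \max_{j} \, s(a_{ij}, o_{q_i}),
\qquad
s(a_{ij}, o_{q_i}) =
\begin{cases}
1 - \mathrm{NL}(a_{ij}, o_{q_i}) & \text{if } \mathrm{NL}(a_{ij}, o_{q_i}) < \tau \\
0 & \text{if } \mathrm{NL}(a_{ij}, o_{q_i}) \ge \tau
\end{cases}
```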

Here NL(aᵢⱼ, oᵩᵢ) represents the Normalised Levenshtein distance between the strings aᵢⱼ and oᵩᵢ, which takes values between 0 and 1. Since the score is computed as 1 − NL(aᵢⱼ, oᵩᵢ), the metric is called Normalised Levenshtein “Similarity” rather than Normalised Levenshtein “Distance”: the NL similarity equals 1 minus the NL distance. The Levenshtein distance between two strings a, b (of length |a| and |b| respectively) is given by lev(a, b), where

Formula of Levenshtein Distance
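As the formula image is not reproduced here either, the standard recursive definition matching the description below is:

```latex
\mathrm{lev}(a, b) =
\begin{cases}
|a| & \text{if } |b| = 0 \\
|b| & \text{if } |a| = 0 \\
\mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b)) & \text{if } a[0] = b[0] \\
1 + \min
\begin{cases}
\mathrm{lev}(\mathrm{tail}(a), b) \\
\mathrm{lev}(a, \mathrm{tail}(b)) \\
\mathrm{lev}(\mathrm{tail}(a), \mathrm{tail}(b))
\end{cases} & \text{otherwise}
\end{cases}
```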

where the tail of some string x is a string of all but the first character of x, and x[n] is the nᵗʰ character of the string x, starting with character 0. Note that the first element in the minimum corresponds to deletion (from a to b), the second to insertion and the third to replacement.

We then define a threshold τ (usually 0.5) and return a score of 0 whenever the NL value is larger than τ. The intuition behind the threshold is that if an output's normalised edit distance to an answer is more than τ, this is most likely because the wrong text instance was returned from the document image, and not because of recognition errors. Otherwise, the metric has a smooth response that gracefully captures errors both in providing good answers and in recognising the right document text. All methods submitted as part of the competition are evaluated automatically using this protocol on the RRC portal as well.

5. Exploratory Data Analysis -

5.1. Images —

The width of the images ranges from 200px to 7219px and the height ranges from 206px to 9723px.

Now, I had tried to figure out whether there was a relationship between the widths and heights of the images.

GG Plot of Widths vs. Heights of Images

This plot shows that many images had widths below 2000px and heights below 2500px.

5.2. OCR —

After that, I wanted to check whether there were any relationships between the widths and heights of the images with respect to the length of the OCR texts.

GG plot of Length of OCR text vs. Image Width.
GG plot of Length of OCR text vs. Image Height.

From this I concluded that the shorter the OCR text is, the smaller the image width and height tend to be.

5.3. Questions —

Let’s see which questions had the highest frequencies.

Top 50 Questions based on Frequencies

After that I had done an analysis on what the average image widths and heights were for these questions with the highest frequencies.

Average Widths and Heights of the Top 50 Questions. Format — Question : Width, Height

I couldn't gain any useful insights here, since the image widths and heights fell in a similar range for most of the questions. So I dug deeper into the width and height analysis for the top questions and computed the aspect ratios of the images.

There are various Aspect Ratios with respect to images.

Some of which are -

  1. 1:1 = 1/1 = 1
  2. 4:3 = 4/3 = 1.33
  3. 16:9 = 16/9 = 1.78

Questions of Images having average aspect ratio more than 4:3

Thus there was only one question which had an average aspect ratio greater than 4:3.

Questions with Images having average aspect ratio close to 1:1

The questions mentioned in the above image had average aspect ratios close to 1:1. This in turn meant that the images were a bit square.

Questions with Images having average aspect ratios of 4:3

Most of the questions asked on the documents related to letters had an average aspect ratio of 4:3.

Finally, I looked at which words appeared most frequently in the questions.

WordCloud of the words in all the questions in the Train Data

This is a WordCloud of the words in the questions, where a larger font indicates a more frequent word.

5.4. Answers —

Now let’s see what the lengths of the answers were and how the lengths of the answers varied in the entire data-set.

Bar plot of the frequency of number of words in Answers

It turned out that most answers contained fewer than 6 words.

Later on, I did the same WordCloud analysis for the Answers too.

Again, a larger font indicates a more frequent word.

WordCloud of the words in all the answers in the Train Data

5.5. Questions vs. Answers —

Further on, I decided to look at the average number of words in the answers for the top 50 questions discussed before.

Average Number of words of answers with respect to the top 50 questions

Thus, the answers to the top questions contained around 3 words on average. As expected, questions about page numbers or other numeric fields tended to have one-word answers.

6. Featurization

In this section, I will discuss the basic initial featurization I performed, which made the model-specific featurization during modelling easier.

I constructed a data-frame which had the columns :

Image — Path to the document image in the respective ‘documents’ folder.

OCR — Path to the corresponding OCR output (.json) in the respective ‘ocr_results’ folder.

Question — The question asked about the respective image.

Question_Id — The ID of the question, used for the final submission on the RRC website.

Answer (for the train and validation sets only) — The answer used as the target; for this case study, I considered the first answer in the answer list.

Answer_list (for the train and validation sets only) — The list of correct answers for the given image, OCR output and question.

For creating the data-frames of the respective data-sets, I used 2 functions: check_data and prepare_df.

The check_data function simply returns a boolean indicating whether all the column lists have equal lengths.

Code : check_data Function

It simply takes the column lists and, based on the type of the data-set, checks whether they are all of equal length.
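Since the original code screenshot is not reproduced here, the following is a minimal sketch of what check_data could look like; the parameter names and the test-split handling are assumptions based on the description above.

```python
def check_data(images, ocrs, questions, question_ids,
               answers=None, answer_lists=None, dataset_type='train'):
    """Return True if every column list has the same length."""
    lists = [images, ocrs, questions, question_ids]
    # the test split has no ground-truth answers, so those lists are skipped
    if dataset_type != 'test':
        lists += [answers, answer_lists]
    return len({len(l) for l in lists}) == 1
```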

Now, for creating those lists for the columns, the prepare_df function was used.

Code : prepare_df Function

prepare_df is a simple function that takes all the questions, answers, question IDs, OCR links and image links from the .json file and appends them to separate lists. It then checks whether the lists have the same length using the check_data function, and finally creates and returns the data-frame with those lists as columns.
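Again, the screenshot is not reproduced here; this is a rough sketch of prepare_df under the assumptions that the OCR file shares the image's file name (stored under ocr_results/) and that check_data from above is available:

```python
import json
import pandas as pd

def prepare_df(json_path, dataset_type='train'):
    """Build a data-frame of image links, OCR links, questions, IDs and answers."""
    with open(json_path) as f:
        data = json.load(f)['data']

    images, ocrs, questions, question_ids = [], [], [], []
    answers, answer_lists = [], []
    for entry in data:
        images.append(entry['image'])
        # assumed naming convention: documents/xxx.png -> ocr_results/xxx.json
        ocrs.append(entry['image'].replace('documents', 'ocr_results')
                                  .replace('.png', '.json'))
        questions.append(entry['question'])
        question_ids.append(entry['questionId'])
        if dataset_type != 'test':
            answers.append(entry['answers'][0])      # first annotated answer
            answer_lists.append(entry['answers'])

    assert check_data(images, ocrs, questions, question_ids,
                      answers, answer_lists, dataset_type)

    df = pd.DataFrame({'Image': images, 'OCR': ocrs,
                       'Question': questions, 'Question_Id': question_ids})
    if dataset_type != 'test':
        df['Answer'] = answers
        df['Answer_list'] = answer_lists
    return df
```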

7. Modelling -

Since the data-set is fairly large and the interpretability of the predicted answers was important, I decided to opt for a sequence-to-sequence (Seq2Seq) architecture with attention. It has a lightweight structure, and I further opted for GRU layers instead of LSTM ones, which helped keep training times down.

Since this is an unsupervised learning task, I used FastText embeddings for the words of the questions, the answers and the OCR. FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. I used the pre-trained model directly from the FastText website (download link — https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip; the archive contains 2 files, with .bin and .vec extensions respectively).

7.1. Baseline Model (Model based only on Image and the Questions)

7.1.1. Introduction

The baseline model had a simple structure wherein only the images and the questions were used for predicting the answers.

7.1.2. Featurization

Initially, I loaded the data-frames of the respective data-sets. Then I carried out text pre-processing on the textual features, which for this baseline model are the questions and the answers.

For pre-processing the questions and answers, I added spaces between the textual/numerical content and the punctuation in every question and answer. I then added the <start> and <end> tokens before and after every question and answer. For this, I used the preprocess_qa function.

Code : preprocess_qa Function

Using this function, I created two lists of pre-processed questions and answers for each data-set split (train, validation and test).
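A minimal sketch of such a pre-processing function (the exact implementation is in the repository; the lower-casing here is an extra assumption on my part, in line with ANLS being case insensitive):

```python
import re

def preprocess_qa(text):
    """Pad punctuation with spaces and wrap the text with <start>/<end> tokens."""
    text = text.lower().strip()
    # put a space between words/numbers and punctuation, e.g. "time?" -> "time ?"
    text = re.sub(r"([?.!,:;()])", r" \1 ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return '<start> ' + text + ' <end>'
```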

Since this is an unsupervised learning task, I had to create tokens for all of the texts present in the train, validation and test sets for prediction. I opted for this approach because predicting the word embeddings with the fastText model inside the encoder every time would increase the training time by a huge margin. So, to make training more efficient, I also created an embedding matrix for those tokens, which additionally made the code easier to write and interpret. This further led to adding an Embedding layer in both the Encoder and the Decoder.

Moving on to the tokenizing part, I included all of the text data present in the train, validation and test sets and returned a single universal tokenizer (the variable tokenizer returned in the code below).

Code : tokenize Function

In that function, I appended all the texts into a single list ‘tr_inp_q’ and fitted the tokenizer on all of them. After that, I tokenized the train and validation questions and answers, padded them and returned them.
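Since the screenshot is not reproduced here, this is a sketch of what tokenize might look like, assuming the Keras Tokenizer (the filters and oov settings are assumptions):

```python
import tensorflow as tf

def tokenize(train_q, train_a, val_q, val_a, test_q):
    # one universal tokenizer fitted on every split so that test-time words are known
    tr_inp_q = train_q + train_a + val_q + val_a + test_q
    tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='', oov_token='<unk>')
    tokenizer.fit_on_texts(tr_inp_q)

    def to_padded(texts):
        seqs = tokenizer.texts_to_sequences(texts)
        return tf.keras.preprocessing.sequence.pad_sequences(seqs, padding='post')

    return (to_padded(train_q), to_padded(train_a),
            to_padded(val_q), to_padded(val_a), tokenizer)
```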

As stated above, I then created the embedding matrix by loading the fast-text model and then predicting the embeddings.

Code : Creating the Embedding Matrix

The fasttext.load_model() function is part of the fasttext library, which can be installed directly using pip; see the fastText documentation for more information.
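Putting the two pieces together, creating the embedding matrix could look roughly like this (the file name and the zero row for the padding index are assumptions; tokenizer is the universal tokenizer from above):

```python
import numpy as np
import fasttext

# load the pre-trained subword model downloaded from the fastText website
ft_model = fasttext.load_model('crawl-300d-2M-subword.bin')

embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1            # +1 for the padding index 0
embedding_matrix = np.zeros((vocab_size, embedding_dim))

for word, idx in tokenizer.word_index.items():
    # subword information lets fastText return a vector even for unseen words
    embedding_matrix[idx] = ft_model.get_word_vector(word)
```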

Since the text features were ready, it was time to pre-process the images.

I tried out various models to get the best scores and the lowest losses, among them EfficientNet, MobileNetV3Large and MobileNetV3Small. The best score was achieved with the InceptionV3 model and pretrained weights. Although the MobileNetV3Small model had the lowest training time, it didn't perform that well; InceptionV3 was about average in speed, but its results were the best. I imported the models and pre-trained weights directly from the tensorflow.keras.applications package.

I followed the same pre-processing structure as in the image captioning tutorial by TensorFlow, and the features were extracted in the same way as described over here.
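For reference, here is a condensed sketch of that feature-extraction step, following the TensorFlow image-captioning tutorial (the batch size, the .npy caching scheme and the train_df variable are assumptions):

```python
import numpy as np
import tensorflow as tf

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.io.decode_png(img, channels=3)
    img = tf.image.resize(img, (299, 299))                 # InceptionV3 input size
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path

# InceptionV3 without its classification head; the last convolutional map is used
image_model = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(image_model.input, image_model.layers[-1].output)

# extract and cache the (64, 2048) features as .npy files, one per document image
image_ds = (tf.data.Dataset.from_tensor_slices(sorted(set(train_df['Image'])))
            .map(load_image, num_parallel_calls=tf.data.AUTOTUNE).batch(16))
for batch, paths in image_ds:
    features = feature_extractor(batch)                                  # (b, 8, 8, 2048)
    features = tf.reshape(features, (features.shape[0], -1, features.shape[3]))
    for feat, path in zip(features, paths):
        np.save(path.numpy().decode('utf-8'), feat.numpy())              # saved as <image>.npy
```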

Moving on, I created the TensorFlow data-set in the same way as in the image captioning tutorial on TensorFlow's website.

Code : Creating the tf.Dataset

First, I merged all the image feature links (paths to the features stored as numpy files), the tokenized questions (input features) and the tokenized answers (target features). I then mapped map_func() over the data-set to load the numpy files directly into RAM and obtain the image features.
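A sketch of that step (the variable names, buffer size and batch size are assumptions; the image features are the .npy files cached earlier):

```python
import numpy as np
import tensorflow as tf

def map_func(img_path, question, answer):
    # load the cached InceptionV3 features straight from disk into RAM
    img_features = np.load(img_path.decode('utf-8') + '.npy')
    return img_features, question, answer

dataset = tf.data.Dataset.from_tensor_slices((image_links, question_tokens, answer_tokens))
dataset = dataset.map(
    lambda img, q, a: tf.numpy_function(map_func, [img, q, a],
                                        [tf.float32, tf.int32, tf.int32]),
    num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.shuffle(1000).batch(64).prefetch(tf.data.AUTOTUNE)
```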

The data-set was now ready for training.

7.1.3. Modelling

I did not use the traditional Sequential model approach for modelling. Using a custom training loop makes it possible to explore and modify many of the things that go into training the model, which cannot be done with the traditional Sequential API.

The Encoder consisted of Encoding the Image and Question features.

Code : Encoder Class

Basically, I passed the image features through a Dense layer followed by a ReLU activation. For the questions, I passed the tokenized inputs through the Embedding layer, which gave the embedding outputs corresponding to the respective tokens. These outputs were then passed through a GRU layer, which returned the question output and the hidden state corresponding to that output. I initialised the question GRU's hidden state with zeros every time, since each question is unique with respect to its image and its answer.
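A rough sketch of such an encoder; the layer sizes, the constant-initialised embedding layer and other details are assumptions rather than the exact original code:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, embedding_matrix, units=512, img_dim=256):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.units = units
        self.img_fc = tf.keras.layers.Dense(img_dim, activation='relu')   # image features -> ReLU
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, emb_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, img_features, question_tokens):
        img_enc = self.img_fc(img_features)                     # (batch, 64, img_dim)
        q_emb = self.embedding(question_tokens)                 # (batch, q_len, 300)
        # the question GRU always starts from a zero hidden state
        zeros = tf.zeros((tf.shape(question_tokens)[0], self.units))
        q_out, q_state = self.gru(q_emb, initial_state=zeros)
        return img_enc, q_out, q_state
```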

Coming to Decoding the Answer, here is the decoder.

Code : Decoder Class

Here, the encoded outputs of the questions and images are iterated through this class at every time-step until the ‘<end>’ token of the answer is predicted. First, the encoded outputs are passed through their respective attention layers. The attention layers are similar to those of the image captioning and neural machine translation tutorials of TensorFlow. The traditional Bahdanau attention approach was used for both the question and image attentions, since this is a baseline model. You can go through the image attention layer here and the question attention over here; please refer to the class BahdanauAttention on both web pages for further details.

The image hidden state passed to the decoder was unique for every image, question and answer. It was simply initialised with zeros, in the same fashion as the initial hidden state I passed to the question GRU layer.
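A simplified sketch of the attention layer and decoder, along the lines of the TensorFlow tutorials referenced above; the exact wiring of the contexts and hidden states in the original code may differ:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        hidden_t = tf.expand_dims(hidden, 1)                           # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(score, axis=1)                         # over locations/time-steps
        context = tf.reduce_sum(weights * features, axis=1)            # (batch, feat_dim)
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self, embedding_matrix, units=512):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.img_attention = BahdanauAttention(units)
        self.q_attention = BahdanauAttention(units)
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, emb_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)

    def call(self, prev_word, img_enc, q_out, img_hidden, q_hidden):
        # one context vector per modality, computed with Bahdanau attention
        img_ctx, _ = self.img_attention(img_enc, img_hidden)
        q_ctx, _ = self.q_attention(q_out, q_hidden)
        x = self.embedding(prev_word)                                  # (batch, 1, emb_dim)
        ctx = tf.expand_dims(tf.concat([img_ctx, q_ctx], axis=-1), 1)
        output, state = self.gru(tf.concat([ctx, x], axis=-1))
        return self.fc(tf.reshape(output, (-1, output.shape[2]))), state
```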

To sum everything up, this is the brief blueprint of the entire seq2seq architecture of the baseline model.

Architecture of the Baseline model

The image above should make everything much easier to understand, so I encourage the reader to go through it while checking out the code in my GitHub repository, linked below.

Moving on, I started training the model. This is the best architecture I could build to get the lowest loss and the best ANLS score for the baseline model, even though it over-fitted. The over-fitting occurred because the model had no mechanism for producing answers it had never seen; it only reproduced answers from the train set when run on the validation set. So it basically worked exactly like the image captioning model, except that it was something like an image-question captioning model (if you get what I mean).

7.1.4. ANLS Score

I got a Train ANLS Score of 0.06 and a Validation ANLS Score of 0.01. These scores were calculated with the white-space between punctuation and characters still present in both the actual answers and the predicted outputs.
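For reference, here is a straightforward reference implementation of the metric in plain Python (a sketch; the scoring code actually used for these numbers may differ):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                    # deletion
                               current[j - 1] + 1,                 # insertion
                               previous[j - 1] + (ca != cb)))      # substitution
        previous = current
    return previous[-1]

def anls(ground_truth_lists, predictions, tau=0.5):
    """ground_truth_lists: one list of accepted answers per question."""
    total = 0.0
    for answers, pred in zip(ground_truth_lists, predictions):
        best = 0.0
        for ans in answers:
            a, p = ans.lower().strip(), pred.lower().strip()       # not case sensitive
            nl = levenshtein(a, p) / max(len(a), len(p), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```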

7.1.5. Submission

After submitting the test results on the RRC website, I got a test ANLS of 0.0552.

Ranking table on RRC’s website

This was a decent score given the hardware limitations, and since the model had over-fitted on the validation data earlier, it made sense that the test score came out almost equal to the score obtained on the train data.

7.2. Second Model (Model based on Image, its OCR outputs and the Questions)

7.2.1. Introduction

This model took sequences of questions as well as the OCRs along with the images as inputs and predicted the answers.

7.2.2. Featurization

Initially, I loaded the data-frames of the respective data-sets. Then I carried out text pre-processing on the textual features, which this time are the questions, the answers and also the OCR.

For pre-processing the questions, OCRs and answers, I again added spaces between the textual/numerical content and the punctuation in every question, OCR and answer. I also added the <start> and <end> tokens before and after every question, OCR and answer. For this, I used the preprocess_qa and preprocess_ocr functions. preprocess_ocr is just an addition on top of preprocess_qa: it reads the OCR .json file line by line and, once all the lines are read, pre-processes the combined text using preprocess_qa.

Code : preprocess_ocr Function

preprocess_ocr uses the preprocess_qa function alongside, which is the same function described in section 7.1.2.
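A sketch of such a function; the JSON key names ('recognitionResults', 'lines', 'text') are assumed to follow the OCR files shipped with the data-set and may need adjusting:

```python
import json

def preprocess_ocr(ocr_path):
    """Read an OCR .json file line by line and pre-process the concatenated text."""
    with open(ocr_path) as f:
        ocr = json.load(f)
    lines = []
    for result in ocr['recognitionResults']:        # assumed key names
        for line in result['lines']:
            lines.append(line['text'])
    # once all the lines are read, run the usual question/answer pre-processing
    return preprocess_qa(' '.join(lines))
```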

Using these 2 functions, I created 3 lists of pre-processed questions, OCRs and answers for each data-set split (train, validation and test).

Code : create_dataset Function

One thing to notice in the create_dataset code is that I did not include the entire OCR sequence; I truncated each OCR to its first 220 tokens. I had to do this because of memory limitations. Since around 30,000 of the 39,464 data-points in the train OCR data had lengths of less than 221 tokens, the loss of OCR information wasn't that drastic.
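The truncation itself could look like this (a sketch; the max_ocr_tokens parameter and the handling of the <end> token are assumptions):

```python
def create_dataset(df, max_ocr_tokens=220, has_answers=True):
    """Pre-process questions, OCRs and (optionally) answers; truncate long OCRs."""
    questions = [preprocess_qa(q) for q in df['Question']]
    answers = [preprocess_qa(a) for a in df['Answer']] if has_answers else None
    ocrs = []
    for path in df['OCR']:
        tokens = preprocess_ocr(path).split()[:max_ocr_tokens]   # keep only the first 220 tokens
        if tokens[-1] != '<end>':                                # make sure the sequence still ends properly
            tokens.append('<end>')
        ocrs.append(' '.join(tokens))
    return questions, ocrs, answers
```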

After that, I had to create tokens for all of the texts present in the train, validation and test sets for prediction, in the same way as in section 7.1.2. I then created the embedding matrix for those tokens.

Moving on to the tokenizing part, I included all of the text data present in all the data-sets and returned a single universal tokenizer, just as I had done earlier. The only addition was that the OCR texts were tokenized and padded too.

Moving on I pre-processed the image features in the same way I had done before and also created the tf.Dataset for training and validation similarly.

7.2.3. Modelling

The training loop used here for this model has been slightly modified to give better results.

The Encoder consisted of Encoding the Image, Question and OCR features.

Code : Encoder Class

Basically, I passed the image features through a Dense layer followed by a ReLU activation. For the questions, I passed the tokenized inputs through the Embedding layer, which gave the embedding outputs corresponding to the respective tokens. These outputs were then passed through a GRU layer, which returned the question output and its corresponding hidden state. I initialised the question GRU's hidden state with zeros every time, since the questions were unique with respect to the images and the answers varied as well. After that, I passed the final hidden state of the question GRU as the initial hidden state of the OCR GRU. This helped me improve the ANLS score.
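A sketch of the second encoder; the interesting part is the last line of call(), where the question GRU's final state seeds the OCR GRU (the sizes and other details are assumptions):

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, embedding_matrix, units=512, img_dim=256):
        super().__init__()
        vocab_size, emb_dim = embedding_matrix.shape
        self.units = units
        self.img_fc = tf.keras.layers.Dense(img_dim, activation='relu')
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, emb_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))
        self.q_gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.ocr_gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)

    def call(self, img_features, question_tokens, ocr_tokens):
        img_enc = self.img_fc(img_features)
        zeros = tf.zeros((tf.shape(question_tokens)[0], self.units))
        q_out, q_state = self.q_gru(self.embedding(question_tokens), initial_state=zeros)
        # the question GRU's final hidden state becomes the OCR GRU's initial state
        ocr_out, ocr_state = self.ocr_gru(self.embedding(ocr_tokens), initial_state=q_state)
        return img_enc, q_out, q_state, ocr_out, ocr_state
```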

Coming to Decoding the Answer, here is the decoder of the second model.

Code : Decoder Class

Here, the encoded outputs of the question, OCR and image are iterated through this class at every time-step until the ‘<end>’ token of the answer is predicted. First, the encoded outputs are passed through their respective attention layers. The attention layers are again similar to those of the image captioning and neural machine translation tutorials of TensorFlow, and the Bahdanau attention approach was used for the question, OCR and image attentions. You can go through the image attention layer here and the question attention over here; please refer to the class BahdanauAttention on both web pages for further details.

The only difference is that the scoring function was slightly tweaked. The scoring function used in the TensorFlow open-source tutorials uses ‘tanh’.

Bahdanau’s additive Scoring Function
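For reference, the additive (Bahdanau) scoring function used in those tutorials is the following, where h_j is an encoder state and s the decoder's hidden state:

```latex
\mathrm{score}(h_j, s) = v_a^{\top} \tanh\left(W_1 h_j + W_2 s\right)
```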

The problem with this scoring function was that it returned values that were both positive and negative floating-point numbers. This not only increased the processing time further down the line in the answer's GRU layer, but also did not help in getting a better ANLS score.

Since ReLU returns values that are either positive or zero, it gave me a much better training time, and the score also improved significantly. So the scoring function I used was tweaked simply by replacing tanh with ReLU.

Modified Scoring function for all Attention layers
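In other words, the only change is swapping tanh for ReLU:

```latex
\mathrm{score}(h_j, s) = v_a^{\top} \,\mathrm{ReLU}\left(W_1 h_j + W_2 s\right)
```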

So, basically, all of the attention layers had this ‘call’ function where this scoring function was used.

Code : call Function for all Attention layers
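A sketch of that call function with the ReLU-based scoring; it is identical to the Bahdanau layer shown earlier except for the activation:

```python
import tensorflow as tf

class ModifiedAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        hidden_t = tf.expand_dims(hidden, 1)
        # ReLU in place of tanh: the raw scores are now always >= 0
        score = self.V(tf.nn.relu(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * features, axis=1)
        return context, weights
```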

Now, since this problem is supposed to be something like categorical supervised learning, the first thing that comes to mind is using ‘softmax’ as the final output activation, right? But no! In my case study, using sigmoid as the last activation played a significant role. It not only reduced the training time by a huge margin, but also gave me a significant improvement in the ANLS scores. Of course, for this I had to disable from_logits in the loss function.
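In code, that change boils down to something like the following (a sketch; vocab_size is the tokenizer's vocabulary size, and the masking of padding tokens follows the TensorFlow tutorials):

```python
import tensorflow as tf

# final decoder projection: sigmoid instead of softmax over the vocabulary
decoder_fc = tf.keras.layers.Dense(vocab_size, activation='sigmoid')

# the outputs are no longer raw logits, so from_logits is disabled
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))    # ignore padding tokens
    loss = loss_object(real, pred)
    loss *= tf.cast(mask, dtype=loss.dtype)
    return tf.reduce_mean(loss)
```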

Now, as mentioned earlier, I made a lot of changes to the training loop for this model. First, I wrote a function for training the model over a given epoch range: I had 23GB of usable RAM and an Nvidia T4 Tensor Core GPU under Colab Pro's service, so I had to provide an epoch range to the function every time I loaded a model checkpoint. I also used the TensorFlow summary writer to write TensorBoard logs for this model as well as the baseline model. In addition, I shuffled the train data in every epoch; shuffling the data every time helped reduce the over-fitting by a huge margin.

The image hidden state passed to the decoder was unique for every image, question and answer; it was simply initialised with zeros, in the same fashion as the initial hidden state I passed to the question GRU layer. For the question and OCR attentions, I summed the encoder's question and OCR hidden states and passed the summed hidden state through the question and OCR attention layers. This too improved the training time and helped get rid of the over-fitting. At every time-step, I used the decoder's hidden state output in place of the question hidden state, which helped in predicting the outputs better.

To sum everything up this is the brief blueprint of the entire seq2seq architecture of the second model.

Architecture of the Second model

The image above should make everything much easier to understand, so I encourage the reader to go through it while checking out the code in the GitHub repository linked below.

Moving on, I started training the model. This is the best architecture I could build to get the lowest loss and the best ANLS score for this model. The model started over-fitting from the 6th epoch onwards, but the over-fitting wasn't as severe as in the baseline model. I also tried several cyclical (circular and triangular) learning-rate schedules, applied per batch as well as per epoch, but the model wasn't able to learn much more with them. This time around, the model predicted some new words taken from the OCR in the validation set, so it did work in an unsupervised way.

7.2.4. ANLS Score

I got a Train ANLS Score of 0.2727 and a Validation ANLS Score of 0.1292 from the best model. These scores were again calculated with the white-space between punctuation and characters still present in both the actual answers and the predicted outputs.

7.2.5. Submission

After submitting the test results on the RRC website, I got a test ANLS of 0.1081.

Ranking table on RRC’s website

On further analysis, I found that, surprisingly, the model retains the format the answer is supposed to have, based on the question and the OCR text, even when the predicted answer itself differs. For example, if the question was:

‘What is the time?’

the predicted answer in the validation set sometimes comes from the train set, but the prediction maintains the ‘HH:MM’ format, which is very convenient from a product standpoint. Answers that had to be read from tables, diagrams, handwritten text or pictures within the document image (e.g. a pie-chart or a photo of a person) were usually difficult for the model to figure out.

8. Final Submission -

8.1. Final verdict.

Comparing the performances of the models

A look at this table shows that the final model clearly performed better!

For submitting the final .json file on RRC website I used the Final Model.

8.2. Submission.

For submitting the final results, I appended all the predicted answers, along with their question IDs, into a .json file.

Code : Creating the .json file of the predicted outputs
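A sketch of that step; I am assuming the expected submission format is a flat list of {questionId, answer} records, and test_df / test_predictions stand for the test data-frame and the model's decoded answers:

```python
import json

results = [{'questionId': int(qid), 'answer': str(pred)}
           for qid, pred in zip(test_df['Question_Id'], test_predictions)]

with open('results.json', 'w') as f:
    json.dump(results, f)
```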

Later on, I uploaded the .json file to the RRC website and got an ANLS score of 0.1081, as mentioned before in Section 7.2.5.

9. Deployment -

I have included final.ipynb in my GitHub repository, which contains two functions. The first function (Function_1) returns the predictions for a single data-point or for multiple data-points. The second one (Function_2) returns the ANLS score along with the predictions.

You can also deploy it directly on any cloud using the files included in Deploy.zip. I have included the HTML files there as well, through which you can input single or multiple data-points and get the predictions along with the total execution time.

Video for Demonstrating the app on a local system

In the video above, I deployed the system on a local machine using Flask and demonstrated the multiple-inputs scenario. A single prediction usually takes around 1 second on a machine with 16GB of RAM and an Nvidia 1050Ti graphics card. In the video, many applications, windows and tabs were open, so the multiple predictions took a little more time. Even with many predictions (I tried 500 at once), it averages under 1 second per prediction, depending on the image size and the OCR length.

10. Summary -

Future Work:

  1. In the second model, I passed the question GRU's last hidden state as the OCR GRU's initial state. Instead of that, one could keep every hidden state for every word in the question and every hidden state for every word in the OCR, and take the dot product of every question hidden state with every OCR hidden state (see the sketch further below). This could make it much easier to locate the answer's position within the OCR. This idea has been explained in depth over here.

Model Architecture

Thus this was the model architecture which I had discussed above.

Attention Encoding

Here, you can see that all the question hidden states (H_Q) are multiplied (dot product) with the OCR hidden states (H_P), as in the sketch below.
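A toy sketch of that dot-product (affinity) step with made-up shapes, just to illustrate the idea:

```python
import tensorflow as tf

# toy shapes: batch of 2, a 10-word question, a 220-word OCR, 512 GRU units
H_Q = tf.random.normal((2, 10, 512))        # per-word question hidden states
H_P = tf.random.normal((2, 220, 512))       # per-word OCR hidden states

# affinity matrix: a dot product between every question word and every OCR word
affinity = tf.matmul(H_Q, H_P, transpose_b=True)       # (2, 10, 220)

# attending over the OCR positions for each question word
ocr_attention = tf.nn.softmax(affinity, axis=-1)
question_aware_ocr = tf.matmul(ocr_attention, H_P)      # (2, 10, 512)
```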

2. My current implementation relies only on text and textual-position information and includes visual information only partially. The idea discussed above is also geared more towards the textual information. The visual modality can be brought in by feeding in visual information about the text itself. The data-set provides the X and Y values of the bounding boxes, with respect to the image arrays, for individual words as well as for groups of words in the OCR. These values are present in the .json files of the OCR outputs corresponding to each image. In the data-set, the X and Y values are given as follows.

For every word or collection of word : (X1,Y1,X2,Y2)

where :

X1 represents the X value of the left top corner of the bounding box of the word present in the image.

Y1 represents the Y value of the left top corner of the bounding box of the word present in the image.

X2 represents the X value of the right bottom corner of the bounding box of the word present in the image.

Y2 represents the Y value of the right bottom corner of the bounding box of the word present in the image.

So accordingly, a 2D floating-point image array can be extracted for each word present in the respective document image. These 2D arrays can then be resized to a constant size, for example 9px (width) x 9px (height), and each 9x9 array can be flattened to get an array of size 81 per word. This captures the visual information of each OCR word in the respective image, and it would behave much like the embedding matrix I created with fastText for this case study: since every image has a unique OCR, the embeddings of the words present in the OCR would be these 1D flattened arrays of size 81 each. This idea would not only decrease the training time by a tremendous margin, but would also greatly increase the visual interpretability.
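A small sketch of how such a visual "word embedding" could be computed from a bounding box (the grayscale conversion is an extra assumption on my part, to keep the flattened vector at 81 values):

```python
import tensorflow as tf

def visual_word_embedding(image, box, size=(9, 9)):
    """Crop a word's bounding box, resize it to 9x9 and flatten it into an 81-d vector."""
    x1, y1, x2, y2 = box                        # (X1, Y1) top-left, (X2, Y2) bottom-right
    crop = image[int(y1):int(y2), int(x1):int(x2)]
    crop = tf.image.rgb_to_grayscale(crop)      # one channel keeps the vector at 9*9 = 81 values
    crop = tf.image.resize(crop, size)
    return tf.reshape(crop, [-1]).numpy()       # shape: (81,)
```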

Conclusion:

This was my second self-case study on Machine Learning and also my first ICDAR submission! Although the score was fairly low, I think it was still a decent result given the hardware limitations!

I got to learn tonnes of unsupervised techniques while working to improve the score during the machine-learning modelling. It always feels great to read about methodologies in books and blogs, but unless you implement things on your own and learn them from scratch, you won't get a good idea of how to solve such problems in practice.

This concludes my work. Thank you for reading and going through everything! :)

My code is open to all for download on Github.

Y‘all can also find and connect with me on LinkedIn and GitHub.

References:

  1. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1174/reports/2761153.pdf
  2. https://www.tensorflow.org/tutorials/text/image_captioning
