PseudoAJ: PseudoAJDataset#0: Telugu handwritten character dataset in MNIST format

Why? Seriously, why?

Often called the Italian of the East, Telugu is spoken by over 75 million people all over the world. So, when my professor asked me to pick a project to put deep neural networks into action, I chose to build a universal character recognition model as my topic. As a first step, I chose to build a convolutional neural network with TensorFlow to recognize Telugu characters, which I then wanted to extend to other languages too.

Data Collection

Now there is one problem.

As you might expect, there was no readily available dataset of Telugu handwritten characters. I did reach out to authors who have published papers in this area, and I got some help, but the data was in MATLAB format and it was too late for me to work with those data formats.

So, I made an Android app to draw and save the characters as images.

Dataset

Now that I had the dataset, I wanted to save the images in the same file formats as MNIST so that the dataset would work seamlessly with all the MNIST tutorials out there. The following are the links to the files:

t10k-images-idx3-ubyte

t10k-labels-idx1-ubyte

train-images-idx3-ubyte

train-labels-idx1-ubyte
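For anyone curious how images end up in files like these, here is a minimal sketch of writing the IDX layout that the MNIST files use: a 4-byte magic number, one big-endian 32-bit size per dimension, then the raw unsigned bytes. This is not the code I used to build the dataset. The function names (build_idx, write_images, write_labels) are hypothetical, and the sketch assumes the images are already grayscale, equally sized, and given as lists of rows of 0-255 integers.

```python
import struct


def build_idx(shape, payload):
    """Return IDX-formatted bytes for unsigned-byte data of the given shape."""
    expected = 1
    for size in shape:
        expected *= size
    if len(payload) != expected:
        raise ValueError("payload length does not match shape")
    # Magic number: two zero bytes, type code 0x08 (unsigned byte),
    # then the number of dimensions.
    header = struct.pack(">HBB", 0, 0x08, len(shape))
    # One big-endian unsigned 32-bit integer per dimension.
    header += struct.pack(">" + "I" * len(shape), *shape)
    return header + bytes(payload)


def write_images(path, images):
    """Write equally sized images (lists of rows of 0-255 ints) to `path`."""
    rows, cols = len(images[0]), len(images[0][0])
    pixels = bytearray()
    for image in images:
        for row in image:
            pixels.extend(row)
    with open(path, "wb") as f:
        f.write(build_idx((len(images), rows, cols), pixels))


def write_labels(path, labels):
    """Write a list of integer class labels (0-255) to `path`."""
    with open(path, "wb") as f:
        f.write(build_idx((len(labels),), bytes(labels)))
```

With this layout, an image file starts with the bytes 00 00 08 03 (three dimensions: count, rows, columns) and a label file with 00 00 08 01 (one dimension: count).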

The files have the same format and conventions as the MNIST dataset, except that this dataset is much smaller and still growing. It consists of 5281 training images and 1591 test images, covering 16 classes of characters as illustrated in the header image.
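Because the files follow the MNIST conventions, they can be read with any IDX parser. Below is a minimal standard-library sketch, not an official loader. The function names (parse_idx, read_idx, image_at) are hypothetical, and it assumes the files are uncompressed and hold unsigned-byte data, as the MNIST files do. The image dimensions are read from the header rather than assumed.

```python
import struct

# Type code for unsigned bytes in the IDX magic number (what MNIST files use).
IDX_UBYTE = 0x08


def parse_idx(raw):
    """Parse IDX-formatted bytes into (shape, payload)."""
    # Magic number: two zero bytes, a data type code, the number of dimensions.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != IDX_UBYTE:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    payload = raw[4 + 4 * ndim:]
    expected = 1
    for size in shape:
        expected *= size
    if len(payload) != expected:
        raise ValueError("payload size does not match header")
    return shape, payload


def read_idx(path):
    """Read an uncompressed IDX file from disk."""
    with open(path, "rb") as f:
        return parse_idx(f.read())


def image_at(shape, payload, index):
    """Return image `index` as a list of rows of pixel values (0-255)."""
    _, rows, cols = shape
    start = index * rows * cols
    return [
        list(payload[start + r * cols:start + (r + 1) * cols])
        for r in range(rows)
    ]
```

For example, read_idx("train-images-idx3-ubyte") should return a shape whose first entry is the number of training images, and list(payload) from read_idx("train-labels-idx1-ubyte") gives the class label of each image in the same order.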

To-do

1. Open-source the APK for data collection.

2. Add more data to make the dataset comparable to MNIST.

3. Open-source the TensorFlow model.

4. Open-source the Android app for Telugu character recognition with a TensorFlow backend.

Acknowledgements

Firstly, I thank Rahul Kavi for pitching in on this project and helping to obtain the datasets. Thanks to Dr. Nasrabadi for giving the thumbs up to this project, and thanks to everyone who contributed to the dataset.