The Gas Prices of America (GPA) Dataset


The GPA is a real-world benchmark image dataset for developing and evaluating machine learning and character/digit recognition algorithms. The GPA differs from other digit recognition datasets, such as the SVHN dataset, in that its images contain multiple multi-digit numbers. Consequently, the GPA dataset can serve as a benchmark at multiple levels of digit recognition difficulty.

As with the MNIST and SVHN datasets, the GPA comprises high-quality images (obtained from gas prices found in Google Street View images), organized to minimize data preprocessing and formatting. The digit recognition task presents a significantly harder, unsolved, real-world problem: the recognition of multiple multi-digit numbers in natural scene images.

Examples of easy and challenging images:




Description of GPA


The current GPA comprises a subset of 2,048 images.
Each image is 640×640 px, in either .jpg or .png format. Annotations for each image are available in the following formats:

Segmentation: (1) Sign-Level, (2) Price-Level, (3) Digit-Level, (4) Label-Level
Classification: (5) Single Price (Reg., Unl., Cash), (6) All Prices (no fractions), (7) All Prices (incl. fractions), (8) All Prices (incl. frac. & grade)
Subsets: 1024 Subset, 2048 Subset, Superset
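
As a first sanity check after downloading a subset, one might index the image files by format. The sketch below is a minimal illustration, not part of the GPA release: it assumes the images have been extracted into a local folder, and the function names and folder layout are hypothetical. It relies only on the .jpg/.png formats stated above.

```python
from collections import Counter
from pathlib import Path

# File extensions named in the dataset description (.jpg or .png).
IMAGE_EXTENSIONS = {".jpg", ".png"}


def index_images(root):
    """Return the sorted paths of all .jpg/.png files found under `root`.

    `root` is a hypothetical local folder holding an extracted GPA subset;
    the search is recursive and case-insensitive on the extension.
    """
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def count_by_format(paths):
    """Count images per lowercase file extension."""
    return Counter(p.suffix.lower() for p in paths)
```

For example, len(index_images("gpa_2048")) could be compared against the expected 2,048 images for the current subset, where "gpa_2048" is a placeholder folder name.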


Publications


The GPA was originally introduced in the proceedings of the 17th Conference on Computer and Robot Vision (CRV) 2020:

Gas Prices of America: The Machine-Augmented Crowd-Sourcing Era


[Paper] [Poster] [Video] [Github]

Abstract: Google Street View (GSV) comprises the largest collection of vehicle-based imagery of the natural environment. With high spatial resolution, GSV has been widely adopted to study the natural environment despite its relatively low temporal resolution (i.e. limited time-series imagery available at a given location). However, vehicular-based imagery is poised to grow dramatically with the prophesied circulation of fleets of highly instrumented autonomous vehicles (AVs), producing high spatio-temporal resolution imagery of urban environments. As with GSV, leveraging these data presents the opportunity to extract information about the lived environment, while their high temporal resolution enables the study and annotation of time-varying phenomena. For example, circulating AVs will often capture location-coded images of gas stations. With a suitable CV system, one could extract the advertised numerical gas prices and automatically update crowd-sourced applications, such as GasBuddy. To this end, we assemble and release the Gas Prices of America (GPA) dataset, a large-scale, benchmark dataset of advertised gas prices from GSV imagery across the 49 mainland United States of America. Comprising 2,048 high quality annotated images, the GPA dataset enables the development and evaluation of CV models for gas price extraction from complex urban scenes. More generally, this dataset provides a challenging benchmark against which CV models can be evaluated for multi-number, multi-digit recognition tasks in the wild. For the digit-level classification task, the YOLO digit detection model trained on the Street View House Numbers dataset performed comparably to a random classifier, highlighting the difficulty of this task. Conversely, for the fullsign segmentation task, transfer learning of a DeepLabV3 ResNet101 model achieved a test F1 performance of 0.7125, following 100 epochs.
Highly accurate models, when integrated with AV platforms, will represent the first opportunity to automatically update the traditionally human crowd-sourced GasBuddy dataset, heralding an era of machine-augmented crowd-sourcing. The dataset is available online at cu-bic.ca/gpa and at doi.org/10.5683/SP2/KQ6VNG. Accompanying code can be found at github.com/GreenCUBIC/Gas-Prices-of-America.
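
To make the F1 figure quoted in the abstract concrete, the following is a minimal sketch of a pixel-wise F1 score between a predicted and a ground-truth binary mask. It is an illustration only: the abstract does not state how the reported F1 was aggregated (per pixel, per image, or over the whole test set), so the single-image, per-pixel formulation, the nested-list mask representation, and the function name are all assumptions.

```python
def f1_score_binary(pred_mask, true_mask):
    """Pixel-wise F1 between two equally shaped binary masks.

    Masks are nested lists of 0/1 values (rows of pixels), where 1 marks
    the positive class (e.g. a sign pixel). F1 = 2*TP / (2*TP + FP + FN).
    """
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # predicted positive, actually positive
            elif p:
                fp += 1      # predicted positive, actually negative
            elif t:
                fn += 1      # predicted negative, actually positive
    denom = 2 * tp + fp + fn
    # Convention chosen here (an assumption): if neither mask contains a
    # positive pixel, the score is reported as 0.0.
    return 2 * tp / denom if denom else 0.0
```

With one true positive, one false positive, and one false negative, the score is 2 / (2 + 1 + 1) = 0.5; identical non-empty masks score 1.0.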

Contributors to the GPA



Kevin Dick


Kevin Dick is currently pursuing a PhD in biomedical engineering, specializing in data science and bioinformatics, as part of the Carleton University Biomedical Informatics Colaboratory (cuBIC) in Ottawa, Canada. His research interests include data science, machine learning, high performance computing, secondary use of autonomous vehicle data, and scientometrics.



François Charih


François Charih is currently a PhD student in Electrical and Computer Engineering in the Carleton University Biomedical Informatics Colaboratory (cuBIC). His research interests include bioinformatics, applied machine learning and software development for applications in personalized medicine. Other areas of expertise include cloud computing, web development and science outreach.



Jimmy Woo




James R. Green


Dr. Green is a full professor in the Department of Systems and Computer Engineering at Carleton University. His research focuses on the application of machine learning to challenges in biomedical informatics, particularly in the presence of class imbalance. Current research projects include the prediction of protein structure, function, and interaction; the use of supervised and semi-supervised machine learning for the identification of microRNA in unique species; unobtrusive and non-contact neonatal patient monitoring; developing ML for audiology; applying computer vision to autonomous vehicle imagery; and the acceleration of scientific computing using parallel computing.






Citation


If you use the GPA dataset in your work, please cite the following reference:

@data{SP2/KQ6VNG_2020,
author = {Dick, Kevin and Charih, François and Woo, Jimmy and Green, James R.},
publisher = {Scholars Portal Dataverse},
title = "{Sampled 2048 GPA Images with Annotated Regular Gas Price}",
UNF = {UNF:6:Sawptw7Psv/m8VVAsQ6m4w==},
year = {2020},
version = {V1},
doi = {10.5683/SP2/KQ6VNG},
url = {https://doi.org/10.5683/SP2/KQ6VNG}
}

Download the GPA


All files of the GPA dataset are available from the following Dataverse repository: GPA Dataset (https://doi.org/10.5683/SP2/KQ6VNG)

Segmentation: (1) Sign-Level, (2) Price-Level, (3) Digit-Level, (4) Label-Level
Classification: (5) Single Price (Reg., Unl., Cash), (6) All Prices (no fractions), (7) All Prices (incl. fractions), (8) All Prices (incl. frac. & grade)
1024 Subset: Download, Download
2048 Subset: Download
Superset