Numerical experiments, Tips, Tricks and Gotchas

Numerically speaking

Datasets for Data Mining, Machine Learning and Exploration


Reference datasets for tests, benchmarks, etc.


  1. Rdatasets is a collection of 758 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
  2. This page contains a list of datasets that were selected for the Data Mining and Exploration projects. Students can choose one of these datasets to work on, or can propose data of their own choice. At the bottom of the page, you will find some examples of datasets that we judged inappropriate for the projects.
    • Particle physics data set
    • Physiological data set
    • Brain-Computer Interface data set
    • Prediction of Gene/Protein Localization data set
    • Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin dataset
    • The 4 Universities dataset
    • Internet advertisements dataset
    • The Reuters-21578 text dataset
    • The charitable donations dataset
    • The caravan insurance data
    • The yeast S. cerevisiae gene expression vectors
    • The colon cancer data
    • The leukemia data set
    • The human splice site data
    • Volcanoes on Venus
    • Network intrusion data
    • The SuperCOSMOS Sky Survey objects catalogue
    • Less interesting datasets
    Datasets for Data Mining.
  3. SpatialKey Sample data.
    • Sample insurance portfolio
    • Real estate transactions
    • Sales transactions
    • Company Funding Records
    • Crime Records
  4. This data set used in the CoIL 2000 Challenge contains information on customers of an insurance company. The data consists of 86 variables and includes product usage data and socio-demographic data.
    Insurance Company Benchmark (COIL 2000) Data Set
  5. This data set is used in Brandon Rhodes's tutorial "Pandas From The Ground Up" (PyCon 2015). The PyCon 2015 Pandas tutorial materials use curl to generate the data. Windows users can install cURL for Windows.
  6. GroupLens.org datasets: GroupLens Datasets.
  7. Finance and economic data in the form you want; instant download, API or direct to your app: Quandl. Quandl unifies over 20 million financial and economic datasets from over 500 publishers on a single user-friendly platform.
  8. Datasets from the Deep learning website: Datasets. These datasets can be used for benchmarking deep learning algorithms.
  9. Several classic datasets have been used extensively in the statistical literature: Classic datasets.
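
Item 5 above relies on curl to fetch data files. Where curl is not available, a similar download step can be sketched with the Python standard library alone. This is only a minimal sketch, not the method used in the tutorial materials: the helper name fetch and its arguments are hypothetical, and the URL of any particular dataset is left to the caller.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch(url: str, dest: str) -> Path:
    """Download `url` to `dest`; do nothing if `dest` already exists.

    `fetch` is a hypothetical helper written for illustration only.
    """
    target = Path(dest)
    if target.exists():
        # Reuse the local copy so repeated runs do not download again.
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the response to disk instead of reading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Because an existing file is never overwritten, a stale local copy has to be deleted by hand before the data can be downloaded again.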

Data Science Central



© Nikolai Shokhirev, 2012-2017

email: nikolai(dot)shokhirev(at)gmail(dot)com