How we published a successful dataset on Kaggle

And how it helped spread our company's data efforts

I first visited the Kaggle website about a year ago, attracted by some of the basic machine learning problems (such as Titanic: Machine Learning from Disaster). I found out it was also a great place to work on other people's datasets, and to share your own data as well.
At the time I worked at a large corporation with very strict compliance rules; sharing any data would have been simply impossible. But the idea stuck with me. Some time later, I left that big corp and started working at a medium-sized startup called Olist. We are the largest department store in Brazilian marketplaces: our catalog has over 200k products, and more than 4,000 sellers sell their products through our platform.
There we had little enough bureaucracy to finally get that idea of publishing a dataset out of my head and into practice. I first brought the idea up within the company about five months ago, but it was still very immature at the time. Some questions were raised, such as:
  • What should we publish? Why?
  • Would anyone use our data? How?
  • And… what if a competitor uses our data?
Over the following months the idea evolved and all of those questions were answered. Then, after some development work, we were finally ready to share some data, and we published it on Kaggle about two months ago. The purpose of this article is to explain how we published the data and to guide you through the process. Maybe we will even encourage you or your company to publish some data as well.

What did we publish? Why?

We decided to publish a dataset of orders and their customer reviews, which allows tackling some classification problems. Our data science team had already built models to classify reviews, so we would love to see how other people approach the same problem (a rough sketch of such a baseline follows below).
To allow that we decided to release two datasets:
  • A classified (labeled) dataset with about 4k orders and reviews (only available in version 5 and earlier).
  • An unclassified dataset with 100k orders.
Why 100k? Well… it was kind of a magic number: bigger than most public datasets on Kaggle, and large enough to support in-depth analysis across different product categories.
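To make the classification problem concrete, here is a minimal sketch of the kind of baseline one might try on the reviews: predicting whether a review is positive from its text. This is not our production model, and the file and column names (olist_order_reviews_dataset.csv, review_comment_message, review_score) are assumptions based on the current Kaggle version of the dataset.

```python
# Minimal sketch of a review-classification baseline (not our production model).
# Assumes the review file and columns from the current Kaggle version of the dataset.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

reviews = pd.read_csv("olist_order_reviews_dataset.csv")
reviews = reviews.dropna(subset=["review_comment_message"])

# Treat it as a binary problem: good (4-5 stars) vs. bad (1-3 stars) reviews.
X = reviews["review_comment_message"]
y = (reviews["review_score"] >= 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# TF-IDF over word n-grams plus logistic regression is a reasonable first baseline
# for Portuguese review text before trying anything fancier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=5),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```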
The dataset has information on orders made at multiple marketplaces in Brazil from 2016 to 2018. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and, finally, the reviews written by customers. Later we also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.
Click here to check the latest version of the dataset. Our data is publicly available on Kaggle.

What data did we share?

First, we published a real dataset. By using our data one may draw real conclusions about Brazilian e-commerce. We wanted to give a complete sense of the kind of challenges our data team was facing, so we published as much information as possible. Our only constraints were to:
  • Not release our product catalog (such as titles, descriptions and photos)
  • Not release any customer information
  • Not release any partner information
With that in mind we created a dataset with more than 30 features. It allows exploring an order from multiple perspectives, such as order fulfillment, product categories, geolocation, customer satisfaction, and the list goes on. Below we show how the data is structured.
Our data schema. Multiple tables to allow studying an order from multiple perspectives.
You may notice that it reflects a real-life situation, where data is stored across multiple tables and sources. This structure is quite different from the average dataset published on Kaggle.
The dataset is well documented: all features are explained and enough context is given to allow everyone to understand the data. We created a discussion topic encouraging people to ask questions and then coded a starter kernel with some really basic analysis (a minimal sketch of that kind of analysis follows below). That’s all.
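For illustration, here is a minimal sketch of the kind of starter analysis that kernel contains: loading a few of the tables and joining them on their shared keys. The file and column names are assumptions based on the current Kaggle version of the dataset, not a copy of our actual kernel.

```python
# Minimal sketch of a starter analysis over the relational tables.
# File and column names assume the current Kaggle version of the dataset.
import pandas as pd

orders = pd.read_csv("olist_orders_dataset.csv", parse_dates=["order_purchase_timestamp"])
items = pd.read_csv("olist_order_items_dataset.csv")
reviews = pd.read_csv("olist_order_reviews_dataset.csv")

# Join the tables on their shared key (order_id) to view an order end to end.
df = (
    orders
    .merge(items, on="order_id", how="left")
    .merge(reviews[["order_id", "review_score"]], on="order_id", how="left")
)

# A couple of really basic questions a starter kernel might answer:
print(df["order_status"].value_counts())                    # order fulfillment
print(df.groupby("review_score")["price"].mean().round(2))  # price vs. satisfaction
print(df.set_index("order_purchase_timestamp")
        .resample("M")["order_id"].nunique())                # orders per month
```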

Why did we publish it?

A few reasons made us publish the data initially, but now that we have, we see even more. They are:
  1. Contributing to the data community with real, good-quality data. We have already seen it being used in college classes.
  2. Positioning Olist as a reference in the Brazilian data science community. After publishing the dataset we noticed other companies asking us for guidance.
  3. Attracting good candidates to our team. It is a lot easier to create empathy and explain what we do by sharing the data. We have already contacted some people who analyzed our data in public kernels.
  4. Seeing some problems from other people’s perspectives. Many people have contributed public kernels, and by reading them we are discovering different approaches to problems we are working on.
  5. Testing candidates. The first step of our selection process is now an open challenge with the dataset.
  6. Filling in the lack of public datasets with text in Portuguese, helping the NLP community in Brazil.
  7. Positioning the company as a domain expert. Having this dataset public works as a kind of certificate saying: trust us, we know what we are doing.

How did it redefine our success metrics?

We benchmarked against other public datasets on Kaggle to define what success would look like. And we were very modest: at first we thought that reaching the following numbers would allow us to call the initiative successful:
  • More than 50 upvotes in two months (we had 200)
  • More than 500 downloads (we had 3.4k)
  • More than 5k views (we had 24k)
  • More than 5 public kernels (we had 21)
Left: our data being used at a college data science class in Natal/RN (photo by Tyrone Damasceno). Right: Public kernels on Kaggle.
This also answered our second question: our data was being used!

What if a competitor uses our data?

That’s a risk. But if they really wanted to use it, they would find a way even if we didn’t make the data public. Aggregations of our data can already be found publicly on the stores where we sell. Some examples are:
  • Total sales in the past four months
  • Average review score in the past 30 days
Besides that, our entire catalog is publicly available on the stores; it’s just a matter of building a web crawler to fetch that data. Product descriptions, prices, photos, freight values and estimated delivery times are all available at the marketplaces where we sell our products.
Even if someone obtained our catalog by crawling the stores, we took care not to let anyone discover which product was sold: you only have an anonymized product_id, so it is not possible to link our dataset to any product catalog. The data has been anonymized, and references to companies and partners in the review texts have been replaced with the names of Game of Thrones great houses. We also sampled it in a way that does not allow anyone to draw conclusions about our current business size. With those precautions taken, there was no other reason to hold the data back. A rough sketch of this kind of anonymization is shown below.
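To illustrate the anonymization idea, here is a rough sketch of how identifiers can be replaced with opaque hashes and how company or partner names in review texts can be swapped for Game of Thrones house names. The helper functions and the salt are hypothetical; this is not our actual pipeline.

```python
# Rough illustration of the anonymization idea (hypothetical helpers, not our actual pipeline).
import hashlib
import re

SALT = "keep-this-secret"  # hypothetical salt, kept private so hashes cannot be recomputed
HOUSES = ["Stark", "Lannister", "Targaryen", "Baratheon", "Tyrell", "Martell", "Greyjoy"]

def anonymize_id(raw_id: str) -> str:
    """Replace a real identifier with an opaque but consistent hash."""
    return hashlib.sha256((SALT + raw_id).encode("utf-8")).hexdigest()[:32]

def mask_company_names(text: str, names: list[str]) -> str:
    """Replace any known company/partner name in a review with a Game of Thrones house."""
    for i, name in enumerate(names):
        house = HOUSES[i % len(HOUSES)]
        text = re.sub(re.escape(name), house, text, flags=re.IGNORECASE)
    return text

print(anonymize_id("SKU-12345"))
print(mask_company_names("Comprei na LojaExemplo e chegou rápido", ["LojaExemplo"]))
```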

What did we do wrong?

For the first release we thought it would be nice to remove complexity from the data, so we tried to join as much information as we could into just one table. Over time users started asking for new features, such as customer, product and seller IDs. We also decided to add more features that we thought were important.
We kept the original data structure (one main table) and added other tables with that additional information. Soon enough we had added a lot of new features, and the whole dataset was getting ugly: compound keys, names without any convention and other messy stuff.
We decided to fix that in the version 6 release. We split the whole dataset into separate tables, each one containing one aspect of an order. Now the dataset looks more complex than before, but the relationships between the data are cleaner.
If we could give one piece of advice on this matter, it would be to think, right from the beginning, of a data structure that supports growth and the addition of new features, because your data users will certainly ask for them.

A word of encouragement

We would love to see more datasets like ours made publicly available. So far we have seen only benefits from sharing the data, and it seems there won’t be any drawbacks.
Right now we are already thinking about the next data that will go public. How about you? Is there any data you would like to share?
Please contact us if you have any question or concern.
