Making sense of data licenses
Paul Gagnon
May 6

The License Generator found at montrealdatalicense.com would not have happened without the amazing contributions of Ike Saunders, Minh Dao, Matt Leus, Christian Jauvin and many others at Element AI.

Have you ever been in a situation where it’s unclear what can and can’t be done with data, even though it is readily available? We tackled this issue in a newly published paper, “Towards Standardization of Data Licenses: The Montreal Data License”, accepted at the AI for Social Good Workshop at ICLR 2019 in New Orleans. In it, we explain the dynamics surrounding data licensing, homing in on conceptual issues and uncertainties. The paper also proposes a new family of licenses: the Montreal Data License.

We are a team of lawyers and scientists: Misha Benjamin, Paul Gagnon, Negar Rostamzadeh, Chris Pal, Yoshua Bengio and Alex Shee. We decided to focus on resolving some of the challenges inherent to data licensing by examining the legal terms and conditions under which data is made available. The idea for this article was born out of a routine question that is part of our day-to-day at Element AI. A typical conversation on this topic often goes like this:

Researcher: Hey! I found this really interesting dataset and I was wondering if I could use it. Can you double-check the license terms to make sure it’s usable?

Lawyer: Oh, awesome! Yeah, let me check for you!

Five minutes later...

Lawyer: So, about that dataset. It’s really unclear what can or can’t be done with it. I hate to do this, but my answer is: it depends on what you mean by “use”!

How data is licensed is often ambiguous. It is difficult to determine the intentions of the individuals and companies making data available because the language used is vague and open to interpretation. This places a particular burden on the more scrupulous startups and academics who want to ensure their use of data respects the conditions and intent of those who made it available.

By comparison, the terms for using open-source software are fairly standardized. Most open-source contributors use licenses such as the MIT, BSD, GPL or Apache licenses. At a glance, these names already convey a good idea of the conditions for using the software, because open-source communities have relied on these licenses for decades. We know what these licenses mean because they’re clear, standardized and commonly used.

Our paper delves into the license language in use today and highlights its ambiguities. It also examines some of the more commonly used datasets to illustrate these problems and to create a framework for better licensing language in the future.

Simply identifying the problem isn’t enough. To address it, we created an automatic license generator, available at montrealdatalicense.com. The License Generator is a simple tool: people who want to make data available click through a short Q&A, and the tool generates a license that matches the permissions they want to give.

In the world of AI and machine learning, context is everything. The Montreal Data License attempts to acknowledge the different contexts in which data can be used. The clearer and more technically precise language used in the Montreal Data License can help someone making data available decide what can and cannot be done with it. Instead of focusing on how use is qualified as academic, commercial, non-commercial, research and so forth, the Montreal Data License expands the notion of “use” itself into clearer, modular actions that can be taken, such as research use, internal use of models and commercialization.
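To make the idea of modular permissions concrete, here is a minimal sketch in Python of how answers to a questionnaire could map to a per-action summary. It is purely illustrative: the class name, field names and output wording are our own assumptions for this example, and they do not reproduce the actual License Generator or the text of the Montreal Data License. Only the three example actions come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataUsePermissions:
    """Hypothetical questionnaire answers; field names are illustrative only."""
    research_use: bool
    internal_model_use: bool
    commercialization: bool


def summarize(permissions: DataUsePermissions) -> list[str]:
    """Return one plain-language line per action, stating whether it is permitted."""
    # Human-readable labels for each hypothetical field.
    labels = {
        "research_use": "research use",
        "internal_model_use": "internal use of models",
        "commercialization": "commercialization",
    }
    return [
        f"{label}: {'permitted' if getattr(permissions, field) else 'not permitted'}"
        for field, label in labels.items()
    ]


if __name__ == "__main__":
    # Example: a data owner who allows research and internal use, but not commercialization.
    answers = DataUsePermissions(
        research_use=True,
        internal_model_use=True,
        commercialization=False,
    )
    for line in summarize(answers):
        print(line)
```

The point of the sketch is that each action is answered separately, so a reader of the resulting license does not have to guess what a single word like “use” was meant to cover.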

The authors welcome feedback on the paper, the license generator and the vocabulary. The Montreal Data License is a framework that aims to be inclusive, modular and standard-setting.