Legal data – decentralised, disorganised, disrupted?


Challenges and opportunities in legal data analytics

The legal industry sits on top of mountains of data, collected in locations ranging from in-house contract management tools to government databases and platforms run by commercial providers. According to a House of Commons briefing paper, 1.47 million criminal cases were disposed of in Magistrates’ Courts in 2018, and 1.29 million judgments were made on civil claims in County Courts. It would be hard to estimate the number of contracts currently in existence, but according to Factor, a legal technology provider, those in need of repapering because of LIBOR’s upcoming discontinuation alone number in the low millions.

The sheer quantity of data is a good sign – there are many valuable insights which could be derived from it, and carefully engineered legal data analytics can make legal services cheaper, quicker and more accurate. However, legal data is still highly unstructured and decentralised compared to data in other sectors, which raises a lot of challenges for legal analytics.

An area which is potentially ripe for automation, but suffers acutely from the problem of unstructured and decentralised data, is litigation analytics. Litigation analytics is a broad term whose meaning still largely depends on the context and the user, but broadly it means the use of descriptive or predictive analytics during the process of dispute resolution and pre-dispute risk assessment. In short, it is using data to make smarter decisions on issues like costs and the likelihood of success of a particular claim.
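As a purely illustrative sketch of the descriptive end of that spectrum, the Python snippet below summarises success rates and average costs per claim type. The records, field layout and the `summarise` helper are hypothetical and invented for this example; they are not drawn from any real dataset or product.

```python
from collections import defaultdict

# Hypothetical closed-claim records: (claim_type, claimant_succeeded, costs_gbp).
# These values are invented for illustration and are not real case data.
CLAIMS = [
    ("breach_of_contract", True, 12000),
    ("breach_of_contract", False, 18000),
    ("breach_of_contract", True, 9000),
    ("professional_negligence", False, 40000),
    ("professional_negligence", True, 35000),
]


def summarise(claims):
    """Return per-claim-type counts, success rates and mean costs."""
    grouped = defaultdict(list)
    for claim_type, succeeded, costs in claims:
        grouped[claim_type].append((succeeded, costs))

    summary = {}
    for claim_type, rows in grouped.items():
        n = len(rows)
        summary[claim_type] = {
            "n": n,
            # Share of past claims of this type in which the claimant succeeded.
            "success_rate": sum(1 for succeeded, _ in rows if succeeded) / n,
            # Average costs across past claims of this type.
            "mean_costs": sum(costs for _, costs in rows) / n,
        }
    return summary


if __name__ == "__main__":
    for claim_type, stats in summarise(CLAIMS).items():
        print(claim_type, stats)
```

With so few records per category the figures would be little more than anecdote, which is why the volume and organisation of the underlying data matter so much.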

Judicial analytics

Predicting judicial decisions using statistical methods is not an entirely novel idea – it has existed in one form or another since the first wave of enthusiasm about AI in the 1950s. However, multiple iterations of Moore’s law and a massive increase in storage capacity, coupled with highly successful use-cases of data analytics in other industries, have only recently pushed the idea into the mainstream. In the United Kingdom, legal information heavyweights like Thomson Reuters and LexisNexis are already leveraging their comprehensive indexed databases to offer judicial analytics, but that is not the case across all jurisdictions. Comprehensive databases with rich metadata are not always the norm, and open-source data on judicial decisions can be scattered across multiple websites and archived in a variety of formats, not all of which are machine-readable. Significant housekeeping is still necessary before sufficiently high-level analytics can be performed in those markets, and that is an opportunity both for large legal information companies and for incumbent judicial analysis providers.

Analysing Alternative Dispute Resolution

However, litigation is only the last, and not always a necessary, step in a long dispute resolution process. If one wants to use all available data to advise a client about the best course of action before proceeding to litigation, it becomes necessary to analyse settlement agreements and arbitral decisions. The difficulty is that they are often locked behind strict confidentiality protocols and are held in a very decentralised way, with each law firm maintaining its own confidential dataset (which may or may not be organised and labelled appropriately).

This decentralised and potentially disorganised data is a challenge, but it also creates opportunities for technology start-ups and existing information services providers. 

One of the possible solutions is to use natural language processing algorithms which have already been trained on datasets like the English Wikipedia. This can ensure confidentiality, but law firms and in-house legal teams need to be wary of the gaps between general-purpose English and legal English and the impact this can have on the results. A possible alternative is to use models which can be trained on smaller datasets. This can also help law firms or in-house legal teams keep their analytics confidential, but it still runs the risk of producing unreliable results if the data it is trained on is somehow anomalous.  
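One rough way to get a feel for the gap between general-purpose English and legal English is to measure how many words in a legal text fall outside a general vocabulary. The sketch below is a minimal illustration under stated assumptions: the tiny vocabulary, the sample clause and the `out_of_vocabulary_rate` helper are all hypothetical, and a real check would use the actual vocabulary of the pre-trained model in question.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def out_of_vocabulary_rate(text, vocabulary):
    """Return the share of tokens in `text` that are missing from `vocabulary`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocabulary]
    return len(unknown) / len(tokens)


# Hypothetical stand-in for a general-purpose vocabulary (invented for illustration).
GENERAL_VOCAB = set(tokenize(
    "the party agrees to pay the other party for any loss "
    "and to keep the terms secret"
))

# Hypothetical legal clause (invented for illustration).
CLAUSE = "The indemnifying party shall indemnify the indemnitee for any loss"

if __name__ == "__main__":
    rate = out_of_vocabulary_rate(CLAUSE, GENERAL_VOCAB)
    print(f"Share of clause tokens outside the general vocabulary: {rate:.0%}")
```

A high share of unfamiliar terms does not prove that a model will perform badly, but it is a cheap early warning that results on legal text deserve closer scrutiny.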

The state of the data in ADR demands consideration of the possible ways of engineering around the problem of decentralised and disorganised data, and the opportunities which legal technology might capitalise on in the future. But it also highlights how important it is for law firms and in-house legal teams to improve their data literacy and be aware of the limitations of their data sets.

Access to justice

In those parts of our legal system where the main stakeholders are the government, universities and legal aid providers and there are fewer incentives to keep data proprietary and confidential, it is possible to use a more ambitious model for legal data analytics: the legal data commons proposed by Margaret Hagan, Jameson Dempsey and Jorge Gabriel Jiménez. This draws upon data sharing strategies in the social and natural sciences, and proposes a centralised body of data which can be used for research and development in legal AI, with a particular focus on improving access to justice and the state of our legal system. 

Such a solution could not only solve the problem of unstructured and decentralised legal data in this corner of the law, but could also significantly improve our understanding of the problems in our legal system. The structured data that such an initiative could provide could be used to improve judicial analytics, and to help create NLP algorithms trained on legal text as opposed to general-purpose English.

The road ahead

Legal data analytics holds a lot of promise. It can supplement human lawyers to give law firms and in-house legal teams the speed, cost-efficiency and agility demanded by the legal market of the future, and it can help us understand the problems in our legal system and improve access to justice. There are a lot of opportunities for start-ups and existing companies in this field, but capitalising on the hype has to start with the boring task of improving our data literacy and cleaning up and consolidating our data.

Ralie Belcheva