Cracking the Code
Lawyers are getting their heads around predictive coding to help them find relevant documents quickly and efficiently
There has been a lot of discussion recently about predictive coding, or Technology Assisted Review (TAR), as a way to help litigators find relevant documents quickly and efficiently.
For those who may not have heard of predictive coding, it is software that helps lawyers review a large electronic document set more quickly and effectively than a traditional manual review.
One way the software works is that it can ‘learn’ which documents are relevant based on inputs from the review team.
The use of this software is becoming widespread, especially in the US, where there has been judicial endorsement of the use of the software in an effort to reduce the costs of discovery.
Of course, many lawyers are wary of software that claims to identify relevant documents when not every document has been reviewed by a human.
The possibility that a highly relevant document might be missed is always at the forefront of a lawyer’s mind. The reality, however, is that with electronic documents there are often simply too many in the initial pool for human reviewers to read, and the cost of doing so is not proportionate to the claim.
One issue with electronic documents, compared with hard copy documents, is that there are often many more of them. The advantage, however, is that electronic documents can be searched, and clever technology is continually being developed to handle large volumes of data.
TAR has been used for several years now in electronic discovery, especially in the US. However, some commentators are now questioning whether potentially relevant documents are being excluded by the need to cull large volumes of documents, not because the technology is ineffective, but rather, because there is little expertise in the application of processes used in conjunction with the technology to ensure effective results.
In the US, TAR is not only widely used; its use is also judicially encouraged.
Indeed, in one matter, the court set out a straightforward process for the lawyers to find a control set of documents and to use that control set to train the system to find similar documents.
The court can also review the statistics produced by the system in order to determine the effectiveness of the process.
This hands-on approach by the courts, however, is not the norm: usually courts have a hands-off approach and let the parties define the process.
Without guidance from the courts, however, the process remains somewhat nebulous and uncertain, especially as this is a new and emerging area.
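The article does not say which statistics a court would review. As an illustration only, recall and precision are two measures commonly used in information retrieval, and the sketch below computes them by comparing the system's calls with the lawyers' coding on a control set. The function name and the sample data are hypothetical.

```python
def control_set_stats(human_labels, system_labels):
    """Compare the system's calls with the lawyers' coding on a control set.

    Both arguments are lists of booleans in the same document order:
    True means 'relevant', False means 'not relevant'.
    Recall and precision are assumed measures, not ones named in the article.
    """
    pairs = list(zip(human_labels, system_labels))
    true_pos = sum(1 for human, system in pairs if human and system)
    false_pos = sum(1 for human, system in pairs if not human and system)
    false_neg = sum(1 for human, system in pairs if human and not system)

    # Recall: share of truly relevant documents that the system found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # Precision: share of documents flagged by the system that are truly relevant.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    return {"recall": recall, "precision": precision}


# Hypothetical control set of six documents.
human = [True, True, True, False, False, True]
system = [True, True, False, True, False, True]
stats = control_set_stats(human, system)
print(stats)
```

A low recall figure would suggest that relevant documents are being missed, which is the concern lawyers raise most often about the technology.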
Call in the experts
If the process is left to parties who have little or no expertise, not only in the use of such systems but also in the processes that should be put in place, then who does have such expertise? The Text REtrieval Conference (TREC) supports research for large-scale evaluation of text retrieval methodologies and has undertaken several studies on the use of systems for information retrieval in discovery.
In particular, a study was conducted in 2011 which demonstrated that the use of technology in legal review was more effective than manual review.
The author of a review of that study, Maura R. Grossman, has more recently conducted a further review of the best way in which the use of TAR should be conducted.
The study looked at three types of TAR tools: Continuous Active Learning (CAL), Simple Active Learning (SAL) and Simple Passive Learning (SPL). Essentially, all three ‘train’ the system to find relevant documents based on which documents the legal team codes as ‘relevant’.
Each method uses a process whereby a set of documents (the ‘training set’), say 1,000 documents, is coded by a senior lawyer as ‘relevant’ or ‘not relevant’, which the system then uses to ‘learn’ which other documents might be relevant as well.
This process is repeated several times until the review team is satisfied that a sufficient level of relevant documents has been found.
The difference between the three processes is whether randomly selected documents are used, or whether the set of documents has been located via a non-random method such as using basic keyword searching.
In the CAL method, the 1,000 documents are selected using keyword searches and then the documents that are coded by the lawyer are used to train a learning algorithm, which scores each document in the collection by the likelihood of it being relevant.
In SAL, the set of documents can be selected randomly or non-randomly, but then subsequent document sets for coding by the reviewer are selected based on those about which the learning algorithm is least certain.
With SPL, the document set is selected randomly, and the method relies on the review team working iteratively until there is some certainty that the review set is ‘adequate’.
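The difference between the three methods comes down to how the next batch of documents is chosen for the lawyer to code. The sketch below is a simplified illustration, not the study's implementation. It assumes the learning algorithm has already given each document a score between 0 and 1, and it assumes that CAL sends the highest-scoring unreviewed documents to the reviewer next, a detail the description above does not spell out. All names are hypothetical.

```python
import random


def next_batch(scores, reviewed, strategy, batch_size, rng):
    """Pick the next documents for the lawyer to code.

    scores     -- dict mapping document id to an estimated likelihood of relevance (0 to 1)
    reviewed   -- set of document ids that have already been coded
    strategy   -- "CAL", "SAL" or "SPL"
    batch_size -- how many documents to hand to the reviewer
    rng        -- a random.Random instance, used only by SPL
    """
    candidates = [doc for doc in scores if doc not in reviewed]
    if strategy == "CAL":
        # Assumption: review the documents the system thinks most likely relevant.
        ranked = sorted(candidates, key=lambda doc: scores[doc], reverse=True)
    elif strategy == "SAL":
        # Review the documents the system is least certain about (score nearest 0.5).
        ranked = sorted(candidates, key=lambda doc: abs(scores[doc] - 0.5))
    elif strategy == "SPL":
        # Ignore the scores and pick at random.
        ranked = rng.sample(candidates, len(candidates))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:batch_size]


# Hypothetical scores for five documents; document "a" has already been coded.
scores = {"a": 0.9, "b": 0.6, "c": 0.1, "d": 0.45, "e": 0.7}
reviewed = {"a"}
rng = random.Random(0)

cal_batch = next_batch(scores, reviewed, "CAL", 2, rng)
sal_batch = next_batch(scores, reviewed, "SAL", 2, rng)
spl_batch = next_batch(scores, reviewed, "SPL", 2, rng)
print(cal_batch, sal_batch, spl_batch)
```

In practice the coded batch would be fed back to retrain the system and the scores recalculated before the next batch is chosen, which is the repeated process described earlier.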
The study concluded that when keyword searches are used to select all of the training sets, the result was superior to that achieved when a random selection is used, and summed up that “random training tends to be biased in favour of commonly occurring types of relevant documents, at the expense of rare types.
Non-random training can counter this bias by uncovering relevant examples of rare types of documents that would be unlikely to appear in a random sample”.
Such studies are extremely valuable in learning how best to use this technology; however, further guidelines and endorsement from the courts would be welcome.
Allison Stanfield is the director of e.Law International, a legal consulting and legal technology service provider specialising in e-discovery.