Companies store copies of information in multiple locations to minimise the risk of data loss, but does our right to privacy suffer as a result?
Proposed data protection laws would require companies to delete information that could allow an individual to be identified, but existing data storage and duplication practices are at odds with the drive to protect our right to privacy.
The 'right to be forgotten' is one of many concepts that could be introduced into new EU legislation on data protection later this year, but according to some within the data processing industry, most firms are still a long way from being able to comply.
Central to the idea of being digitally forgotten is the concept of anonymisation - stripping personally identifiable details out of data such that anyone coming into possession of it would be unable to trace the individuals to whom it refers.
Earlier this year a Harvard professor was able to re-identify individuals in a genetics database by cross-referencing it with public records. The accuracy rate was 42% when only three types of information (zip code, date of birth and gender) were present, rising to 97% when a first name or nickname, which could easily be extracted from many email addresses, was added.
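To illustrate how this kind of cross-referencing works in principle, here is a minimal Python sketch. It is not the method used in the Harvard study: the records, the field names and the `reidentify` helper are all hypothetical, and it assumes the "anonymous" data and the public records share exactly the same three fields.

```python
# Hypothetical field names; real datasets will differ.
QUASI_IDENTIFIERS = ("zip_code", "date_of_birth", "gender")


def key(record, fields=QUASI_IDENTIFIERS):
    """Build the combination of values used to link two datasets."""
    return tuple(record[f] for f in fields)


def reidentify(anonymous_records, public_records, fields=QUASI_IDENTIFIERS):
    """Map the index of each anonymous record to a name, but only when
    exactly one public record shares its combination of values."""
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(key(person, fields), []).append(person["name"])

    matches = {}
    for index, record in enumerate(anonymous_records):
        candidates = names_by_key.get(key(record, fields), [])
        if len(candidates) == 1:  # an unambiguous link
            matches[index] = candidates[0]
    return matches


# Invented sample data: names removed, other details left in place.
anonymous = [
    {"zip_code": "02138", "date_of_birth": "1970-01-01", "gender": "F"},
    {"zip_code": "02139", "date_of_birth": "1980-05-05", "gender": "M"},
]
# Invented public records, such as an electoral roll.
public = [
    {"name": "Alice Example", "zip_code": "02138",
     "date_of_birth": "1970-01-01", "gender": "F"},
    {"name": "Bob Example", "zip_code": "02139",
     "date_of_birth": "1980-05-05", "gender": "M"},
    {"name": "Carl Example", "zip_code": "02139",
     "date_of_birth": "1980-05-05", "gender": "M"},
]

matches = reidentify(anonymous, public)
```

In this invented sample the first record links to a single person, while the second is shared by two people and so stays ambiguous. Adding a further field, such as a first name, would typically shrink those ambiguous groups.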
The Information Commissioner's Office has already released guidelines concerning anonymous data, but when the European Commission finishes refining the existing EU Data Protection Directive it is likely that more stringent legislation will be put in place.
The idea of simply deleting personally identifiable information from a database following the initial collection and analysis stage may not seem overly complex, nor would it stop a data processing company from carrying out supplementary analysis on the remaining data. But in reality the data security processes in place at many large organisations can make this task far from straightforward.
The challenge
In large businesses, a database is a live system that is in use second by second, and because its contents are constantly changing it has to be constantly backed up. Every time an entry is made into that database, a copy is made of that record.
"At some point in the day that entire database will be backed up - probably to at least two different locations, one of which will almost certainly be off site. Studies have shown over the years that a given organisation will have anything up to eight copies of any piece of data", said Bob Plumridge, regional CTO of Hitachi Data Systems.
"That's OK, until you want to strip things out of that data. You could take my name, my address, my email address and my telephone number out of the live system, you could probably take it out of the first copy, but how do you go about doing that with the six other copies - most of which are probably not even in the same location?", said Plumridge.
Plumridge believes businesses are acutely aware of the difficulties they may soon face, but that overcoming them is proving harder than recognising them.
"Businesses are increasingly concerned that not only is this data being used in a legally correct way, but also an ethically correct way. The last thing they want to do is launch into these products and then find they're in trouble with the ICO or the EU over data privacy.
"There is a lot of legal discussion around the 'right to be forgotten', but if such a policy were to be enacted tomorrow, in my view most organisations would not be able to comply with it", said Plumridge.
Minimising risk
Failing to adequately anonymise data could have severe consequences, both legally and in terms of brand reputation, but such risks can be mitigated by well-established strategies, according to Nick Millman, managing director at Accenture Digital.
"Where anonymisation is concerned, it comes down to having a clear design for how you're going to remove the sensitive information. As an organisation you should be testing the various different combinations of data points left in the database to make sure it's not possible to get back to an individual.
"One method is to use an algorithm to mask the data, so taking a valid postcode, for example, and converting it into a random set of characters. There are a number of different commercial packages that do this out of the box, or you could develop your own algorithms, but regardless of the process and technology chosen, what we really encourage is rigorous testing", said Millman.
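As a rough illustration of the masking and testing Millman describes, the Python sketch below replaces each postcode with a random token and then checks how small the smallest group of matching records is. The function names, field names and sample records are hypothetical, and this is one simple approach under those assumptions rather than a description of any commercial package.

```python
import secrets
import string
from collections import Counter

ALPHABET = string.ascii_uppercase + string.digits


def mask_column(records, field, length=8):
    """Return copies of the records with `field` replaced by a random token.
    The same original value always gets the same token within one call,
    so records can still be grouped by the masked field."""
    mapping = {}
    used = set()
    masked = []
    for record in records:
        value = record[field]
        if value not in mapping:
            token = "".join(secrets.choice(ALPHABET) for _ in range(length))
            while token in used:  # avoid two values sharing a token
                token = "".join(secrets.choice(ALPHABET) for _ in range(length))
            used.add(token)
            mapping[value] = token
        masked.append({**record, field: mapping[value]})
    return masked


def smallest_group(records, fields):
    """Size of the smallest set of records sharing the same values for
    `fields`. A result of 1 means at least one record is still unique."""
    counts = Counter(tuple(r[f] for f in fields) for r in records)
    return min(counts.values(), default=0)


# Invented sample data.
records = [
    {"postcode": "N1 9GU", "birth_year": 1980, "gender": "F"},
    {"postcode": "N1 9GU", "birth_year": 1980, "gender": "F"},
    {"postcode": "SW1A 1AA", "birth_year": 1975, "gender": "M"},
]

masked = mask_column(records, "postcode")
risk = smallest_group(masked, ("postcode", "birth_year", "gender"))
```

Here `risk` comes out as 1: the third record is still the only one with its combination of values, even though its postcode has been masked. That is the kind of result the testing is meant to surface before the data is treated as anonymous.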
Before such steps can be taken, however, an organisation must build up a comprehensive picture of where every piece of data is stored, how many copies exist and where any potentially critical metadata are held.
"Establishing a data lineage is a crucial part of the picture. This allows a company to trace back every piece of stored data to where it originated from, and also tracks any other places where it is stored", said Millman.
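A data lineage can be thought of as a register that records, for each piece of data, where it came from and everywhere it has since been copied. The Python sketch below is a deliberately simplified illustration: the `LineageRegistry` class, its method names and the location labels are hypothetical, and a real system would need to capture copies automatically rather than rely on manual calls.

```python
from dataclasses import dataclass, field


@dataclass
class LineageEntry:
    origin: str                                # where the record was first stored
    copies: set = field(default_factory=set)   # every known location, origin included


class LineageRegistry:
    """Tracks the origin and all known copies of each record."""

    def __init__(self):
        self._entries = {}

    def register(self, record_id, origin):
        self._entries[record_id] = LineageEntry(origin=origin, copies={origin})

    def record_copy(self, record_id, location):
        self._entries[record_id].copies.add(location)

    def origin(self, record_id):
        return self._entries[record_id].origin

    def locations(self, record_id):
        """Every place that would need to be cleaned to remove the record."""
        return sorted(self._entries[record_id].copies)


# Invented example: one customer record and its backups.
registry = LineageRegistry()
registry.register("customer-123", "live-database")
registry.record_copy("customer-123", "onsite-backup")
registry.record_copy("customer-123", "offsite-backup")

to_clean = registry.locations("customer-123")
```

With a register like this, a request to remove an individual's details can be turned into a list of locations to visit, which speaks directly to Plumridge's question about the copies held beyond the live system.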
EU legislators are still revising proposed reforms to the Directive, many of which have been fiercely opposed by the UK, and it is hoped that a compromise can be reached by the European Parliamentary elections in May 2014.
Are we really restricted to either secure data storage or effective anonymisation, or is this a false choice? Join in the debate in the comments below, or contact me directly on Twitter at @jburnmurdoch or @GuardianData
