It may be revolutionising the way we do business - but is Big Data secure? Guillermo Lafuente offers much-needed advice and guidance
The biggest challenge for big data from a security point of view is the protection of users’ privacy. Big data frequently contains huge amounts of personally identifiable information, so the privacy of users is a major concern.
Because of the large amount of data stored, breaches affecting big data can have more devastating consequences than the data breaches we normally see in the press. A big data security breach will potentially affect a much larger number of people, with consequences that are not only reputational but also carry enormous legal repercussions.
When producing information for big data, organizations have to ensure that they have the right balance between utility of the data and privacy. Before the data is stored it should be adequately anonymised, removing any unique identifier for a user. This in itself can be a security challenge, as removing unique identifiers might not be enough to guarantee that the data will remain anonymous. The anonymized data could be cross-referenced with other available data using de-anonymization techniques.
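To see why removing unique identifiers may not be enough, consider a minimal sketch of a cross-referencing (linkage) attack. All records, field names and the `link_records` helper below are hypothetical and exist only for illustration; real de-anonymization techniques are considerably more sophisticated.

```python
# Hypothetical "anonymised" records: names are removed, but quasi-identifiers
# (postcode, birth year, gender) are still present.
anonymised = [
    {"postcode": "AB1 2CD", "birth_year": 1975, "gender": "F", "diagnosis": "asthma"},
    {"postcode": "EF3 4GH", "birth_year": 1982, "gender": "M", "diagnosis": "diabetes"},
    {"postcode": "AB1 2CD", "birth_year": 1990, "gender": "M", "diagnosis": "migraine"},
]

# Hypothetical publicly available data set that does include names.
public = [
    {"name": "Alice Example", "postcode": "AB1 2CD", "birth_year": 1975, "gender": "F"},
    {"name": "Bob Example", "postcode": "EF3 4GH", "birth_year": 1982, "gender": "M"},
    {"name": "Carol Example", "postcode": "ZZ9 9ZZ", "birth_year": 1960, "gender": "F"},
]

QUASI_IDENTIFIERS = ("postcode", "birth_year", "gender")


def link_records(anonymised_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Re-identify rows whose quasi-identifiers match exactly one public record."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anonymised_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the person
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = link_records(anonymised, public)
print(reidentified)
```

In this made-up example two of the three "anonymous" records are tied back to a named person, even though no name was ever stored alongside the diagnosis.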
When storing the data, organizations will face the problem of encryption. With conventional encryption, data cannot be sent encrypted by the users if the cloud needs to perform operations over the data. A solution for this is to use “Fully Homomorphic Encryption” (FHE), which allows the cloud to perform operations directly over the encrypted data, producing new encrypted data as the result. When that result is decrypted it will be the same as if the operations had been carried out over the plain text data. Therefore, the cloud will be able to perform operations over encrypted data without knowledge of the underlying plain text data.
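The idea of computing on encrypted data can be illustrated with a toy example. The sketch below is not FHE: it uses the Paillier scheme, which is only additively homomorphic, and deliberately tiny primes that offer no security whatsoever. The function names and key sizes are assumptions chosen for readability.

```python
import math
import secrets


def generate_keys(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under the public key n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random value in [1, n - 1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, c):
    """Recover the plain text from ciphertext c using the private key."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Performed by the 'cloud': adds two values without decrypting them."""
    return (c1 * c2) % (n * n)


n, private_key = generate_keys()
c_total = add_encrypted(n, encrypt(n, 1200), encrypt(n, 345))
total = decrypt(n, private_key, c_total)
print(total)  # 1545
```

The party calling `add_encrypted` never sees 1200, 345 or 1545; only the holder of the private key can read the sum. A fully homomorphic scheme extends this property to arbitrary computations rather than addition alone.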
While using big data, a significant challenge is how to establish ownership of information. If the data is stored in the cloud, a trust boundary should be established between the data owners and the data storage owners.
Adequate access control mechanisms will be key in protecting the data. Access control has traditionally been provided by operating systems or applications restricting access to the information, which typically exposes all the information if the system or application is hacked. A better approach is to protect the information using encryption that only allows decryption if the entity trying to access the information is authorised by an access control policy.
An additional problem is that software commonly used to store big data, such as Hadoop, doesn’t always come with user authentication by default. This makes the problem of access control worse, as a default installation would leave the information open to unauthenticated users. Big data solutions often rely on traditional firewalls or implementations at the application layer to restrict access to the information.
Big data is a relatively new concept and therefore there is not yet a list of best practices that is widely recognized by the security community. However, there are a number of general security recommendations that can be applied to big data:
The main solution to ensuring that data remains protected is the adequate use of encryption. For example, Attribute-Based Encryption can help in providing fine-grained access control of encrypted data.
Anonymizing the data is also important to ensure that privacy concerns are addressed. It should be ensured that all sensitive information is removed from the set of records collected.
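As a rough illustration of what removing sensitive information from a record can look like, the sketch below drops direct identifiers and coarsens two quasi-identifiers. The field names, the `DIRECT_IDENTIFIERS` set and the generalisation rules are all assumptions; which fields count as sensitive depends entirely on the data set in question.

```python
# Hypothetical set of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def anonymise(record):
    """Drop direct identifiers and generalise quasi-identifiers in one record."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "birth_year" in cleaned:
        # Keep only the decade, e.g. 1975 becomes 1970.
        cleaned["birth_decade"] = cleaned.pop("birth_year") // 10 * 10
    if "postcode" in cleaned:
        # Keep only the first part of the postcode, e.g. "AB1 2CD" becomes "AB1".
        cleaned["postcode_area"] = cleaned.pop("postcode").split()[0]
    return cleaned


record = {
    "name": "Alice Example",
    "email": "alice@example.com",
    "postcode": "AB1 2CD",
    "birth_year": 1975,
    "diagnosis": "asthma",
}
anonymised_record = anonymise(record)
print(anonymised_record)
```

Generalising values in this way trades some utility of the data for privacy, and it reduces rather than eliminates the risk of re-identification.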
Real-time security monitoring is also a key security component for a big data project. It is important that organizations monitor access to ensure that there is no unauthorised access. It is also important that threat intelligence is in place to ensure that more sophisticated attacks are detected and that the organizations can react to threats accordingly.
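A very simple form of access monitoring can be sketched as follows. The event format, the user names and the `max_denied` threshold are hypothetical; a production system would typically stream events from audit logs and apply far richer detection logic.

```python
from collections import Counter

# Hypothetical access-log events: (user, resource, access_allowed)
events = [
    ("alice", "/sales/2013", True),
    ("mallory", "/hr/salaries", False),
    ("mallory", "/hr/medical", False),
    ("bob", "/sales/2013", True),
    ("mallory", "/finance/payroll", False),
    ("bob", "/hr/salaries", False),
]


def flag_suspicious(log, max_denied=3):
    """Return users with at least max_denied denied access attempts."""
    denied = Counter(user for user, _resource, allowed in log if not allowed)
    return sorted(user for user, count in denied.items() if count >= max_denied)


suspicious = flag_suspicious(events)
print(suspicious)  # ['mallory']
```

Counting denied requests only catches the crudest attacks, which is why the threat intelligence mentioned above is needed to detect more sophisticated ones.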
Organizations should run a risk assessment over the data they are collecting. They should consider whether they are collecting any customer information that should be kept private and establish adequate policies that protect the data and the right to privacy of their clients.
If the data is shared with other organizations, then how it is shared should be considered carefully. Deliberately released data that turns out to infringe on privacy can have a huge impact on an organization from a reputational and economic point of view.
Organizations should also carefully consider regional laws around handling customer data, such as the EU Data Protection Directive.
In the past, large data sets were stored in highly structured relational databases. If you wanted to look for sensitive data such as health records of a patient, you knew exactly where to look and how to access the data. Also, removing any identifiable information was easier in relational databases. Big data makes this a more complex process, especially if the data is unstructured. Organizations will have to track down what pieces of information in their big data are sensitive and they will need to carefully isolate this information to ensure compliance.
Another challenge in the case of big data is that you can have a wide variety of users, each needing access to a particular subset of information. This means that the encryption solution you choose to protect the data has to reflect this new reality. Access control to the data will also need to be more granular to ensure people can only access information they are authorised to see.
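Granular access of this kind can be pictured as a per-role view of each record. The sketch below only filters fields in application code and does not involve any encryption; the roles, field names and the `view_for` helper are assumptions made for the example.

```python
# Hypothetical mapping of roles to the fields each role may see.
ROLE_FIELDS = {
    "analyst": {"region", "purchase_total"},
    "support": {"customer_id", "email"},
    "auditor": {"customer_id", "region", "purchase_total"},
}


def view_for(role, record):
    """Return only the fields of record that the given role is authorised to see."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {k: v for k, v in record.items() if k in allowed}


record = {
    "customer_id": 1042,
    "email": "alice@example.com",
    "region": "North",
    "purchase_total": 250.0,
}
analyst_view = view_for("analyst", record)
print(analyst_view)
```

An analyst sees the region and purchase total but never the e-mail address, while a role that is not listed sees nothing at all.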
The main challenge introduced by big data is how to identify sensitive pieces of information that are stored within the unstructured data set. Organizations must make sure that they isolate sensitive information and they should be able to prove that they have adequate processes in place to achieve it. Some vendors are starting to offer compliance toolkits designed to work in a big data environment.
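One common starting point for finding sensitive items in unstructured text is pattern matching. The sketch below looks for e-mail addresses and 16-digit card numbers, using the Luhn checksum to discard digit runs that cannot be card numbers. The patterns and the sample text are illustrative assumptions and will miss many real-world formats.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Sixteen digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def luhn_valid(number):
    """Check the Luhn checksum used by payment card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text):
    """Return (kind, value) pairs for sensitive-looking items found in text."""
    findings = [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    findings += [
        ("card", m.group()) for m in CARD_RE.finditer(text) if luhn_valid(m.group())
    ]
    return findings


sample = "Contact jane@example.com, card 4111 1111 1111 1111, order 1234 5678 9012 3456."
findings = find_sensitive(sample)
print(findings)
```

The order number in the sample has the same shape as a card number but fails the checksum, so it is not flagged. Checks like this reduce false positives, which matters when scanning very large data sets.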
Anyone using third-party cloud providers to store or process data will need to ensure that the providers are complying with regulations.
Security is a process, not a product. Therefore, organizations using big data will need to introduce adequate processes that help them effectively manage and protect the data.
Traditional information lifecycle management can be applied to big data to ensure that the data is not being stored once it is no longer needed. Also, policies related to availability and recovery times will still apply to big data.
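A retention rule of this kind can be expressed in a few lines. The categories, retention periods and record layout below are hypothetical; actual periods would come from an organization's own policies and legal obligations.

```python
from datetime import date, timedelta

# Hypothetical retention periods per data category.
RETENTION = {
    "clickstream": timedelta(days=90),
    "transactions": timedelta(days=365 * 7),
}


def expired(records, today):
    """Return the records that have outlived their category's retention period."""
    return [r for r in records if today - r["collected"] > RETENTION[r["category"]]]


records = [
    {"id": 1, "category": "clickstream", "collected": date(2023, 9, 1)},
    {"id": 2, "category": "clickstream", "collected": date(2023, 12, 1)},
    {"id": 3, "category": "transactions", "collected": date(2020, 1, 1)},
]
to_delete = expired(records, today=date(2024, 1, 1))
print([r["id"] for r in to_delete])  # [1]
```

Passing `today` in explicitly rather than reading the system clock keeps the rule easy to test and to re-run for a past date.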
However, organizations have to consider the volume, velocity and complexity of big data and amend their information lifecycle management accordingly.
If an adequate governance framework is not applied to big data then the data collected could be misleading and cause unexpected costs.
The main problem from a governance point of view is that big data is a relatively new concept, and therefore established procedures and policies for it are still lacking.
The challenge with big data is that the unstructured nature of the information makes it difficult to categorize, model and map the data when it is captured and stored. The problem is made worse by the fact that the data normally comes from external sources, often making it complicated to confirm its accuracy.
What organizations need to do is to identify what information is of value for the business. If they capture all the information available they risk wasting time and resources processing data that will add little or no value to the business.