AWS Redshift is one of the largest cloud data warehouses in the world, with tens of thousands of clusters, and it revolutionized data warehousing. In today's world, where more and more people within organizations get access to data warehouses to perform data analytics, there is a growing need for fine-grained access control in those warehouses.
The need is driven mainly by security and compliance. Organizations want to reduce the risks that come with broader data access, and part of that is limiting access to certain data that exists in data warehouses.
The guide below is adapted in part from a guide I wrote about AWS Redshift security, which covers this topic as well as others, such as auditing and monitoring, column-based security and more.
What is row-level access control?
Row-level access control (also known as row-level security) means that, in certain tables, certain users can only access specific data items (rows). These rows are filtered based on the value of one of the columns.
This filtering defines which role can access a specific row. In most cases, it additionally requires a translation or lookup table that maps roles to the types of rows they can access within the specific table. Typical examples:
- A certain role can only access rows from a certain region
- Certain roles can only access rows without certain sensitive data
- A table holds data for several teams (some of them possibly even external to the organization), and each team can only access its own data.
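In SQL terms, all of these boil down to the same lookup-table pattern; here is a rough sketch of the shape of the predicate (the table and column names here are illustrative, not part of the example that follows):

```sql
/* Illustrative only: "orders" and "user_regions" are hypothetical tables.
   Each user sees only the rows whose region appears in their mapping. */
SELECT *
FROM orders
WHERE region IN (
    SELECT region
    FROM user_regions
    WHERE user_name = CURRENT_USER
);
```

The rest of this post builds exactly this pattern on Redshift, step by step.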
Setting up row-based security in Redshift: a POC
Setting up row level access control in Redshift is not entirely straightforward. However, it’s certainly doable. Let’s see how we can achieve that:
In our sample, we have a table called department_employees, which contains details about the employees in each department. Let's create the table and populate it with employees from different regions:
CREATE TABLE department_employees (
    id integer,
    name varchar(100),
    phone varchar(30),
    daily_rate integer,  /* column names other than "region" are assumed;
                            the original listing only showed "region" */
    region varchar(50)
);

INSERT INTO department_employees VALUES
    (1, 'Jina Jay', '+1-212-5555555', 180, 'us'),
    (2, 'Keena Kay', '+1-212-5556666', 240, 'us'),
    (3, 'Fiona Newfoundland', '+1-709-5555555', 210, 'ca'),
    (4, 'Ben Labrador', '+1-709-5556666', 270, 'ca');
Now, we get a requirement from our beloved management: only accountants from a given region should be able to see the details of employees from that region. In this example, only accountants from Canada will be able to view the details of employees from the great origin of the Labrador and Newfoundland breeds.
Unfortunately, in Redshift it is not possible to mix metadata stored on the leader node (such as correlating CURRENT_USER_ID or CURRENT_USER against pg_user and pg_group) with queries that run across the entire cluster. Instead, I'm adding a table that maps each user to their respective role:
CREATE TABLE users_to_groups
    (user_name varchar(100), group_name varchar(100));

INSERT INTO users_to_groups VALUES
    ('ca_accountant', 'ca');

/* Let's also create an accountant user */
CREATE USER ca_accountant WITH PASSWORD 'xyzzy-1-XYZZY';
Now that we have the table and the user, we create a view that pulls information from department_employees, filtered according to users_to_groups, and grant the SELECT privilege to the user ca_accountant:
CREATE VIEW v_department_employees AS
SELECT * FROM department_employees
WHERE region IN (SELECT group_name FROM users_to_groups
                 WHERE user_name = CURRENT_USER);

/* Granting access to the user on the view and the mapping table */
GRANT SELECT ON users_to_groups TO ca_accountant;
GRANT SELECT ON v_department_employees TO ca_accountant;

/* Switching to the context of the user 'ca_accountant' */
SET SESSION AUTHORIZATION ca_accountant;

SELECT * FROM department_employees;
/* We get a permission denied error, as we don't have access to the table itself:
   Invalid operation: permission denied for relation department_employees */

SELECT * FROM v_department_employees;
/* We now get the filtered rows */
After doing this, we get a filtered view containing only the rows we should see, i.e. only the Canadian employees.
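When you're done testing under the impersonated user, you can return to your original session context with the standard Postgres/Redshift command:

```sql
/* Return from the ca_accountant context to the original session user */
RESET SESSION AUTHORIZATION;
```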
While this is just a PoC, the implementation works. In a real deployment, it is better to match on the user ID rather than the user name, for performance reasons. You could also wrap the same logic in a UDF (user-defined function) and incorporate it in your views, or use a BI tool, such as QuickSight, to enforce row-level security. The latter limits the protection to data queried through that BI layer.
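A sketch of the user-ID variant mentioned above, assuming a hypothetical mapping table keyed by the Redshift user ID instead of the user name (integer comparisons are cheaper than varchar comparisons):

```sql
/* users_to_groups_by_id is a hypothetical table; the ID 105 is a placeholder.
   The user ID can be looked up once, on the leader node, via:
   SELECT usesysid FROM pg_user WHERE usename = 'ca_accountant'; */
CREATE TABLE users_to_groups_by_id
    (user_id integer, group_name varchar(100));

INSERT INTO users_to_groups_by_id VALUES (105, 'ca');

/* Same view as before, but filtering on CURRENT_USER_ID */
CREATE VIEW v_department_employees_by_id AS
SELECT * FROM department_employees
WHERE region IN (SELECT group_name FROM users_to_groups_by_id
                 WHERE user_id = CURRENT_USER_ID);
```

The trade-off is that you now have to keep the stored IDs in sync whenever users are dropped and recreated.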
This is just an example, and I assume you will need to apply your own logic to it. I've taken the examples from a Redshift security guide I wrote for Satori, where we solve the problem of granular access control for different data stores, as well as provide enterprises with visibility and control over their data stores and the sensitive data sitting there.
Though setting up row-based security on AWS Redshift is possible, it may be complicated to set up and maintain, and when use cases are more diverse, it may become unrealistic. At Satori we provide granular data access to data warehouses, databases and data lakes. The access control we provide can be based on the data queried (for example, allowing only specific roles to access PII), and by providing this outside of the data store itself we decouple security from the operation of the data store. If you'd like to learn more, visit our website, where you can also try out our product.
If interested, I also wrote a how-to guide for setting up Snowflake row level security.