— Written by Łukasz Stefaniak — 11/3/2023

Better parsing and support for column-level lineage

This week, we made significant progress on improving our parser and analysis coverage for Snowflake, BigQuery, Redshift and ClickHouse.

Continued work on column-level lineage

We added DDL retrieval for the Snowflake data warehouse to enhance code navigation. Our parser now correctly recognises SAMPLE and TABLESAMPLE clauses in queries and CLUSTER BY in table creation statements. We also made many changes to handle DWH-specific edge cases, such as fully quoted identifiers in BigQuery, SQL functions that share a name across warehouses but take their arguments in flipped order, keywords that are generally reserved but accepted by Snowflake, and dialect-specific rules for lateral column handling.
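To illustrate the fully quoted identifier case: BigQuery accepts a table path quoted as a whole in one pair of backticks, as well as quoted part by part. The sketch below is a simplified illustration and not our parser's implementation. The function name `split_bigquery_path` is hypothetical, and the sketch assumes the input is a table path only (no column or field access).

```python
from typing import List


def split_bigquery_path(identifier: str) -> List[str]:
    """Split a BigQuery table path into its parts.

    Hypothetical helper for illustration. It treats every dot as a path
    separator, whether or not it appears inside backticks, which is how
    a fully quoted path such as `project.dataset.table` is meant to be read.
    """
    if identifier.count("`") % 2 != 0:
        raise ValueError(f"unbalanced backticks in: {identifier}")
    parts = identifier.replace("`", "").split(".")
    if any(part == "" for part in parts):
        raise ValueError(f"empty path segment in: {identifier}")
    return parts


# Both quoting styles resolve to the same three-part path.
FULLY_QUOTED = split_bigquery_path("`my-project.sales.orders`")
PART_QUOTED = split_bigquery_path("`my-project`.sales.orders")
```

Normalising both spellings to the same list of parts means lineage for the table is attached to one node rather than two.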

We added support for PIVOT and UNPIVOT operations, including correct tracking of multiple aggregations in data warehouses that support them. We also added support for JOINs against subqueries.
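As a rough sketch of what tracking multiple aggregations through a PIVOT involves: each output column depends on the aggregated source column and on the pivot column whose values select the rows. The function `pivot_lineage` and the `<alias>_<value>` output naming are assumptions made for this example. Actual output column names differ between warehouses.

```python
from typing import Dict, List, Tuple


def pivot_lineage(
    aggregations: List[Tuple[str, str]],
    pivot_column: str,
    pivot_values: List[str],
) -> Dict[str, List[str]]:
    """Map each pivoted output column to its upstream columns.

    aggregations: (alias, source_column) pairs, one per aggregate in the PIVOT.
    The "<alias>_<value>" naming is an assumption made for illustration.
    """
    lineage: Dict[str, List[str]] = {}
    for value in pivot_values:
        for alias, source_column in aggregations:
            # Every output column depends on the aggregated column and on the
            # pivot column, because the pivot value decides which rows count.
            lineage[f"{alias}_{value}"] = [source_column, pivot_column]
    return lineage


# Two aggregations over two pivot values give four output columns.
PIVOTED = pivot_lineage(
    aggregations=[("total", "sales"), ("units", "quantity")],
    pivot_column="quarter",
    pivot_values=["Q1", "Q2"],
)
```

With a single aggregation the same function yields one output column per pivot value, so the multi-aggregation case is a direct generalisation.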

As we get closer to our planned release, we are focusing on ensuring the quality of the produced lineage results. To achieve this, we extended our test suite to cover more exotic examples of SQL syntax and added automatic coverage report generation. This allows us to track parsing/analysis completeness and prevent future regressions. With the help of Synq monitors, we track changes to the syntax coverage.
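A coverage report of this kind can be sketched in a few lines. The code below is a minimal illustration and not our actual tooling. The names `coverage_by_dialect` and `regressions`, and the shape of the input data, are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def coverage_by_dialect(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute the share of successfully parsed test queries per dialect.

    results: (dialect, parsed_ok) pairs, one per test query.
    """
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for dialect, parsed_ok in results:
        totals[dialect] += 1
        if parsed_ok:
            passed[dialect] += 1
    return {dialect: passed[dialect] / totals[dialect] for dialect in totals}


def regressions(previous: Dict[str, float], current: Dict[str, float]) -> List[str]:
    """List dialects whose coverage dropped compared with the previous report."""
    return sorted(
        dialect
        for dialect, old_value in previous.items()
        if current.get(dialect, 0.0) < old_value
    )


REPORT = coverage_by_dialect(
    [("snowflake", True), ("snowflake", False), ("bigquery", True)]
)
DROPPED = regressions({"snowflake": 0.75, "bigquery": 1.0}, REPORT)
```

Comparing each new report against the previous one is what turns a coverage number into a regression check: any dialect returned by `regressions` would be a candidate for an alert.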

This week, we also started enhancing our analysis with information about known tables and their columns. This will significantly improve the handling of queries such as SELECT * FROM foo UNION ALL SELECT * FROM bar, and of other cases where a SQL wildcard is used.
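To show why known columns matter for wildcards, here is a minimal sketch of resolving the UNION ALL example against a table catalog. The catalog contents, column names and function names are hypothetical. The sketch relies only on the standard SQL rule that UNION ALL matches columns by position and takes output names from the first branch.

```python
from typing import Dict, List

# Hypothetical catalog of known tables and their columns, in declared order.
CATALOG: Dict[str, List[str]] = {
    "foo": ["id", "amount"],
    "bar": ["order_id", "total"],
}


def expand_star(table: str, catalog: Dict[str, List[str]]) -> List[str]:
    """Return the columns that SELECT * FROM <table> resolves to."""
    if table not in catalog:
        raise KeyError(f"unknown table: {table}")
    return list(catalog[table])


def union_all_lineage(
    tables: List[str], catalog: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Map each output column of a UNION ALL of SELECT * branches upstream."""
    branches = [expand_star(table, catalog) for table in tables]
    width = len(branches[0])
    if any(len(branch) != width for branch in branches):
        raise ValueError("UNION ALL branches must have the same number of columns")
    lineage: Dict[str, List[str]] = {}
    for position, output_name in enumerate(branches[0]):
        # Columns are matched by position, not by name.
        lineage[output_name] = [
            f"{table}.{branch[position]}" for table, branch in zip(tables, branches)
        ]
    return lineage


LINEAGE = union_all_lineage(["foo", "bar"], CATALOG)
```

Without the catalog, the analysis can only record that the output depends on "all columns" of foo and bar. With it, each output column can be traced to specific upstream columns.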