Build your own query language

While building a data analytics product that provided users a way to do EDA (exploratory data analysis), I soon realized that building standard apis was not going to scale well. Our platform had billions of data points with cardinality in the millions. Designing standard apis would just be too limiting to do meaningful analyses. Also, using them would be too slow from a UX perspective, as you would need to do a lot of clicking on the pages.

The goal was simple, we want users to be able to render many charts in different ways without having to click too much on the pages.

One of my colleagues had an idea to build something similar to what bloomberg terminal does. Bloomberg terminal allows you to query financial data using their own DSL (domain specific language). They even have a special keyboard to make you faster.

DSL was a perfect fit. We defined a small language on how to query the data you are interested in, and kept it extremely simple as our users were not primarily engineers. We got flexibility and speed of use, while retaining simplicity by not overdoing the DSL (clients might disagree).

On the other hand, we needed to be very careful to define some sane limits for the users, as they essentially choose how they are using our db. Even though you think you are prepared for it, you get surprised in the end.

Remember to keep the DSL simple, otherwise people might reject trying it. Our DSL was composed of the following base components:

  1. Delimiter -> filtering key
  2. Ticker -> filtering value
  3. Function -> value you are interested in
  4. Timeframe
[TICKER1, TICKER2]DELIMITER1 FUNCTION1, FUNCTION2 30D

would be translated to something like this (although we didn't use a sql
flavoured query language in our db)

SELECT function1, function2
WHERE delimiter1 IN (ticker1, ticker2)
AND created_at > now() - 30 days
ORDER BY created_at DESC

Since this was a crypto data analytics platform, best to give examples in that domain.

Get the number of daily active users for the specified chains in the past 90
days:
[ARB, AVAX, CRO, ETH, FANTOM, OPT, POLYGON, TRON]C DAU 90D

What is the 30-day volatility of all assets on ethereum chain in the stables
sector in the past 90 days:
[ETH]C [STB]SEC [*]A VOL_30D 90D

What is the total value locked on curve app on arbitrum and optimism chains in
the past 90 days:
[ARB, OPT]C [CRV]APP TVL 90D

Our DSL had additional components, like:

  1. Operators
  2. Aggregations
  3. Time bucketing
  4. ...

So you could do:

Transaction count percentage change for all assets in the stables sector:
[STB]SEC [*]A TXC% 90D

Price ratio of ether and ether derivatives:
[ETH]A P / [wETH, sETH, stETH, wstETH, frxETH]A P 90D

Average liquidity of usdc for the specified apps, averaged by chain (1 line per
chain)
[UNIV3, UNIV2, BALV1, BALV2, CRV]APP [USDC]A L AVG[C] 90D
example-chart
Example chart

With this approach, you could get beautiful charts pretty quickly, and iterate over them quickly. Query building was quite interesting, I would recommend starting with:

  1. Defining your BNF grammar
  2. Parsing it into internal structures
  3. Building queries from those structures (builder pattern does wonders here)

I also built an llm integration in the early days of chatgpt that would transform natural language text to our dsl. Unfortunately it never went live.

The product was shut down, but if you're interested in what was built, you can check the docs and the official blog.