COLUMBIA UNIVERSITY COMS 6113

Important Dates

Percentages are of your total class grade.

Overview

The major portion of your grade is based on the research project. Students will organize into teams and work on a research project, which should take about 3-4 weeks to complete. Some possible ideas are described below.

Teams should consist of 2-3 people (we may adjust this depending on the size of the class). In addition, if you have a project in mind, please discuss it with Eugene well ahead of time. We have also included a list of possible projects at the end of this document although you are not required to choose from these.

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research – e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.

Choosing a research problem is very difficult, especially if you have not done so before. You may end up thinking of, and discarding many possibilities before finding the project you ultimately work on. Have a fuzzy idea? Want some feedback or help brainstorming a project? Come to office hours and/or recitation. The staff are all here to help!

Proposal

Your research proposal will contain an overview of the research problem, your hypothesis, first pass at related work, a description of how you plan to complete the project, and metrics to decide if it worked.

We have set up a template for your proposal on Overleaf. Clone it into your team’s account to edit it. Make sure to change the title and author names, and include your team members’ UNIs.

Submission

  1. Use the proposal template on Overleaf
  2. Click here to submit

Paper Draft

You will submit a draft of your paper that should be between 4 and 6 pages. Please use the Overleaf report template to get started. It already contains a scaffold of sections and suggestions for what to include in each section. Your report is not beholden to these sections, so treat the template as a starting point.

For the draft, you should have a fleshed out introduction, related work, and technical overview. You should have a clearer idea of the technical details than from the proposal, but need not have implemented it yet. You do not need to have run experiments yet, but should sketch out the hypotheses and the potential experiment setup (which may change). If you have preliminary findings, that’s great! Please include those.

In short, I expect that you have a much clearer idea about the problem and how it can be solved. Most of the technical details and relevant work should be clear, but you may not have implemented it yet.

Tips:

Submission

  1. Use the report template on Overleaf
  2. Click here to submit your 4 – 6 page draft

Project Showcase

Your team will prepare and present a project poster at the end-of-course showcase session. This gives you an opportunity to give a short presentation and demo of your work and show what you have accomplished in the class!

Your presentation should be polished. Since there is still time until the final report, you are encouraged to also discuss ideas or challenges you are still considering.

Since you are presenting to your peers as well, make sure you give your colleagues enough context to understand your ideas. In addition to what you did, help your colleagues understand why you made your specific choices, and provide examples. It’s better to make sure the audience learns a few specific ideas than try to say everything. Come to office hours or contact the staff if you would like feedback.

Overall logistics

Your presentation should cover (in content, not necessarily one slide for each point)

Submission

Report/Camera Ready

You will revise your draft and submit a conference-style report on your project, with a maximum length of 12 pages. Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week for the writing, and make sure you proofread and edit carefully!

Submit here. Name your file UNI_UNI.pdf

Project Types

There are two broad categories of projects, the Reproduce and Extend Project, and the Research Project.

Reproduce and Extend

The goal of this type of project is to deeply understand how an existing problem is solved, and then slightly improve it. Once you pick a problem (basically, pick anything we mentioned in class, or any component of a system), read the literature to understand the different ways it has been solved. This should require doing a literature search by using Google Scholar, surveys, and following citations. You should expect to read 5-10 or more papers, and to summarize them in your related work section.

Since the best way to understand something is to implement it, you will then pick the best-in-class approach from your readings and implement it in a system. For a database algorithm or component, DataBass may be sufficient, or you may choose some other system. You will then benchmark your implementation and report on the results.

After coding, debugging, benchmarking, debugging, laughing, crying, debugging the approach, you should be intimately familiar with the problem. At this point, introduce an improvement. This can be a special case for a particular class of queries or data distribution, or exploiting a hardware optimization, or leveraging some special property of a use case. Once you implement the improvement, update your benchmarks and explain the results.

The final report should follow the same structure. Naturally, the introduction will focus less on novelty and more on how the system you modified did not already take advantage of your improvement, and on the motivation behind it. In contrast, your related work will be a detailed history and account of how this problem has been solved in the past, along with the pros and cons of each approach.

Research Project

The goal of this type of project is to identify a new problem, propose an algorithmic solution, and evaluate and report the findings. Its primary difference from the previous type of project is that it starts from a novel problem and seeks to solve it using whatever means make sense, whereas Reproduce and Extend Projects start with an established problem and solution.

There are many sources of potential research projects. Here are some ideas:

The following are open source DBMSes that you can use to implement your projects:

Project Suggestions

Open Ended Ideas

In-DBMS ML: extend a DBMS to run statistical libraries/ML models/data analysis operations. Evaluate against using a library external to the DBMS. See projects like MadLib, PeekingDuck, MindsDB, Google BigQuery ML.

Predict Query Results: given database statistics and a query, can a model predict the query results? For what databases and classes of queries is this possible?
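One concrete instance of this question: predicting the result of a COUNT(*) range query from column statistics alone. The sketch below does this with a simple equi-width histogram; the histogram format and names are illustrative assumptions, not any particular DBMS's statistics API.

```python
# Toy sketch: estimate a COUNT(*) range query from per-column statistics
# alone, without touching the base data.

def estimate_count(hist, total_rows, lo, hi):
    """hist: list of (bucket_lo, bucket_hi, frac) buckets; fracs sum to 1.
    Assumes values are uniformly distributed within each bucket."""
    est = 0.0
    for b_lo, b_hi, frac in hist:
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        width = b_hi - b_lo
        if width > 0:
            est += frac * (overlap / width)
    return round(total_rows * est)

# Column "age": uniform over [0, 100) in four buckets; 1000 rows total.
hist = [(0, 25, 0.25), (25, 50, 0.25), (50, 75, 0.25), (75, 100, 0.25)]
print(estimate_count(hist, 1000, 18, 30))  # 120
```

The interesting research version goes beyond counts: can a learned model predict the actual result tuples of richer query classes, and for which data distributions does the prediction degrade?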

Can ML Do Data Analysis? Data cleaning, preparation, and augmentation is one of the major challenges in data analysis and machine learning. Dive into one task (e.g., data transform, value imputation, error detection, etc) and evaluate the efficacy of large language models across a variety of dataset domains. See Can Foundation Models Wrangle Your Data? for inspiration.
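A minimal evaluation harness for one such task (value imputation) might look like the sketch below. The model here is a lookup-table stub standing in for an LLM call; all names and data are illustrative assumptions.

```python
# Harness sketch for evaluating value imputation: hide the target column of
# each row, ask the model to fill it in, and measure exact-match accuracy.

def evaluate_imputation(rows, target_col, model):
    correct = 0
    for row in rows:
        truth = row[target_col]
        masked = {k: v for k, v in row.items() if k != target_col}
        if model(masked, target_col) == truth:
            correct += 1
    return correct / len(rows)

# Stub standing in for an LLM call: guesses the country from the city field.
CITY_TO_COUNTRY = {"Paris": "France", "Tokyo": "Japan"}
model = lambda row, col: CITY_TO_COUNTRY.get(row.get("city"))

rows = [{"city": "Paris", "country": "France"},
        {"city": "Tokyo", "country": "Japan"},
        {"city": "Lima",  "country": "Peru"}]
print(evaluate_imputation(rows, "country", model))  # 2/3
```

The project would replace the stub with a real LLM and run this loop across datasets from several domains, reporting where the model succeeds and fails.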

Learned Webviews for databases. One perspective of most websites is that they are simply materialized views – Amazon’s product page issues a query for a given product, and developers manually determine a good layout to render the query result. Is it possible, given a database, to automatically generate a web view for a user’s query? What if the user says “make the results look like ”.

ML for Systems: different variations of RL are used to optimize system components; however, the signals that these components have access to are too low level. For instance, a learned index uses the sequence of reads and writes to adjust its internal layout. Instead, it makes sense to use semantics from higher layers in the system (say, the buffer manager or query structure). Show that using such “higher layer” hints is effective for optimizing learned components and propose an API that system developers might use to register such hints.
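To make the idea concrete, here is a hypothetical sketch of what such a hint-registration API could look like: higher layers publish semantic hints (e.g., an upcoming range scan) that a learned component subscribes to, instead of inferring intent from low-level reads. All class and method names here are assumptions for illustration, not an existing system's API.

```python
# Hypothetical hint-registration API: higher layers (optimizer, buffer
# manager) push semantic hints down to learned components.

class HintBus:
    def __init__(self):
        self._subscribers = {}

    def register(self, kind, callback):
        # A learned component subscribes to a kind of hint.
        self._subscribers.setdefault(kind, []).append(callback)

    def publish(self, kind, payload):
        # A higher layer announces a semantic event.
        for cb in self._subscribers.get(kind, []):
            cb(payload)

class LearnedIndex:
    def __init__(self, bus):
        self.hot_ranges = []
        bus.register("scan_range", self.on_scan_hint)

    def on_scan_hint(self, rng):
        # A real index would retrain or re-layout; here we just record it.
        self.hot_ranges.append(rng)

bus = HintBus()
idx = LearnedIndex(bus)
# The query layer announces an upcoming range scan, rather than letting the
# index infer it from the stream of individual reads.
bus.publish("scan_range", (100, 200))
print(idx.hot_ranges)  # [(100, 200)]
```

The project would then measure whether reacting to such hints beats inferring the same information from low-level access traces.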

Progressive Cubes: Progressive encoding transforms a data item into a sequence of bytes where any prefix is an approximate result. JPEG is an example of progressively encoded data: a few bytes show a low-quality image, and adding a few more bytes improves the quality. Similarly, a breadth-first ordering of a secondary index is a progressive encoding – larger prefixes correspond to a deeper tree that is more effective at filtering. Is it possible to compute and return a progressively encoded data cube, where larger prefixes correspond to finer granularities for the useful dimensions? In other words, can you devise an algorithm that returns a progressively encoded data cube in less time than computing the cube and then encoding it?
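A toy version of the idea, assuming two numeric dimensions and simple equi-width binning (both are illustrative assumptions): emit the cube at the coarsest granularity first, so consuming any prefix of the stream yields a usable approximation.

```python
# Sketch of a progressive data-cube stream: coarse granularities first, so
# any prefix of the stream is an approximate cube. Binning is illustrative.

from collections import Counter

def progressive_cube(points, levels):
    """points: (x, y) pairs in [0, 1); levels: bin counts, coarse to fine.
    Yields (bins, counter) chunks; a prefix of the stream is a coarse cube."""
    for bins in levels:
        cube = Counter((int(x * bins), int(y * bins)) for x, y in points)
        yield bins, cube

points = [(0.1, 0.2), (0.15, 0.22), (0.8, 0.9)]
stream = list(progressive_cube(points, levels=[2, 4, 8]))
# A prefix of length 1 is a 2x2 cube; the full stream refines it to 8x8.
print(stream[0])  # (2, Counter({(0, 0): 2, (1, 1): 1}))
```

A real progressive encoding would emit deltas between levels rather than recomputing each level from scratch; the research question is whether those deltas can be produced faster than building the full cube and encoding it afterward.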

Robust Cardinality Estimation Benchmark. One of the challenges with benchmarking cardinality estimation techniques is that it’s easy to overfit to a dataset/query workload and do well. It’s also hard to understand end-to-end performance implications. A potential strategy is to use data integration to construct a robust and large benchmark out of many existing benchmarks. The idea is to take existing benchmarks A and B, and use data integration to translate the queries for benchmark A to run on the data for benchmark B. Data integration itself is ambiguous, so it may generate K different schema mappings. Thus, given N benchmarks and K mappings per pair, we could create on the order of K·N² dataset-query combinations. Use this idea to evaluate a range of cardinality estimation techniques.
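Counting the combinations this strategy produces: each ordered pair of benchmarks (queries from one, data from another) can be instantiated once per candidate schema mapping. The benchmark names below are illustrative.

```python
# Enumerate the dataset-query combinations the mix-and-match strategy yields:
# ordered pairs of benchmarks, times K candidate schema mappings per pair.

from itertools import permutations

def enumerate_combos(benchmarks, mappings_per_pair):
    combos = []
    for queries_from, data_from in permutations(benchmarks, 2):
        for m in range(mappings_per_pair):
            combos.append((queries_from, data_from, m))
    return combos

combos = enumerate_combos(["JOB", "TPC-H", "DSB"], mappings_per_pair=2)
print(len(combos))  # 6 ordered pairs x 2 mappings = 12
```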

Build on Existing Projects

The following build on existing projects in the Database Research Group.

Jade: Physical Visualization Design

Interactive visualization interfaces are critical in data analysis. To ensure the interfaces are responsive in the face of large and growing data sizes, developers need to make architectural and systems optimization decisions. Jade is an ongoing project that automatically generates an optimized database backend for interfaces to guarantee interactive latency constraints. Under the hood, Jade takes a specification of the interface, automatically picks the optimal data structures, and decides on data placement to satisfy the memory and latency constraints. Here are two research directions.

  1. For interactive visualization interfaces (one example: oil price), the backend SQL query workload can be optimized by precomputing different data structures (e.g., B-tree, hash index, R-tree, cubes, etc.) to shorten query latency. A key challenge is that a particular interface query may have multiple logically compatible data structures with different memory occupancies and query latencies. This phenomenon is very common in cubes/data tiles. This project will study when a specific query workload over a database can be served by some or all of these data structures (generic data cube, imMens, Falcon, Hashedcubes, Nanocubes) and how to decide which data structure to use. If interested, feel free to contact Yiru at yiru.chen@columbia.edu.

  2. We are implementing a query execution framework that spans the database, server, and browser, enabling free placement of parts of the query plan, and are adding different data structures to suit different interfaces and queries. For a course project, come up with some interface use cases that might plausibly decide between some of these data structures, implement them in our execution framework, compare with baselines, and profile their performance to determine when they are suitable. If interested, contact Jeff at jat2164@columbia.edu. Note: project 2 is implemented in Rust, which can be challenging to learn.
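For the first direction, the selection problem can be sketched as a tiny cost model: pick the fastest candidate structure that fits the memory and latency budgets. The structure names and cost numbers below are made up for illustration.

```python
# Toy cost model for picking a backend data structure per interface query,
# under memory and latency budgets. Costs are hypothetical placeholders.

CANDIDATES = {
    # name: (memory_mb, latency_ms) for a hypothetical range-count query
    "data_cube":  (512, 0.5),
    "btree":      (64, 8.0),
    "hash_index": (32, 50.0),  # poor range support; stand-in cost here
}

def pick(memory_budget_mb, latency_budget_ms):
    feasible = [(name, mem, lat)
                for name, (mem, lat) in CANDIDATES.items()
                if mem <= memory_budget_mb and lat <= latency_budget_ms]
    # Prefer the fastest structure that fits within the budgets.
    return min(feasible, key=lambda t: t[2])[0] if feasible else None

print(pick(memory_budget_mb=128, latency_budget_ms=20))  # btree
```

The research versions of this problem add the hard parts: deciding which structures are logically compatible with a given interface query, and estimating the memory/latency numbers rather than assuming them.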

JoinBoost Projects

JoinBoost is an In-DBMS ML project that scales gradient boosted trees to massive datasets and normalized schemas, and can run in any DBMS. See the GitHub repo for details. If interested, contact Zach at zh2408@columbia.edu.

  1. Cloud Database Evaluation for JoinBoost: JoinBoost is currently designed for local databases (e.g., DuckDB). However, cloud databases often have higher latency but better concurrency. This may be suitable for random forests, where all trees can be trained together. It is worth evaluating the use of JoinBoost in cloud databases and implementing any necessary optimizations.

  2. Automatic Feature Augmentation for JoinBoost: Feature augmentation is a common technique in data science (e.g., upgini). JoinBoost scales well with the number of features, so it is worth exploring the possibility of extending it to automatically augment data and evaluating the performance improvements on Kaggle competitions.

  3. Interpretability of JoinBoost: Explainable boosting machines (EBMs) are popular when human understanding of the machine learning model is necessary. It is worth studying the methods used in EBMs and implementing their integration into JoinBoost.

Fast Lineage in Fast Analytical DBMSes

SmokedDuck uses a program analysis framework to analyze data intensive programs and recommend instrumentation points in the code that are sufficient to reconstruct tuple-level lineage. These reduce the run-time overhead of capturing lineage from >1000x to ~5%, making lineage practical for real applications. SmokedDuck is implemented on top of DuckDB. If interested, contact Haneen at ham2156@columbia.edu.

  1. Help integrate the analysis framework with LLVM to automatically analyze programs
  2. Apply the techniques to another vectorized data system such as DataFusion, MonetDB, or Velox.
  3. Evaluate the trade-off and synergies between logical query rewrites for provenance and physical instrumentation in streaming dataflow systems like differential dataflow/materialized. You can get pretty far by simply performing logical rewrites of the dataflow operators without modifying the implementation of the dataflow system.

FaDE is a compiler that uses lineage for a query to generate efficient code to quickly evaluate deletion interventions (“if we removed users with age < 18, how would the result change?”). The goal is to evaluate millions of interventions per second. This is useful (among many use cases) for what-if analysis and explaining errors in query results.
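The core trick can be illustrated in a few lines: if lineage records each input tuple's contribution to an aggregate, a deletion intervention becomes a subtraction rather than a re-execution. This sketch is only illustrative and does not reflect FaDE's actual interface.

```python
# Illustrative deletion intervention via lineage: answer "what if we removed
# users with age < 18?" without re-running the query.

users = [{"id": 1, "age": 15, "spend": 10},
         {"id": 2, "age": 30, "spend": 40},
         {"id": 3, "age": 17, "spend": 5}]

# Query: SELECT SUM(spend) FROM users. Lineage maps each output contribution
# back to the input tuple that produced it.
total = sum(u["spend"] for u in users)                  # 55
contributions = {u["id"]: u["spend"] for u in users}    # tuple-level lineage

def intervene(total, deleted_ids):
    """Subtract deleted tuples' contributions instead of re-executing."""
    return total - sum(contributions[i] for i in deleted_ids)

under_18 = {u["id"] for u in users if u["age"] < 18}
print(intervene(total, under_18))  # 40
```

Evaluating millions of interventions per second then amounts to compiling this subtraction into tight code over the captured lineage, rather than interpreting it per intervention.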