COLUMBIA UNIVERSITY COMS 6113

Important Dates

Percentages are of your total class grade.

Overview

Most of your grade is based on the research project. You will work in teams of 1-3 students on a project that should take about 3-4 weeks to complete. Some possible ideas are described at the end of this document.

Teams should consist of 1-3 people. If you already have a project in mind, please describe it briefly (1–2 sentences). A list of possible projects is included at the end of this document, although you are not required to choose from it.

Good class projects can vary dramatically in complexity, scope, and topic. The only requirement is that they be related to something we have studied in this class and that they contain some element of research – e.g., that you do more than simply engineer a piece of software that someone else has described or architected. To help you determine if your idea is of reasonable scope, we will arrange to meet with each group several times throughout the semester.

Prospectus

Your research prospectus will contain an overview of the research problem, your hypothesis, a first pass at related work, a description of how you plan to complete the project, and the metrics you will use to decide whether it worked.

Your prospectus should follow the example:

Submission

  1. Rename your prospectus file using the following format, with last names in alphabetical order: prospectus_<UNI>_.._<UNIn>.pdf
  2. Click here to upload the file by 2/14 11:59PM EST
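
As a rough illustration of the naming convention above, the following Python sketch builds the filename for a team. It assumes that UNIs are joined with underscores and ordered by the members' last names, which is our reading of the format string; the function name, the names, and the UNIs are made up for the example.

```python
def prospectus_filename(team):
    """Build the prospectus filename for a team.

    team: list of (last_name, uni) pairs. All values used below are
    hypothetical; the ordering rule (alphabetical by last name) is an
    assumption based on the submission instructions.
    """
    # Sort members alphabetically by last name, ignoring case.
    ordered = sorted(team, key=lambda member: member[0].lower())
    # Join the UNIs with underscores, following the stated pattern.
    unis = "_".join(uni for _, uni in ordered)
    return f"prospectus_{unis}.pdf"


example_team = [("Zhang", "zz9999"), ("Adams", "aa1111")]
print(prospectus_filename(example_team))  # prospectus_aa1111_zz9999.pdf
```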

Mock PC Meeting and Paper Draft

We will spend two class sessions running a mock program committee (PC) meeting. The timeline for this will be as follows:

The Paper Draft

Your paper should be no more than 10 pages, and include at least the following sections:

  1. Introduction: motivate the problem, summarize related work, and declare your crisp hypothesis (or hypotheses). This should be fully present.
  2. Related Work: describe the state of the art in the most relevant research areas to your project. This should be fully present.
  3. Technical Overview: outline the technical approach you are taking so that the reader has an intuition about the solution. This should be fully present.
  4. Technical Details: considerably more technical details of your project, and details on what has been implemented. The details should be mostly complete, but may not be implemented yet.
  5. Experiments: describe the experimental setup as we have gone over in class. You may not have run experiments yet. If you have, feel free to include them.

In short, I expect that you have a much clearer idea about the problem and how it can be solved. Most of the technical details and relevant work should be clear, but you may not have implemented it yet.

Submission

Paper Reviews

Paper reviews address the same points as the roles that we have been focusing on. They differ in that you should identify concrete strengths of the paper as well as concrete points that could be improved. Beyond that, study suggestions on how to review, and NOT review, papers:

PC Meeting

Preparation

For each paper, we will choose one or more of the reviewers as the “Shepherd”. Their job is to:

PC Meeting Format

Paper Discussion Format

Shepherd Duties after PC Meeting

Post-PC Meeting Reflection (Due Sat 4/6 11:59PM)

Use this form to submit

Reflect on your own reviewing

  1. For each paper you reviewed:
    • What are the top 3 or more ways in which your review is similar to the other reviews?
    • What are the top 3 or more ways in which your review is different from the other reviews?
  2. Overall, what are takeaways or lessons from this process that you can apply when you write reviews in the future?

Submit a reflection of your project

Report/Camera Ready (5/10 11:59PM EST)

You will prepare a conference-style report on your project with a maximum length of 12 pages (10 pt font or larger, one or two columns, 1 inch margins, single or double spaced; more is not better). Your report should expand upon your prospectus: introduce and motivate the problem your project addresses, describe related work in the area, discuss the elements of your solution, and present results that measure the behavior, performance, or functionality of your system (with comparisons to other related systems as appropriate).

Because this report is the primary deliverable upon which you will be graded, do not treat it as an afterthought. Plan to leave at least a week to do the writing, and make sure you proofread and edit carefully!

Submission

Project Suggestions

The following are examples of possible projects – they are by no means a complete list and you are free to select your own projects. In fact, a common source of ideas is to take your experience from another domain, and combine it with databases/data management. Projects often come in several flavors:

  1. Make DataBass better: extend DataBass in a significant way, and evaluate it against other systems. For instance, support DSM/PAX, distributed execution, LLVM compilation, lineage, etc. Code quality matters for this option.
  2. Research project: model an unsolved problem, propose an algorithmic solution, evaluate it, and report your findings.
  3. Win: pick an existing useful application and a well-recognized metric (latency, prediction, etc.) and win against the state of the art.
  4. Break and fix: implement a state-of-the-art algorithm on real data, show that it doesn’t actually work (results are poor, it’s slow, etc.), and make it work.
  5. Evaluate: there are many options out there, it’s not clear which ones are actually best, and under what conditions. Run a bake-off and evaluate.
  6. Fill a gap: think about something useful that should be easily doable, but is painful or impossible with current state of the art. Fill that gap.

Precision Interfaces

Precision Interfaces analyzes query logs and generates custom interaction components from them. The goal is to scalably generate dozens or hundreds of custom interactive analysis interfaces for any analysis found in a log.
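
To make the idea of deriving interface components from a query log more concrete, here is a toy Python sketch. It is not the Precision Interfaces algorithm: it simply assumes that queries differing only in their literals belong to the same analysis, and proposes a slider for numeric literals that vary and a dropdown for string literals that vary. The function name, the regular expression, and the sample log are all illustrative assumptions.

```python
import re
from collections import defaultdict

# Matches single-quoted strings or standalone numbers (a simplification of SQL literals).
LITERAL = re.compile(r"'[^']*'|\b\d+(?:\.\d+)?\b")


def propose_widgets(query_log):
    """Group queries by template and propose one widget per varying literal."""
    values_by_template = defaultdict(list)
    for query in query_log:
        literals = LITERAL.findall(query)
        template = LITERAL.sub("?", query)  # replace each literal with a placeholder
        values_by_template[template].append(tuple(literals))

    widgets = []
    for template, rows in values_by_template.items():
        # zip(*rows) yields, for each placeholder, its values across all queries.
        for position, column in enumerate(zip(*rows)):
            distinct = sorted(set(column))
            if len(distinct) < 2:
                continue  # the literal never changes, so no widget is needed
            if all(not value.startswith("'") for value in distinct):
                numbers = [float(value) for value in distinct]
                widgets.append((template, position, "slider", (min(numbers), max(numbers))))
            else:
                widgets.append((template, position, "dropdown", distinct))
    return widgets


log = [
    "SELECT * FROM sales WHERE year = 2019 AND region = 'east'",
    "SELECT * FROM sales WHERE year = 2020 AND region = 'east'",
    "SELECT * FROM sales WHERE year = 2021 AND region = 'west'",
]
for widget in propose_widgets(log):
    print(widget)
```

A project in this area would replace the regular-expression templating with a proper analysis of query structure, which is where the research questions lie.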

Deep Neural Inspection

DeepBase is a system to perform deep neural inspection: it extracts hidden unit activations (or other types of behaviors) and computes the statistical relationships with user-specified hypotheses.
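
The following Python sketch illustrates the general idea on toy data; it is not DeepBase's API. It assumes that activations have already been extracted as one list of floats per hidden unit (one value per input symbol), that a hypothesis is a function from an input symbol to a number, and that Pearson correlation is the statistical measure. All names and values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def inspect(activations, hypotheses, inputs):
    """Score every (hidden unit, hypothesis) pair by correlation.

    activations: dict mapping unit name -> list of floats, one per input symbol
    hypotheses:  dict mapping hypothesis name -> function(symbol) -> float
    """
    scores = {}
    for hyp_name, hyp in hypotheses.items():
        hyp_values = [hyp(symbol) for symbol in inputs]
        for unit, acts in activations.items():
            scores[(unit, hyp_name)] = pearson(acts, hyp_values)
    return scores


# Toy data: unit_0 fires on whitespace, unit_1 does not track it at all.
inputs = list("a (b) c")
activations = {
    "unit_0": [0.1, 0.9, 0.1, 0.1, 0.1, 0.9, 0.1],
    "unit_1": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6, 0.5],
}
hypotheses = {"is_space": lambda ch: 1.0 if ch == " " else 0.0}

scores = inspect(activations, hypotheses, inputs)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```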

Lineage

Smoke is the fastest lineage-enabled database engine. It captures the relationships between output and input records as efficient lineage indexes. It turns out that these indexes can be used to express and speed up interactive applications such as visualizations. Extend or use it in interesting ways.
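
As a minimal sketch of what a lineage index can look like (not Smoke's actual implementation), the Python code below runs a group-by sum and records, as a side effect, which input rows contributed to each output group (backward lineage) and which output group each input row fed into (forward lineage). The function name, column names, and data are illustrative assumptions.

```python
from collections import defaultdict


def groupby_sum_with_lineage(rows, key, value):
    """Group-by sum that also builds backward and forward lineage indexes."""
    totals = {}
    backward = defaultdict(list)  # output group -> positions of contributing input rows
    forward = []                  # input row position -> output group
    for rid, row in enumerate(rows):
        group = row[key]
        totals[group] = totals.get(group, 0) + row[value]
        backward[group].append(rid)
        forward.append(group)
    return totals, dict(backward), forward


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 5},
    {"region": "east", "sales": 7},
]
totals, backward, forward = groupby_sum_with_lineage(rows, "region", "sales")

# Interaction example: a user clicks the "east" bar in a bar chart of totals.
# The backward index returns the underlying rows without rescanning the input.
selected = [rows[rid] for rid in backward["east"]]
print(totals)
print(selected)
```

The drill-down at the end is the kind of interactive operation a lineage index can serve directly: the lookup cost depends on the size of the selected group rather than the size of the input table.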

New Querying Interfaces

Scalable image databases are on the horizon. However, a major limitation is that the query interface is incredibly impoverished. How do you specify that you want to find red cars that move along a trajectory? Or to look for relationships between two objects over time? Certainly not by writing SQL-like text queries. The challenge is that video is fundamentally 3D, but query interfaces are 1D.

In-Network Query Processing

Contact Arpit Gupta if interested: glex.qsd@gmail.com

To keep the networks running, network operators need to monitor a wide range of network activities. For example, they need to concurrently detect whether the network is under attack and also determine whether there is a device failure in the network. This involves extracting multiple features from the traffic data and combining them to infer network events in real time.

Sonata (SIGCOMM’18) is an expressive and scalable telemetry system that coordinates joint collection and analysis of network traffic. Sonata provides a declarative interface to express a wide range of common telemetry tasks as dataflow queries. To scale execution, Sonata partitions each query across the programmable switch (e.g., Barefoot Tofino) and the stream processor—offloading as much data processing as possible to the switch. To optimize the use of limited switch memory, Sonata dynamically refines each query over time to ensure that available resources focus only on traffic that satisfies the query.
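
The plain-Python sketch below illustrates the two ideas in this paragraph on a toy packet list; it does not use Sonata's API or run on a switch. The telemetry task (find destinations that receive more than a threshold number of TCP SYN packets) is written as a filter, map, and reduce pipeline, and refinement is imitated by first counting at a coarse prefix level and then counting individual addresses only for traffic inside the prefixes that passed. In a real system refinement would proceed over successive time windows; here the same packets are rescanned for simplicity. The field names, threshold, and addresses are made up.

```python
from collections import Counter

SYN = 0x02  # TCP flags value of a pure SYN packet


def prefix(ip, octets):
    """Keep the first `octets` octets of a dotted IPv4 address."""
    return ".".join(ip.split(".")[:octets])


def count_syns(packets, octets):
    """filter(SYN) -> map(dst prefix, 1) -> reduce(sum)."""
    counts = Counter()
    for packet in packets:
        if packet["tcp_flags"] != SYN:                 # filter
            continue
        counts[prefix(packet["dst_ip"], octets)] += 1  # map + reduce
    return counts


def refined_query(packets, threshold):
    """Run the query coarsely, then zoom in only on prefixes that passed."""
    coarse = {p for p, c in count_syns(packets, 1).items() if c > threshold}
    # Any address over the threshold also pushes its prefix over the threshold,
    # so restricting to `coarse` does not drop any answers.
    zoomed = [p for p in packets if prefix(p["dst_ip"], 1) in coarse]
    fine = count_syns(zoomed, 4)
    return {ip: c for ip, c in fine.items() if c > threshold}


packets = (
    [{"dst_ip": "10.0.0.1", "tcp_flags": SYN}] * 5
    + [{"dst_ip": "10.0.0.2", "tcp_flags": SYN}]
    + [{"dst_ip": "20.0.0.1", "tcp_flags": SYN}] * 2
    + [{"dst_ip": "20.0.0.1", "tcp_flags": 0x10}]  # an ACK, ignored by the filter
)
print(refined_query(packets, threshold=3))  # {'10.0.0.1': 5}
```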

Possible directions:

Query-based Graph Visualization

Graphs are fundamentally high dimensional, and generating good graph visualizations is still an unsolved problem. There are plenty of ways to visualize a graph: as a matrix, as a node-link layout (with many, many layout algorithms), as histograms, and so on. Suppose you know what analysis queries (e.g., recursive SQL queries, or a query workload) have been run on the graph. Can those queries be analyzed to recommend the appropriate visualization?

What We Talk About When We Talk About Data

How are data and analyses referred to and described in scientific work? When data is presented as figures or tables, how is it referred to? What are the verbs and nouns? Is there a universal set of ways that figures are described (e.g., in terms of comparisons? in relative terms?)? The answers can serve as evidence for a new data analysis language. Analyze the figures, captions, and surrounding text in Viziometrics and arXiv (arXiv provides LaTeX files).