Rohan Bavishi

PhD Candidate
University of California, Berkeley
Department of Computer Science
Google Scholar

I’ll be starting a new position at Adept AI Labs after summer!

Hi, I’m a fifth-year PhD candidate at the University of California, Berkeley, advised by Koushik Sen. I am a member of the Programming Systems group. My research focuses on designing tools and techniques for improving the productivity of programmers, specifically data scientists. Check out the project descriptions for DataButler, VizSmith, Gauss, and AutoPandas below for a primer on the kinds of problems in this space I’m interested in.

I’ve also researched techniques for automatic program repair during my internships with the PROSE team at MSR and the Software Systems Innovation Group at Fujitsu Research of America. At MSR, we developed techniques for fixing syntax errors in formula languages such as Excel and PowerApps. At Fujitsu, I worked on Phoenix, a system that leverages the rich development history of open-source projects on GitHub to automatically learn generic strategies for repairing static analysis violations, such as those reported by FindBugs. The techniques behind Phoenix were published at FSE 2019.

I obtained my bachelor’s degree in computer science from the Indian Institute of Technology, Kanpur. My undergraduate research was focused on developing more precise bug localization techniques and was published in OOPSLA 2016. Here we combined model-checking techniques with soft invariants learned from regression tests to obtain better precision in reasoning about possibly faulty lines of code.

Feel free to contact me via email or any of the platforms above. Cheers!

DataButler / Datana
Rich Natural-Language Interfaces for Data Science Code using Large Language Models

Recent advances in NLP, specifically the advent of large language models, have revolutionized program synthesis research. These models can produce human-like code given a textual context, which can be either natural language or existing/incomplete code.

While such advancements have opened up many exciting possibilities for using natural language as a modality for synthesis, natural language alone may fall short when it comes to data science. What if a data scientist does not know how to express something in natural language that the models will pick up on? What if they do not know what is possible? What if they do not know what to do in the first place?

We combine code mining techniques and the code summarization ability of language models to build an autocompleting code search engine that allows scientists to use a handful of keywords to explore various possibilities along with previews. It also recommends next steps to perform when starting with a blank slate. Stay tuned for a demo!

Formula Repair for Excel/PowerApps
Combining the best of symbolic enumeration and language models for fast, precise repair

Excel is used by millions of people every day. Maybe you’ve used it too. Have you ever written an Excel formula? If yes, chances are you’ve made some silly mistakes along the way - missing parentheses, forgetting a comma, leaving out an int-to-string transform, etc. You may have also realized that Excel does not necessarily provide the most useful feedback regarding these errors - it does not always point out the error location accurately, and the error message is often misleading. Can we do better?

In collaboration with the PROSE team at Microsoft, I worked on developing a formula repair technology that automatically suggests repairs for a faulty formula. Classical search-based repair techniques enumerate all possible repairs by leveraging the formal language specification. But these techniques often return too many repairs and can be slow. We leverage the power of language models to bias the search towards likely error locations, and also to rank the repairs by how similar they are to human-written formulas. Stay tuned for a release!
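To make the "too many repairs" problem concrete, here is a minimal sketch of naive enumeration. It is not the PROSE technique: a simple parenthesis-balance check stands in for the formal language specification, the formula is a made-up example, and the language-model guidance and ranking are not reproduced.

```python
def is_balanced(formula: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in formula:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def enumerate_repairs(formula: str) -> list[str]:
    """Enumerate every single-parenthesis insertion that balances the formula."""
    repairs: list[str] = []
    for position in range(len(formula) + 1):
        for paren in "()":
            candidate = formula[:position] + paren + formula[position:]
            # Keep each distinct balanced candidate once.
            if is_balanced(candidate) and candidate not in repairs:
                repairs.append(candidate)
    return repairs


broken = "=SUM(A1, MAX(B1, C1)"  # hypothetical formula missing one ')'
repairs = enumerate_repairs(broken)
print(f"{len(repairs)} candidate repairs, for example:")
for candidate in repairs[:3]:
    print("  ", candidate)
```

Even for this short formula, the intended fix `=SUM(A1, MAX(B1, C1))` is only one of many balanced candidates, alongside implausible ones such as `=SUM()A1, MAX(B1, C1)`. Narrowing and ordering such a list is the gap that the search bias and ranking are meant to close.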

Synthesizing Visualization Code from Text Queries

Ever spent hours making plots using matplotlib or other visualization libraries? Such tools, although powerful, present a steep learning curve. Developers often end up searching on StackOverflow, then copying and adapting code from answers. However, this is non-trivial, as understanding visualization code written for another dataset can be difficult and time-consuming.

VizSmith alleviates these issues by accepting both the data to visualize and a text query describing the visualization. VizSmith then searches a database of visualization code snippets automatically mined from Kaggle to find the best fit and returns the produced visualizations along with readable, ready-to-use code. A manuscript is currently under submission. Try out VizSmith below!
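As a rough illustration of the "search a snippet database with a text query" idea, the sketch below ranks snippets by word overlap with the query. This is an assumption for illustration only, not how VizSmith matches queries: the `Snippet` type, the three hand-written entries, and the overlap score are all hypothetical, and the snippet code is stored as text rather than executed.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    description: str  # short text describing what the snippet plots
    code: str         # visualization code, kept as text in this sketch


# Hypothetical hand-written database; a real one would be mined automatically.
SNIPPETS = [
    Snippet("bar chart of counts per category", "df[col].value_counts().plot.bar()"),
    Snippet("histogram of a numeric column", "df[col].plot.hist()"),
    Snippet("scatter plot of two numeric columns", "df.plot.scatter(x=x_col, y=y_col)"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().split())


def best_snippet(query: str, snippets: list[Snippet]) -> Snippet:
    """Return the snippet whose description shares the most words with the query."""
    query_words = tokenize(query)
    return max(snippets, key=lambda s: len(query_words & tokenize(s.description)))


match = best_snippet("scatter plot of height and weight", SNIPPETS)
print(match.code)
```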

Synthesizing Table Transformation Programs by Leveraging User Interaction

Plain I/O examples for synthesizing table transformations (see AutoPandas below) lose out on readily available information, such as computational relationships between input and output. Also, such I/O tables can be cumbersome to provide!

Gauss alleviates this issue by offering a special UI for constructing the I/O example, which allows Gauss to record precise information about the inputs and output. Gauss then employs novel graph-based reasoning to vastly improve synthesis time and to reduce the burden on the user by allowing them to provide partial input-output examples while still returning correct solutions. A manuscript is currently under review. Try out Gauss below!

Synthesizing Table Transformation Programs using Machine Learning

Pandas is a hugely popular Python library for table manipulation. However, its size and complexity can be daunting to beginners.

AutoPandas helps automate coding in Pandas by generating Pandas code given an input table and the output table that should be produced. AutoPandas encodes the input-output tables as a graph and leverages recent advancements in graph neural networks to find the correct Pandas program. See the OOPSLA 2019 paper for details and links to the code. Try out AutoPandas below!
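To show what an input-output specification looks like, here is a minimal sketch that checks a few hand-written candidate programs against an example pair. It is a brute-force stand-in, not the AutoPandas algorithm: the graph encoding and the neural network are not reproduced, and the tables, the candidate list, and the `synthesize` helper are all made up for illustration.

```python
import pandas as pd

# Hypothetical I/O example: total sales per city.
input_table = pd.DataFrame({"city": ["SF", "NYC", "NYC"], "sales": [5, 10, 20]})
output_table = pd.DataFrame({"city": ["NYC", "SF"], "sales": [30, 5]})

# Hand-written candidate programs, keyed by the Pandas code they represent.
CANDIDATES = {
    "df.sort_values('sales')": lambda df: df.sort_values("sales"),
    "df.drop_duplicates('city')": lambda df: df.drop_duplicates("city"),
    "df.groupby('city', as_index=False)['sales'].sum()":
        lambda df: df.groupby("city", as_index=False)["sales"].sum(),
}


def matches(program, inp: pd.DataFrame, expected: pd.DataFrame) -> bool:
    """Run a candidate on a copy of the input and compare with the expected table."""
    try:
        result = program(inp.copy())
    except Exception:
        return False  # a crashing candidate cannot be the answer
    if not isinstance(result, pd.DataFrame):
        return False
    return result.reset_index(drop=True).equals(expected)


def synthesize(inp: pd.DataFrame, expected: pd.DataFrame):
    """Return the code of the first candidate that reproduces the output, if any."""
    for code, program in CANDIDATES.items():
        if matches(program, inp, expected):
            return code
    return None


print(synthesize(input_table, output_table))
```

With only three candidates, checking each one is cheap. The space of real Pandas programs, with all their functions and arguments, is far too large to enumerate this way, which is where a learned model that prioritizes likely candidates becomes useful.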

Learning to Repair Static Analysis Violations by Analyzing OSS

Static analysis tools help catch bugs in programs without having to execute them. Are they actually used? A 2017 Coverity Scan report estimates that 600k out of 1.1 million identified defects over 4600 OSS projects were fixed. However, there is still a large barrier to adoption because the defects have to be manually investigated, confirmed and fixed. Static analysis tools also report a large number of false positives.

Phoenix addresses this problem by mining the commit histories of large open-source projects on GitHub for patches to static analysis violations reported by FindBugs. It then employs a novel program synthesis algorithm to generalize the patches into reusable repair templates, or strategies, which it uses to fix new, unseen violations. Check out the FSE 2019 paper and the demo video below!

Publications
Conference Publications, Tool Papers, Workshop Publications, arXiv, Dissertations, Patents