Recent advances in NLP, specifically the advent of large language models, have revolutionized program synthesis research. These models can generate human-like code given a textual context, which can be either natural language or existing/incomplete code.
While such advancements have opened up many exciting possibilities for using natural language as a modality for synthesis, natural language alone may fall short in data science. What if a data scientist does not know how to phrase a request in a way the model will pick up on? What if they do not know what is possible? What if they do not know what to do in the first place?
We combine code mining techniques and the code summarization ability of language models to build an autocompleting code search engine that lets data scientists use a handful of keywords to explore various possibilities along with previews. It also recommends next steps to perform when starting from a blank slate. Stay tuned for a demo!
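The retrieval step can be sketched as keyword matching against model-generated summaries of mined snippets. The corpus, summaries, and ranking below are purely illustrative stand-ins, not the engine's actual data or algorithm:

```python
# Toy corpus: (summary, code) pairs. In the real system, summaries would be
# produced by a language model over snippets mined from real notebooks.
CORPUS = [
    ("drop duplicate rows from a dataframe", "df.drop_duplicates()"),
    ("fill missing values with zero", "df.fillna(0)"),
    ("group rows and compute the mean", "df.groupby(col).mean()"),
]

def search(keywords):
    """Rank snippets by how many query keywords their summary contains."""
    terms = set(keywords.lower().split())
    scored = [(len(terms & set(summary.split())), summary, code)
              for summary, code in CORPUS]
    return [(summary, code)
            for score, summary, code in sorted(scored, reverse=True)
            if score > 0]

print(search("missing values"))
# → [('fill missing values with zero', 'df.fillna(0)')]
```

Keyword overlap is of course a crude proxy; the point is only that summaries give keywords something meaningful to match against, which raw code often does not.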
Excel is used by millions of people every day. Maybe you’ve used it too. Have you ever written an Excel formula? If yes, chances are you’ve made some silly mistakes along the way - missing parentheses, forgetting a comma, leaving out an int-to-string transform, etc. You may have also realized that Excel does not necessarily provide the most useful feedback on these errors - it does not always point out the error location accurately, and the error message is often misleading. Can we do better?
In collaboration with the PROSE team at Microsoft, I worked on developing a formula repair technology that automatically suggests repairs for a faulty formula. Classical search-based repair techniques enumerate all possible repairs by leveraging the formal language specification. But these techniques often return too many repairs, and can be slow. We leverage the power of language models to bias the search towards likely error locations, and also rank the repairs by how similar they are to human-written formulas. Stay tuned for a release!
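To make the two ingredients concrete, here is a toy sketch of search-based repair with model-guided ranking. The candidate generator, the well-formedness check, and especially `lm_score` (a stand-in for a real language-model likelihood) are all illustrative assumptions, not the actual technology:

```python
def balanced(formula):
    """A minimal well-formedness check: parentheses must balance."""
    depth = 0
    for ch in formula:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:
            return False
    return depth == 0

def candidates(formula, tokens=(")", ",")):
    """Enumerate single-token insertions, keeping only well-formed results."""
    for i in range(len(formula) + 1):
        for tok in tokens:
            fixed = formula[:i] + tok + formula[i:]
            if balanced(fixed):
                yield fixed

def lm_score(formula):
    # Stand-in for a language-model score: human-written formulas usually
    # close the argument list at the very end.
    return formula.endswith(")")

faulty = "=SUM(A1:A5"
repairs = sorted(set(candidates(faulty)), key=lm_score, reverse=True)
print(repairs[0])  # → =SUM(A1:A5)
```

Even in this toy, blind enumeration produces many syntactically valid but nonsensical repairs (e.g. `=SUM()A1:A5`); ranking by plausibility is what surfaces the one a human would write.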
Ever spent hours making plots using matplotlib or other visualization libraries? Such tools, although powerful, present a steep learning curve. Developers often end up searching on StackOverflow and copying and adapting code from answers. However, this is non-trivial, as understanding visualization code in the context of another dataset can be difficult and time-consuming.
VizSmith alleviates these issues by accepting both the data to visualize and a text query describing the visualization. VizSmith then searches a database of visualization code snippets automatically mined from Kaggle to find the best fit, and returns the produced visualizations along with readable, ready-to-use code. A manuscript is currently under submission. Try out VizSmith below!
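The key to reusing a mined snippet on someone else's data is abstracting it over column names so it can be re-instantiated. The sketch below is a simplified assumption about how that step might look; the template is illustrative, not actual mined code:

```python
# A mined matplotlib snippet, abstracted over the column names {x} and {y}.
MINED_SNIPPET = (
    "import matplotlib.pyplot as plt\n"
    "plt.bar(df['{x}'], df['{y}'])\n"
    "plt.xlabel('{x}'); plt.ylabel('{y}')\n"
)

def instantiate(template, column_map):
    """Rebind the snippet's abstract column slots to the user's columns."""
    return template.format(**column_map)

code = instantiate(MINED_SNIPPET, {"x": "country", "y": "gdp"})
print(code)
```

This is why a plain copy-paste from StackOverflow is brittle: the snippet is welded to the original dataset's column names, while an abstracted snippet can be retargeted mechanically.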
Plain I/O examples for synthesizing table transformations (see AutoPandas below) lose out on readily available information, such as computational relationships between the input and output. Also, such I/O tables can be cumbersome to provide!
Gauss alleviates this issue by offering a special UI that the user can use to construct the I/O example, which allows Gauss to record precise information about the inputs and output. Gauss then employs novel graph-based reasoning to vastly improve synthesis time, and reduces the user's burden by letting them provide partial input-output examples while still returning correct solutions. A manuscript is currently under review. Try out Gauss below!
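To illustrate what "partial examples" buy the user, here is a toy check of a candidate program's output against a partial output table in which the user filled in only some cells. The row-of-dicts table encoding and the matching rule are my assumptions for illustration, not Gauss's actual graph representation:

```python
def matches_partial(candidate_rows, partial_rows):
    """A candidate output matches if every cell the user *did* provide
    agrees; cells left as None are unconstrained."""
    if len(candidate_rows) != len(partial_rows):
        return False
    return all(
        want is None or got.get(col) == want
        for got, spec in zip(candidate_rows, partial_rows)
        for col, want in spec.items()
    )

candidate = [{"city": "SF", "total": 10}, {"city": "LA", "total": 7}]
# The user only bothered to fill in one cell per row:
partial = [{"city": "SF", "total": None}, {"city": None, "total": 7}]
print(matches_partial(candidate, partial))  # → True
```

With plain I/O examples the user must compute every output cell by hand; partial matching means a few anchor cells are enough to discriminate between candidate programs.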
Pandas is a hugely popular Python library for table manipulation. However, its size and complexity can be daunting to beginners.
AutoPandas helps automate coding in Pandas by generating Pandas code given an input table and the output table that should be produced. AutoPandas encodes the input-output tables as a graph and leverages recent advancements in graph neural networks to find the correct Pandas program. Check out the OOPSLA 2019 paper, which links to the code. Try out AutoPandas below!
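An example of the kind of input-output pair AutoPandas consumes, together with a one-line pandas program that explains it. This particular task is illustrative (any groupby/pivot/merge-style transformation would do), not taken from the paper's benchmarks:

```python
import pandas as pd

# Input table the user already has:
inp = pd.DataFrame({"team": ["a", "a", "b"], "score": [1, 3, 5]})

# Output table the user wants: one row per team with the summed score.
out = pd.DataFrame({"team": ["a", "b"], "score": [4, 5]})

# A program AutoPandas could synthesize to map inp to out:
prog = inp.groupby("team", as_index=False).sum()

print(prog.equals(out))  # → True
```

Note that the user never writes `groupby` or `sum` - they only provide the two tables, and the synthesizer searches Pandas' large API surface for a program connecting them.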
Static analysis tools help catch bugs in programs without having to execute them. Are they actually used? A 2017 Coverity Scan report estimates that 600k out of 1.1 million identified defects across 4,600 OSS projects were fixed. However, there is still a large barrier to adoption because the defects have to be manually investigated, confirmed, and fixed. Static analysis tools also report a large number of false positives.
Phoenix addresses this problem by mining the commit histories of large open-source projects on GitHub for patches to static analysis violations reported by FindBugs. It then employs a novel program synthesis algorithm to generalize the patches into reusable repair templates, or strategies, which it uses to fix new, unseen violations. Check out the FSE 2019 paper and the demo video below!
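To give a flavor of what a mined strategy looks like when applied, here is a toy template for a FindBugs-style possibly-null dereference: guard the flagged call with a null check. Both the template and its string-level application are illustrative assumptions; Phoenix's actual templates operate over program structure, not raw text:

```python
import re

def apply_null_guard(line, var):
    """Wrap a dereference of `var` on this line in a null check,
    preserving the original indentation."""
    call = re.search(rf"{var}\.\w+\(.*\);", line)
    if call is None:
        return line  # template does not apply here
    indent = line[: len(line) - len(line.lstrip())]
    return (f"{indent}if ({var} != null) {{\n"
            f"{indent}    {call.group(0)}\n"
            f"{indent}}}")

buggy = "    result.close();"
print(apply_null_guard(buggy, "result"))
```

The generalization step is the interesting part: many concrete human patches (different variables, different methods) collapse into one strategy like the above, which then transfers to violations no human has fixed yet.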