Artur Andrzejak received a PhD in computer science from ETH Zurich in 2000 and a habilitation degree from FU Berlin in 2009. He was a postdoctoral researcher at HP Labs Palo Alto from 2001 to 2002 and a researcher at ZIB Berlin from 2003 to 2009. He led the CoreGRID Institute on System Architecture from 2004 to 2006 and served as Deputy Head of the Data Mining Department at I2R Singapore in 2010. Since 2010 he has been a professor at Ruprecht-Karls-University of Heidelberg, where he leads the Parallel and Distributed Systems group. His research interests include the reliability of complex software systems, scalable data analysis, and cloud computing.
Data science is used in a growing number of scenarios by an expanding variety of users, ranging from non-programmers to business analysts to scientists. Some of these scenarios place elevated demands on the software solutions for data management, integration, processing, and analysis. In particular, these solutions have to balance the trade-offs between scalability and reliability as well as between flexibility and ease of use.
In this talk we will discuss selected aspects of engineering software systems and tools for data science in the context of the above-mentioned challenges and scenarios. We will present approaches and tools that assist users in programming scalable data analysis tasks and improve their programming efficiency. We will also present an approach for the accelerated implementation of Domain Specific Languages (DSLs) for data science. DSLs raise the level of abstraction of programming, can support the Domain-Driven Design development process, and facilitate communication with domain experts. In addition to smooth integration with and support for popular frameworks and libraries such as Pandas, scikit-learn, and Apache Spark, our approach supports designing project-specific or ad hoc DSL elements tailored to even small application domains.
In the remainder of the talk we will look at alternative approaches for accelerated programming in the context of data analysis: dataflow programming systems such as KNIME, Orange, LabVIEW, and Simulink, and the recently emerging technologies for program synthesis. We will give a brief overview of the state of the art in these areas and highlight the key challenges.