Semiconductor designers and manufacturers lean heavily on statistical software for successful product development and fabrication. With the complexities of licensing and unpredictable pricing models, many companies are seeking alternatives to SAS software. Often the road to such a migration appears formidable given the many years of code committed to the SAS environment, and suggestions that a migration be attempted are usually met with much trepidation by those who have invested years creating exotic SAS applications. Texas Instruments has just completed a three-year journey to migrate all of its SAS code to open-source statistical solutions, chiefly R. Beyond significant software cost reductions, the effort produced a standardized statistical platform that new engineers will use for statistics, and code that is faster, more scalable, and more user friendly than the previous SAS code.
The primary driver for most companies to migrate from SAS usually centers on the cost of licensing and the complexities of the contract.
In addition to this, R is fast becoming the statistical software of choice in academia. New college hires ramp to contribution much faster when they can keep using the same tools they are accustomed to using.
Being an open-source platform, R has matured into a flexible and comprehensive statistical analysis suite.
Last, but not least, the economic advantages of open-source R are not to be ignored. Business units come under great pressure from superiors when faced with a step-function increase in licensing costs caused by, for example, a change in the source of the data to be analyzed.
Barriers to Break
The path from SAS to R was, for TI, one with several obstacles to overcome. The first major hurdle was a technical one. The size of the data to be analyzed posed a significant challenge that could not be solved using R in the manner to which TI analysts were accustomed.
Hurdle number two was a psychological one. SAS is primarily a coding language used by a handful of highly experienced users focused on performing analyses and generating visualizations of them; the consumers of these analyses and visualizations make up a much larger audience. The use model for R is very much the same: either language lends itself to a large code base that runs automatically, consisting of many program blocks that are modified only when changes are needed. In addition, R is most efficient and powerful when used as an object-oriented language, a paradigm shift for those steeped in years of SAS code generation.
Even though the community of SAS code developers was quite small, across years of usage, much cloning of existing programs occurred. Some would make small changes in existing code to create variations of analyses. This practice of code-cloning burgeoned into a vast junkyard of code that, to the casual observer, made the idea of migrating off SAS appear to be a project comparable to boiling the ocean.
The change from SAS to R at TI was not sudden and dramatic; it is more accurately described as an evolution.
SAS on the mainframe was used primarily to extract data and perform analyses using Base SAS. The solution here was much simpler: TI chose to license WPS (World Programming System) software from the UK, which runs programs written in SAS syntax.
Usage of SAS in the Unix/Linux environment was of a much larger scope. The code base included two large and complex applications as well as an ocean of user-specific programs. The first major barrier, of course, was the technical one of dealing with big data. TI evaluated several solutions, including a proprietary R distribution, but was reluctant to purchase another product. Instead, a contractor was retained who provided a proof of concept for handling big data with R. An elegant solution, this proof of concept showed that the first major technical hurdle could be cleared in a single bound.
At this point an organic approach was taken to begin the migration from SAS to R. Technical experts in the company began tackling the two major applications. Fears about the impossibility of the task began to fade as these analysts began moving the mountain.
Instruction in using R was also put in place, with training classes made available at a local university in addition to in-house training using a contractor with world-class R expertise and knowledge of SAS. Classes in the basics of R were also conducted using experts inside TI. To foster the organic approach, collaboration meetings and tools were put in place to support the new growing community of R users.
After a year or so of the organic approach to the migration, a change in manufacturing locations triggered another licensing cost increase. Faced with rising costs, the organic approach evolved into a hybrid approach that now included a deliberate decision to decrease the use of SAS. During the last six months, all businesses were fully committed to the project and the migration was driven top-down by upper management.
The two major applications were replaced, one with an R solution and the other with an internal data analytics application made up entirely of open-source code.
To address the junkyard of cloned code, a systematic approach was taken to modularize the R code: only a few programs are now needed, and each can be applied to a variety of similar data by reading a data-specific configuration input file.
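One way to picture this configuration-driven pattern is a single generic program whose behavior is set entirely by a small, data-specific configuration. The sketch below is illustrative only, not TI's actual code; the file name, column names, and statistic are hypothetical.

```r
# Sketch of a config-driven analysis program (hypothetical names and fields).
# One generic program serves many similar data sets; only the small
# configuration changes from data source to data source.

run_analysis <- function(config) {
  dat <- read.csv(config$input_file)            # load the data set named in the config
  grouped <- split(dat[[config$value_col]],     # split the value column
                   dat[[config$group_col]])     # by the grouping column
  # Compute the summary statistic requested by the config (mean, median, ...)
  sapply(grouped, match.fun(config$stat))
}

# A data-specific configuration; in practice this would be read from a file.
config <- list(input_file = "lot_measurements.csv",
               group_col  = "lot_id",
               value_col  = "thickness",
               stat       = "mean")

# Write a tiny example data set so the sketch is runnable.
write.csv(data.frame(lot_id    = c("A", "A", "B", "B"),
                     thickness = c(1.0, 3.0, 2.0, 4.0)),
          "lot_measurements.csv", row.names = FALSE)

result <- run_analysis(config)
print(result)   # per-lot mean thickness: A = 2, B = 3
```

With this structure, supporting a new but similar data source means writing a new configuration, not cloning and editing another copy of the program.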
At the inception of the migration effort, analysts were concerned about the performance of R versus SAS on some of TI's big data. Table 1 shows a comparison of several SAS procedures against equivalent R solutions on a large data file of around 27 GB.
So at the outset, the R solution promised to be a superior alternative to the existing SAS solution. Near the end of the project, with so many files of statistical tables and graphs of the analyzed data available to consumers within the company, the needed storage capacity was assessed. Archives of SAS data sets had to be converted to R data sets so they could be accessed and analyzed from R; the conversion yielded more than a 7X reduction in the storage needed for the R data sets.
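A conversion like this can be sketched as a loop over an archive of `.sas7bdat` files, writing each one out in R's native compressed `.rds` format. This is an illustrative sketch, not TI's actual tooling: the directory names are hypothetical, and reading SAS files here assumes the `haven` package is installed.

```r
# Sketch: converting an archive of SAS data sets (.sas7bdat) to R's
# native compressed .rds format. Directory names are hypothetical;
# haven::read_sas() assumes the 'haven' package is available.

convert_sas_archive <- function(sas_dir, rds_dir) {
  dir.create(rds_dir, showWarnings = FALSE)
  for (f in list.files(sas_dir, pattern = "\\.sas7bdat$", full.names = TRUE)) {
    dat <- haven::read_sas(f)                       # read one SAS data set
    out <- file.path(rds_dir, sub("\\.sas7bdat$", ".rds", basename(f)))
    saveRDS(dat, out, compress = "xz")              # write a compressed R data set
  }
}
# convert_sas_archive("archive/sas", "archive/rds")

# .rds files round-trip exactly, and are compressed on disk:
d <- data.frame(wafer = 1:1000, yield = rep(0.95, 1000))
saveRDS(d, "demo.rds", compress = "xz")
identical(readRDS("demo.rds"), d)   # TRUE
```

The on-disk compression of `.rds` (gzip by default, with bzip2 and xz options) is consistent with the kind of storage reduction described above, though the exact ratio depends on the data.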
This article describes a number of valuable takeaways, including the multiple-solution approach that spans both a mainframe and a distributed Linux solution. Such a migration may start as a grassroots movement but can easily evolve into a top-management priority, and the cost-savings opportunities can be anticipated even before it reaches that point. A company that follows a path similar to the one outlined in this article will very likely find the conversion model useful. Keep in mind that establishing a plan to educate company analysts in the new methods is also crucial. There may be a few existing resources on SAS-to-R conversion, but this article is ultimately meant to help readers say, “Hey, we can actually do this!”
Joel Dobson is a senior statistician in the Central Quality Group at Texas Instruments. Clayton Gibbs is a Member of Group Technical Staff at Texas Instruments and a business analyst for engineering IT solutions.