With the proliferation of digital technologies, exacerbated by the chip shortage, there is immense pressure on the semiconductor industry to manufacture products with minimal defects and bring novel innovations to market rapidly. The vast amounts of data produced today create opportunities for the entire industry to maximize production, accelerate innovation, and reduce costs. Materials-related industry participants recognize the need for data collaboration to improve total output, production, research, and quality, but several challenges must be addressed first.

First, individual companies are reluctant to establish an isolated ecosystem given the initial cost and time required. Second, companies usually have disparate data systems, long learning cycles, and lengthy processes for building new capabilities. Finally, no existing solution solves these issues while preserving intellectual property (IP) and ensuring that companies retain full control over their data. A collaborative data ecosystem helps identify data relationships that can be automated, as raw material, finished-good material, and device process data can all be combined. Equally important is optimizing equipment performance and maintenance based on specific parameters and on-site performance, directing the right resources to maximize uptime and minimize costs.

The platform was built for security-conscious customers who need to handle financial data, Personally Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), and even classified government data in a secure and compliant manner. Athinia's strong security posture supports compliance with regulatory requirements across industries and continents by aligning with frameworks such as HIPAA, GDPR, and ITAR.

As the software powers mission-critical operations across major corporations and governments alike, the threat model focuses on defeating attacks by highly resourced, technical, and persistent adversaries. To defeat these adversaries, we take a highly opinionated stance and enforce a high minimum bar of security for all customers.

- When parties are unwilling to share raw process data, they can obfuscate and normalize the data before it is shared.
- The platform offers powerful obfuscation procedures by tokenizing and encrypting sensitive information in the dataset, such as column names and parts identifiers.
- Sophisticated statistical normalization methods are provided, which can be applied to add an additional layer of security to further protect sensitive data while maintaining its usefulness for advanced analytics such as machine learning.

- The Foundry platform’s robust end-to-end security architecture protects intellectual property and ensures customers always stay in control of their data.
- With tailored and granular permissions, customers control who they share data with, how the data can be used, and for how long.
- Use of data is recorded throughout the platform using powerful data lineage and provenance techniques, even if datasets are combined, shared, aggregated, or machine learning is applied.
- The system tracks and monitors who uses the data, how often, and for what purpose.
- Multi-level approval workflows ensure that data sharing follows your company’s data governance framework.
- A configurable governance mode serves unique business needs.

- Our cloud platform’s infrastructure, applications, and operations have been developed to meet and exceed the most rigorous legal and regulatory requirements across multiple industries today, including healthcare and defense.
- Athinia maintains stringent network controls to protect our customers. Fundamental network security principles include Intrusion Detection/Prevention Systems, Data Loss Prevention Systems, and strong encryption of network traffic.
- Detailed audit logs containing user actions, including records of imports, reads, writes, searches, exports, and deletions, are collected and made available for import into a Security Information and Event Management (SIEM) system.

Customers have full ownership of their data. Athinia enables data sharing on normalized and obfuscated data.

Data exists in various forms, sources, and ranges. The process manufacturing datasets and variables attributed to different processes may have significant differences in magnitudes and units of measurement. Hence it is essential that the data be transformed in certain use cases to enable the development of a reliable and accurate model. A simplified example of the data platform between Supplier and Integrated Device Maker (IDM) is illustrated in Figure 1.

Figure 1: Example of an Athinia data pipeline between supplier and IDM after being obfuscated and merged.

In the data pipeline shown in Figure 1, normalization is a critical step for two reasons: protecting data owners' privacy and supporting model building. We implement standard, peer-reviewed techniques to normalize and obfuscate data, providing data security in a collaborative environment while preventing data leakage. In addition, we combine feature engineering techniques into the normalization process to support machine learning model development. It is therefore important to select the appropriate normalization methods and implement the optimal set to balance security and model performance. Solutions to this challenge are discussed in the following sections of this white paper.

Examples of data normalization techniques are shown in Figure 2. The main topics which will be described include: quantization/discretization, scaling, transforming, feature creation, ranking and examples of academic research.

Figure 2: Examples of standard normalization techniques.

Quantization, or discretization, involves grouping the original values into bins (or buckets). It can apply to numerical values as well as categorical values. Binning transforms continuous variables into discrete ones by creating contiguous intervals spanning the range of variable values. Quantization or discretization can be achieved by two methods: fixed-width binning and quantile binning.

- Not revealing details
- Preventing models from overfitting

- Losing granularity of data for model building

In fixed-width binning, one sets fixed-width bins to quantize the data. The bins have custom-designed or automatically segmented ranges. For example, suppose the feature "Wafer Defect Count" has values ranging between 1 and 50 over n observations of measured wafers. We can set 5 intervals and fit the raw values into them, i.e., 1-10, 11-20, and so on. Each observation falls into one of these intervals and is assigned the corresponding bin number (1-10 = bin 1, 11-20 = bin 2, and so forth).
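The fixed-width scheme above can be sketched in a few lines of Python; the "Wafer Defect Count" values below are illustrative, not from any real dataset:

```python
import numpy as np

# Hypothetical "Wafer Defect Count" values between 1 and 50.
defect_counts = np.array([3, 17, 25, 42, 50, 8, 33])

# Fixed-width bins of width 10: 1-10 -> bin 1, 11-20 -> bin 2, ...
edges = [10, 20, 30, 40]  # right-inclusive upper edges of bins 1 through 4
bins = np.digitize(defect_counts, edges, right=True) + 1

print(bins.tolist())  # [1, 2, 3, 5, 5, 1, 4]
```

The binned values, not the raw counts, are what would be shared downstream.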

Another binning method assigns bins based on the distribution of the data. Fixed-width binning is straightforward and easy to compute; however, it performs poorly when the values differ widely or the distribution is non-uniform. In our previous example, imagine that the "Wafer Defect Count" feature has most of its values above 40 and none below 10. Fixed-width binning would then produce empty bins, with the majority of values encoded in the last bin. Quantile binning instead divides the data into equal-sized portions and helps when the data has a skewed distribution. The discretized result with quartile binning (4 bins) is shown in Table 1, where we assume quartile values of 20, 40, and 45.
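A minimal comparison of the two schemes on a skewed, hypothetical defect-count sample (pandas is used here purely for illustration):

```python
import pandas as pd

# Skewed hypothetical defect counts: most values above 40, none below 10.
defect_counts = pd.Series([12, 41, 43, 44, 46, 47, 48, 50])

# Fixed-width binning into 5 equal intervals leaves the low bins nearly empty.
fixed = pd.cut(defect_counts, bins=5, labels=False)

# Quantile binning (quartiles) puts roughly equal counts in each bin.
quant = pd.qcut(defect_counts, q=4, labels=False)

print(fixed.value_counts().to_dict())   # most observations land in the top bin
print(quant.value_counts().to_dict())   # two observations per bin
```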

When the data owner decides to discretize the values of a feature, they need to specify the normalization method. Specifying the type of binning (fixed-width or quantile) requires the data owner's knowledge of the feature's value distribution, especially when the data is highly skewed. Creating fewer bins hides more detail but also reduces the amount of information available to the model. An example Input Configuration File is shown in Table 2.

In addition to quantization, scaling is another method typically employed when features differ widely in magnitude. Scaling is usually a necessary step before using data as input for model development, as several machine learning algorithms are sensitive to the scale of the input values^{1,2}. In linear regression models, using multiple features with large differences in magnitude can cause numerical stability issues, because the model attempts to balance the scales, leading to suboptimal models. Scaling methods subtract and divide by constants and therefore do not change the shape of the original distribution, which enables further analysis (e.g., univariate distribution analysis) on the normalized data. In all of the examples that follow, we let *x* denote a vector or array of continuous values with *n* observations.

- Preparing data for model building
- Keeping original data distribution

- Not masking data enough, reversible

A typical method is standardization scaling, which shrinks the range of feature values and rescales the distribution to have a mean of 0 and a variance of 1. To implement this scaling, the mean of *x* is subtracted from each individual value, and the result is divided by the standard deviation of *x*.
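Standardization takes one line of NumPy; the sample values are hypothetical:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# The rescaled data has mean 0 and variance 1.
print(round(z.mean(), 10), round(z.var(), 10))  # 0.0 1.0
```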

Mean scaling centers the feature values at zero. The mean is subtracted from each value, and the result is divided by the range of *x*.

Min-Max scaling maps the data such that all values fall between 0 and 1. Specifically, the minimum value is subtracted from each value, and the result is divided by the range. Minimum values in the original dataset map to 0, while maximum values map to 1.

Max-Abs scaling behaves very similarly to Min-Max scaling. All values are divided by the maximum absolute value in *x*. For example, if the maximum value in *x* is 5 and the minimum value is -8, then all values are divided by 8, since the maximum absolute value of this feature is 8. The rescaled data is bounded between [-1, 1] if the feature takes both negative and positive values, and between [0, 1] if only positive values are present.

Robust scaling is based on percentiles and is therefore more robust when outliers exist in the data. The median is subtracted from each value, and the result is divided by the interquartile range (the range between the 1^{st} quartile and the 3^{rd} quartile). The median is used instead of the mean because the mean is highly susceptible to outliers, so in certain cases the median gives better results. Compared to the previous methods, data scaled with this robust approach will exhibit a larger range of values, and outliers remain present in the rescaled data.
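The four scaling schemes above can be sketched as small NumPy helpers; the sample values are hypothetical:

```python
import numpy as np

def mean_scale(x):
    # Center at zero: subtract the mean, divide by the range.
    return (x - x.mean()) / (x.max() - x.min())

def min_max_scale(x):
    # Map values into [0, 1].
    return (x - x.min()) / (x.max() - x.min())

def max_abs_scale(x):
    # Divide by the largest absolute value; output falls in [-1, 1].
    return x / np.abs(x).max()

def robust_scale(x):
    # Subtract the median, divide by the interquartile range.
    q1, q3 = np.percentile(x, [25, 75])
    return (x - np.median(x)) / (q3 - q1)

x = np.array([-8.0, -2.0, 0.0, 3.0, 5.0])
print(min_max_scale(x))   # minimum maps to 0, maximum to 1
print(max_abs_scale(x))   # all values divided by 8
print(robust_scale(x))    # median maps to 0
```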

All the scaling methods discussed above require computing statistical measures from the original dataset, such as the mean, median, standard deviation, maximum, and minimum. These values must remain unchanged across runs of the normalization, since the normalized data is used to build the machine learning model in the next step. If additional data is introduced, measures computed over the whole dataset become problematic because their values change as the data grows.

One possible solution is to specify the statistical measures of the features in advance. The data owner must provide these measures for each feature to be normalized. They can be stored in the configuration file with other feature attributes so that normalization runs automatically using pre-defined functions set up in the platform. For a specific parameter, the data owner provides the upper and lower control limits as well as the industry-standard value. The normalization then subtracts the industry-standard value from each feature value and divides by the difference between the upper and lower control limit values.
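A sketch of the control-limit normalization described above; the configuration entry and parameter name (Param_1) are hypothetical:

```python
import pandas as pd

# Hypothetical configuration supplied by the data owner: per-feature
# industry-standard value and upper/lower control limits.
config = {
    "Param_1": {"standard": 50.0, "ucl": 60.0, "lcl": 40.0},
}

def normalize_with_limits(values, spec):
    # (value - industry standard) / (UCL - LCL)
    return (values - spec["standard"]) / (spec["ucl"] - spec["lcl"])

data = pd.Series([45.0, 50.0, 55.0], name="Param_1")
result = normalize_with_limits(data, config["Param_1"])
print(result.tolist())  # [-0.25, 0.0, 0.25]
```

Because the constants come from the configuration file rather than from the dataset itself, the normalized values stay stable as new data arrives.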

Another solution is to group the data by year and use the statistical measures of the previous year's data. If measures for features cannot be obtained from the data owners, we can use statistical measures from historical data to normalize current data. For example, we can use the robust scaling method with historical statistical measures. First, we group the data by year and compute the quartile values for each year, as shown in Table 3. To normalize the data from 2016, we then use the quartile values of 2015. All values of *p*_{1} in 2016 have 5 subtracted from them and are then divided by 7 (i.e., 10 minus 3, as the quartile values in 2015 are 3, 5, and 10). All values of *p*_{2} in 2017 have 48 subtracted from them and are then divided by 26 (i.e., 50 minus 24, as the quartile values in 2016 are 24, 48, and 50). This procedure is also illustrated in Figure 3.

Figure 3: Using statistical measures from last year to normalize input data.
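The previous-year procedure can be sketched as follows; the yearly values here are synthetic, so the computed quartiles differ from those in Table 3:

```python
import pandas as pd

# Hypothetical yearly records for parameter p1.
df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2015, 2016, 2016],
    "p1":   [3.0, 4.0, 6.0, 12.0, 5.0, 19.0],
})

# Quartile values (Q1, median, Q3) computed per year.
quartiles = df.groupby("year")["p1"].quantile([0.25, 0.5, 0.75]).unstack()

def robust_scale_with_prior_year(values, year):
    # Robust scaling, but with the previous year's median and IQR.
    q1, med, q3 = quartiles.loc[year - 1]
    return (values - med) / (q3 - q1)

# 2016 values normalized with 2015's statistical measures.
scaled = robust_scale_with_prior_year(df.loc[df.year == 2016, "p1"], 2016)
print(scaled.tolist())
```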

As discussed in the previous section, the data owner can specify the measures for the features in the configuration file through Athinia’s dedicated platform. Table 4 illustrates the configuration file and corresponding data if the normalization removes the standard value from feature values and then rescales them to the controlled range. Figure 4 demonstrates this procedure in action on a synthetic dataset.

Figure 4: Scatterplots of internal values and normalized values for Parm_2 (all data presented herein is synthetic).

Skewed data is often a challenge when developing machine learning models. Transforming variables helps build and improve model performance, as many machine learning models perform better when the variables are normally distributed. Some common variable transformations that make a variable's distribution normal (or Gaussian) are discussed below.

- Better model performance
- Less input from data owners

- Not masking data enough, reversible
- Some transformations require positive values

When feature values are all positive and highly skewed, applying a logarithmic transformation can reduce the skewness and, ideally, make the resulting distribution more Gaussian-like. Each value of *x* is replaced with log(*x*). This is often used in finance, as the raw returns of a given stock or portfolio typically follow a skewed (log-normal) distribution, and the logarithmic transform aids the design of machine learning models. Note, however, that the original data is easily recovered by the inverse computation.
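A minimal illustration on synthetic log-normal data, including the easy reversibility noted above:

```python
import numpy as np

# Hypothetical positive, right-skewed values (log-normal by construction).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

logged = np.log(x)

# Sample skewness: large for the raw data, near zero after the transform.
skew_raw = ((x - x.mean()) ** 3).mean() / x.std() ** 3
skew_log = ((logged - logged.mean()) ** 3).mean() / logged.std() ** 3
print(round(skew_raw, 2), round(skew_log, 2))

# Reversibility: exponentiation recovers the original values exactly.
assert np.allclose(np.exp(logged), x)
```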

The commonly used logarithmic transformation is a specific member of the family of power transforms. The Box-Cox (6) and Yeo-Johnson (7) transformations are also helpful for transforming feature values (*y*) into more Gaussian-like distributions. Unlike the logarithmic and Box-Cox transformations, which only take positive values as input, the Yeo-Johnson transformation also allows negative values, as shown in Equations (6) and (7), respectively. Each has a power parameter, which can be chosen to address different data distributions and can be estimated from the underlying data. The application of these power transforms to an example dataset in Foundry is described below.
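As a sketch, both power transforms are available in SciPy, which also estimates the power parameter from the data by maximum likelihood; the input data here is synthetic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=1000)  # skewed, strictly positive

# Box-Cox requires strictly positive inputs; lambda is estimated by
# maximum likelihood when not supplied.
bc, bc_lambda = stats.boxcox(x)

# Yeo-Johnson also accepts zero and negative values.
yj, yj_lambda = stats.yeojohnson(x - 1.0)  # shifted data includes negatives

print("box-cox lambda:", round(bc_lambda, 3))
print("yeo-johnson lambda:", round(yj_lambda, 3))
```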

Data transformations are particularly useful when an input variable's distribution is skewed and does not follow a symmetric Gaussian distribution. The advantage of transforming an input dataset with a power transform is that it changes the underlying distribution, typically resulting in improved accuracy for machine learning algorithms and regressions. Linear regression models rely on the assumption that the residual errors are random and normally distributed; hence it is important to prepare input data that allows for accurate predictions and valid application of the model. Furthermore, normally distributed input data has been shown to yield better predictions because of improved numerical stability and optimization.

In Figure 5, a notional dataset binned into ten bins is shown in blue. The standard scaling method is applied (subtracting the mean and dividing by the standard deviation) to obtain the newly scaled dataset shown in red; the shape of the distribution is unchanged, only the values and magnitudes are shifted. The probability plot on the right shows how the observed standardized values compare to a perfectly normal Gaussian curve (the red line). The correlation coefficient in this example is 0.95, with the outliers at the tails and the skew accounting for the departure from the ideal case. However, if a power transform is applied, the Yeo-Johnson in this example, the observed values and transformed distribution resemble a Gaussian distribution (yellow distribution curve in the bottom left panel). In addition, the probability plot shows that the transformed values closely coincide with the theoretical quantiles of a perfect Gaussian, further supported by the improved correlation coefficient of 0.99.

This particular example shows how applying power transforms can, in certain cases, completely alter the input variable distribution, simultaneously acting as a normalization and obfuscation technique to maintain security and data privacy of the underlying variable.

Figure 5: Standardization Scaling and Power Transform Illustration on Notional Data.

In Foundry, we uploaded a commonly used dataset of sonar measurements to show the benefits of the power transformation. The Sonar dataset describes sonar returns from rocks or simulated mines and is a standard machine learning dataset for binary classification. The dataset contains 208 rows and 61 columns, with the first 60 columns as real-valued inputs and the last column as the 2-class target variable. We plot the histograms of the input variables and observe that many are highly skewed (as shown in Figure 6a). We then fit a k-nearest neighbor (KNN) model on the raw dataset and obtain an accuracy of 0.797 using repeated stratified k-fold cross-validation (Figure 7).

We apply power transformations to the input variables and find that models trained on the transformed data achieve higher accuracy (Figure 7). Since the Box-Cox transformation only accepts positive values, we perform Min-Max scaling to shift the data to positive values before applying it. With the Box-Cox transformation, the histograms of the input variables look more Gaussian than the raw data (as shown in Figure 6b), and the model achieves an accuracy of 0.818. Similarly, the Yeo-Johnson transformation makes the input variables more Gaussian-like (as shown in Figure 6c) and improves the accuracy to 0.808.

Figure 6: Histogram plots of input variables from the Sonar dataset.

Figure 7: Data transformation and model building using the Sonar Dataset in Foundry.
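This workflow can be approximated with scikit-learn. Since the Sonar dataset itself is not reproduced here, the sketch below substitutes a synthetic skewed binary-classification problem of the same shape, so its accuracies will not match the figures:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# Synthetic stand-in for the Sonar data: 208 rows, 60 features, 2 classes,
# exponentiated to make the feature distributions right-skewed.
X, y = make_classification(n_samples=208, n_features=60, n_informative=10,
                           random_state=1)
X = np.exp(X / 2)  # all values positive and skewed

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Baseline: KNN on the raw (skewed) features.
raw = cross_val_score(KNeighborsClassifier(), X, y, cv=cv).mean()

# Min-Max scale into a strictly positive range so Box-Cox is applicable,
# then power-transform inside the pipeline to avoid leakage across folds.
model = make_pipeline(MinMaxScaler(feature_range=(1, 2)),
                      PowerTransformer(method="box-cox"),
                      KNeighborsClassifier())
transformed = cross_val_score(model, X, y, cv=cv).mean()

print(f"raw accuracy:     {raw:.3f}")
print(f"box-cox accuracy: {transformed:.3f}")
```

Fitting the transformer inside the pipeline ensures its parameters are estimated only on each fold's training split.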

When shared data is sensitive, data owners can apply feature creation, combining several feature variables into new ones that are then used as inputs for model development. Feature creation is implemented by aggregation and/or by establishing relationships between features of interest.

- Straightforward and easy to compute
- Effective new features can improve model performance

- Not masking data enough
- More input is needed from data owners

Aggregating original feature values to compute the mean, min, max, and median is a useful way to derive new features. A practical example is a supplier dataset containing records for each material batch, while the outcome of interest is at the less granular vendor-batch level, where each vendor batch comprises many material batches. The original material-batch values can be aggregated to obtain the mean value for each vendor batch.
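In pandas this is a one-line groupby; the batch identifiers and purity values below are hypothetical:

```python
import pandas as pd

# Hypothetical supplier records: one row per material batch, several
# material batches per vendor batch.
df = pd.DataFrame({
    "vendor_batch":   ["V1", "V1", "V1", "V2", "V2"],
    "material_batch": ["M1", "M2", "M3", "M4", "M5"],
    "purity":         [99.1, 99.3, 99.2, 98.8, 99.0],
})

# Aggregate material-batch values up to the vendor-batch level.
agg = df.groupby("vendor_batch")["purity"].agg(["mean", "min", "max", "median"])
print(agg)
```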

Alternatively, creating new features that describe relationships between existing features can improve model performance while masking the original feature values. In the example below (Table 5), the data owner identifies that parameters P_{1} and P_{2} contain sensitive data, but their difference or product may conceal the sensitive information without sacrificing usefulness for predictive modeling. New features can be created from such relationships, and the original sensitive features removed, reducing the dimensionality of the input dataset.

Data owners need to specify the aggregation level for data normalization. If the data owner believes that the variation of a feature, e.g., Param_2 in Table 6, within each lot will not affect future analysis, then it may be appropriate to use one of the statistical measures in place of the raw values. This should be done only after careful consideration and consultation with experts on the manufacturing team who know that particular data for each lot. If an aggregate value is deemed appropriate, all Param_2 data can be aggregated by lot to obtain mean values, so that all records of Param_2 within each lot carry the same value, the computed arithmetic mean.
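Both ideas, relationship-based features and per-lot aggregation, can be sketched together in pandas; the lot labels and parameter values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "lot":     ["A", "A", "B", "B"],
    "Param_1": [10.0, 12.0, 20.0, 22.0],
    "Param_2": [4.0, 6.0, 5.0, 7.0],
})

# Derived features mask the sensitive raw values.
df["diff_P1_P2"] = df["Param_1"] - df["Param_2"]
df["prod_P1_P2"] = df["Param_1"] * df["Param_2"]

# Replace Param_2 with its per-lot mean; every record in a lot
# then carries the same aggregated value.
df["Param_2"] = df.groupby("lot")["Param_2"].transform("mean")

# Drop the raw sensitive feature, reducing dimensionality.
df = df.drop(columns=["Param_1"])
print(df)
```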

Ranking is another powerful data transformation tool. Ranking reduces the complexity of the information by replacing original data with their rank. There are various ranking methods that deal with equal values in the data differently and are summarized in Table 7 using an example dataset.

- Masking original data and reducing complexity
- Focusing on the ordering of the data

- Losing much information
- Not ideal as predictors in model building
- Creating a large range of values

In standard competition ranking, equal values receive the same ranking, and gaps are left after the equal values in the ranking numbers. For example, suppose we have 5 values and sort them to get 110, 130, 130, 150, and 180. The ranking numbers in standard competition ranking are 1, 2, 2, 4, and 5, respectively: because two equal values follow the first-ranked value, the third rank is skipped.

Modified competition ranking is similar to standard competition ranking but leaves gaps before the equal values. With the 5 values from the previous example, the ranking numbers are 1, 3, 3, 4, and 5 in modified competition ranking: ranking number 2 is skipped, and the two equal values are ranked 3.

In dense ranking, no ranking number is skipped. Equal values receive the same ranking number, and the next value receives the immediately following number. Using the same 5 values (110, 130, 130, 150, and 180), we get ranking numbers 1, 2, 2, 3, and 4. When equal values exist, the last ranking number is smaller than the number of values.

Unlike previous ranking methods, all values receive unique ranking numbers in the ordinal ranking, even if they compare equal. The ranking numbers for 110, 130, 130, 150, and 180 are 1, 2, 3, 4, and 5, respectively. We need some rules to determine the ranking numbers of equal values because equal values are ranked differently in this method. We could sort equal values by date or batch number so that the results are consistent across runs of normalization.

Equal values receive the same ranking numbers in fractional ranking, but may end up with fractional numbers, as the name suggests. We can think of the fractional ranking as a fairer alternative to the ordinal ranking – equal values now receive the mean of their ranking numbers under the ordinal ranking. For the two equal values, 130, in our example, they both receive a ranking number of 2.5 in the fractional ranking instead of two different ranking numbers, 2 and 3, in the ordinal ranking.
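The five schemes map directly onto pandas' rank methods; on the example values above:

```python
import pandas as pd

values = pd.Series([110, 130, 130, 150, 180])

ranks = pd.DataFrame({
    "standard_competition": values.rank(method="min"),      # 1, 2, 2, 4, 5
    "modified_competition": values.rank(method="max"),      # 1, 3, 3, 4, 5
    "dense":                values.rank(method="dense"),    # 1, 2, 2, 3, 4
    "ordinal":              values.rank(method="first"),    # 1, 2, 3, 4, 5
    "fractional":           values.rank(method="average"),  # 1, 2.5, 2.5, 4, 5
})
print(ranks)
```

Note that `method="first"` breaks ties by order of appearance, so for reproducible ordinal ranks the data should first be sorted by a stable key such as date or batch number.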

Furthermore, we can use rank correlation coefficients. Two prominent examples are Spearman’s rank correlation coefficient (*ρ*) and Kendall’s rank correlation coefficient (*τ*), which measure the relationship between two variables: as one variable increases, the other tends to increase or decrease. Both Spearman’s *ρ* and Kendall’s *τ* measure the strength and direction of the association between the two variables. Kendall’s *τ* is usually smaller than Spearman’s *ρ* and is less sensitive to error, while Spearman’s *ρ* is based on deviations and more sensitive to error.

The Spearman correlation coefficient, *ρ*, is calculated as follows:

*ρ* = cov(R(X), R(Y)) / (*σ*_{R(X)} *σ*_{R(Y)})

where *cov(R(X), R(Y))* is the covariance of the rank variables, and *σ*_{R(X)} and *σ*_{R(Y)} are the standard deviations of the rank variables.
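Both coefficients are available in SciPy; the two short series below are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr, kendalltau

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])  # mostly increases with x

rho, _ = spearmanr(x, y)    # Spearman's rho from the rank covariance
tau, _ = kendalltau(x, y)   # Kendall's tau from concordant/discordant pairs

# As is typical, tau is smaller in magnitude than rho.
print(round(rho, 2), round(tau, 2))  # 0.8 0.6
```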

Data owners can specify their preferred ranking method in the configuration file. Table 8 shows an example of the configuration file and the normalized data.

These standard techniques are readily implemented in practical industry applications. However, novel techniques are continuously being researched by the academic community to design improved metrics and measures. The advancement of algorithms, software, and hardware, together with ever-growing data volumes, requires that researchers address emerging challenges, one of which is data privacy. Zhang et al. from Princeton University introduce a mechanism to preserve data privacy in the paper “Privacy-preserving Machine Learning through Data Obfuscation.” The mechanism applies obfuscation functions on both the individual level and the group level. It introduces noise, e.g., Gaussian noise, to obfuscate individual samples, and adds carefully crafted training samples to each group to hide the statistical properties of groups. The algorithms are described in Figure 7.

The mechanism might seem too complex for manufacturing data, as the paper applies it to image recognition. However, it can inspire us to add crafted synthetic data to original data to hide important measures or statistical properties, performing data obfuscation in a unique way.

Athinia is designed to leverage the power of Foundry, software built by Palantir Technologies. The various data transformations and preprocessing techniques described in this document can all be applied within the Foundry platform, providing a one-stop shop for data normalization, obfuscation, aggregation, and calculation of statistics. Furthermore, combining Foundry with Athinia enables the development of operational workflows and the design of an ontology representing the semiconductor industry as a whole. It brings suppliers and integrated device makers into one central hub, allowing data ingestion and machine learning models to be built on top and continuously improved as more data is introduced, improving product quality and easing supply-chain shortages. The expertise of the engineers at Athinia and Palantir helps position this platform as the state-of-the-art facility needed to address the most challenging semiconductor problems of this decade.