Secure Collaboration • Athinia

The need for data sharing in the semiconductor industry.

With digital technologies proliferating and the chip shortage adding further strain, the semiconductor industry is under immense pressure to manufacture products with few defects and to bring innovations to market rapidly. The vast amounts of data produced today create opportunities for the entire industry to increase production, accelerate innovation and reduce costs. Materials-related industry participants know that data collaboration is needed to improve total output, production, research and quality, but several challenges must be addressed first.

First, individual companies are reluctant to build an ecosystem on their own given the upfront cost and time required. Second, companies usually have disparate data systems, long learning cycles and lengthy processes for building new capabilities. Finally, no existing solution addresses these issues while preserving intellectual property (IP) and ensuring that companies retain full control over their data. A collaborative data ecosystem helps identify data relationships that can be automated, because raw material, finished goods and device process data can all be combined. Equally important is optimizing equipment performance and maintenance based on specific parameters and on-site performance, so that the right resources are directed to maximize uptime and minimize costs.

Security

The Athinia® platform helps Semiconductor Device Makers and their suppliers to solve real-world problems using a powerful, secure software platform based on Palantir Foundry.

It was built for security-conscious customers who need to handle financial data, Personally Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), and even classified government data in a secure and compliant manner. Athinia's strong security supports regulatory compliance across industries and continents by aligning with frameworks like HIPAA, GDPR, and ITAR.

As the software powers mission-critical operations across major corporations and governments alike, the threat model focuses on defeating attacks by highly resourced, technical, and persistent adversaries. To defeat these adversaries, we take a highly opinionated stance and enforce a high minimum bar of security for all customers.

Data Protection

When parties are unwilling to share raw process data, they can obfuscate and normalize the data before it is shared.
The platform offers powerful obfuscation procedures that tokenize and encrypt sensitive information in the dataset, such as column names and part identifiers.
Sophisticated statistical normalization methods are also provided. They add a further layer of security for sensitive data while maintaining its usefulness for advanced analytics such as machine learning.

Data Governance

The Foundry platform’s robust end-to-end security architecture protects intellectual property and ensures customers always stay in control of their data.
With tailored and granular permissions, customers control who they share data with, how the data can be used, and for how long.
Use of data is recorded throughout the platform using powerful data lineage and provenance techniques, even when datasets are combined, shared, aggregated, or used for machine learning.
The system can track and monitor who uses the data, how often, and for what purpose.
Multi-level approval workflows ensure that data sharing follows your company’s data governance framework.
The governance mode is configurable to serve unique business needs.

Infrastructure Security

Our cloud platform’s infrastructure, applications, and operations have been developed to exceed and complement the most rigorous legal and regulatory requirements across multiple industries today, including healthcare and defense.
Athinia maintains stringent network controls to protect our customers. Fundamental network security principles include Intrusion Detection/Prevention Systems, Data Loss Prevention Systems, and strong encryption of network traffic.
Detailed audit logs containing user actions, including records of imports, reads, writes, searches, exports & deletions are collected and made available for import into a Security Information and Event Management (SIEM).

Secure Data Collaboration ecosystem with tailored permissions

Customers have full ownership of their data. Athinia enables data sharing on normalized and obfuscated data.

Data pipeline in Athinia

Data exists in various forms, sources, and ranges. The process manufacturing datasets and variables attributed to different processes may have significant differences in magnitudes and units of measurement. Hence it is essential that the data be transformed in certain use cases to enable the development of a reliable and accurate model. A simplified example of the data platform between Supplier and Integrated Device Maker (IDM) is illustrated in Figure 1.

Figure 1: Example of an Athinia data pipeline between supplier and IDM after being obfuscated and merged.

Data Normalization

In the data pipeline shown in Figure 1, normalization is a critical step for two reasons: protecting data owners' privacy and supporting model building. We implement standard, peer-reviewed techniques to normalize and obfuscate data, providing data security in a collaborative environment while preventing data leakage. In addition, we combine feature engineering techniques into the normalization process to support machine learning model development. It is therefore important to select the appropriate normalization methods and implement the set that best balances security and model performance. Approaches to this challenge are discussed in the following sections of this white paper.

Data normalization techniques

Examples of data normalization techniques are shown in Figure 2. The main topics which will be described include: quantization/discretization, scaling, transforming, feature creation, ranking and examples of academic research.

Figure 2: Examples of standard normalization techniques.

Quantization (Discretization)

Quantization, or discretization, involves grouping the original values into bins (or buckets). It can apply to numerical values as well as categorical values. Binning transforms continuous variables into discrete ones by creating contiguous intervals spanning the range of variable values. Quantization or discretization can be achieved by two methods: fixed-width binning and quantile binning.

PROS

Not revealing details
Preventing data from overfitting

CONS

Losing granularity of data for model building

Fixed-Width Binning

In fixed-width binning, data is quantized into bins of equal width. The bin ranges can be custom-designed or automatically segmented. For example, suppose the feature "Wafer Defect Count" has values ranging between 1 and 50 across n observations of measured wafers. We can set 5 intervals and fit the raw values into these intervals, i.e., 1-10, 11-20, and so on. Each observation falls into one of these intervals and is assigned the corresponding bin number (1-10 = bin 1, 11-20 = bin 2, and so forth).
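
The fixed-width scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fixed_width_bin` is hypothetical, the feature is assumed to be integer-valued, and any remainder left when the range does not divide evenly is absorbed by the last bin.

```python
def fixed_width_bin(value, lower, upper, n_bins):
    """Return (bin_number, label) for an integer value in [lower, upper].

    Assumption: integer-valued feature; the last bin absorbs any remainder.
    """
    if not lower <= value <= upper:
        raise ValueError(f"{value} is outside [{lower}, {upper}]")
    width = (upper - lower + 1) // n_bins            # e.g. (50 - 1 + 1) // 5 = 10
    index = min((value - lower) // width, n_bins - 1)
    start = lower + index * width
    end = upper if index == n_bins - 1 else start + width - 1
    return index + 1, f"{start}-{end}"


# "Wafer Defect Count" example: values between 1 and 50, 5 bins
for count in (7, 14, 43):
    print(count, fixed_width_bin(count, lower=1, upper=50, n_bins=5))
```

Only the bin number or label would be shared, so the exact defect count stays with the data owner.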

Quantile Binning: Grouping Based on the Distribution of Data

Another binning method is to assign the bins based on the distribution of the data. Fixed-width binning is straightforward and easy to compute; however, it performs poorly when the values differ widely or are not uniformly distributed. In our previous example, imagine that the "Wafer Defect Count" feature has most of its values above 40 and none below 10. Fixed-width binning then leaves the first bin empty and encodes the majority of observations in the last bin. Quantile binning instead divides the data into portions that contain equal numbers of observations, which helps when the data has a skewed distribution. The discretized result with quartile binning (4 bins) is shown in Table 1. Here we assume the quartile values are 20, 40, and 45.

Table 1: Quantization example (Data presented herein is synthetic)

Batch Number	Wafer Defect Count	Wafer Defect Count - Discretized (Fixed-Width Binning)	Wafer Defect Count - Discretized (Quantile Binning)
B000001	43	41-50	41-45
B000002	39	31-40	21-40
B000003	45	41-50	41-45
B000004	27	21-30	21-40
B000005	48	41-50	46-50
B000006	14	11-20	1-20
B000007	47	41-50	46-50
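
The quantile column of Table 1 can be reproduced with a short Python sketch. It assumes the quartile values stated in the text (20, 40, 45), an integer-valued feature bounded by 1 and 50, and that a value equal to a cut point falls into the lower bin; the function name `quantile_bin` is hypothetical.

```python
from bisect import bisect_left


def quantile_bin(value, cut_points, lower, upper):
    """Label an integer value with the interval delimited by sorted cut points."""
    edges = [lower - 1, *cut_points, upper]
    i = bisect_left(cut_points, value)   # a value equal to a cut point stays in the lower bin
    return f"{edges[i] + 1}-{edges[i + 1]}"


quartiles = [20, 40, 45]                 # assumed quartile values from the text
defect_counts = [43, 39, 45, 27, 48, 14, 47]
labels = [quantile_bin(c, quartiles, lower=1, upper=50) for c in defect_counts]
print(labels)
```

In practice the cut points would be computed from the data owner's own distribution rather than assumed.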

Implementing Quantization - Data Owner's Input in the Configuration File

When the data owner decides to discretize the values of a feature, they need to specify the normalization method. Choosing the type of binning (fixed-width or quantile) requires the data owner's knowledge of the feature's value distribution, especially when the data is highly skewed. Creating fewer bins hides more detail but also reduces the amount of information available to the model. An example of an input configuration file is shown in Table 2.

Table 2: Quantization configuration file (data presented herein is synthetic)

Name	Param_1
Description	xxxxxxxxxxxx
Category	Patterned
Type	Double
Unit	%
Min	11
Max	90
Normalization Method	Binning
Type of Binning	Fixed-Width
Number of Bins	8

Internal Value	Normalized Value
15.34	11-20
23.75	21-30
65.79	61-70
47.36	41-50
85.63	81-90
34.68	31-40
43.58	41-50
56.86	51-60
64.32	61-70

Scaling (Feature Normalization)

In addition to quantization, scaling is typically employed when features differ widely in magnitude. Scaling is often a necessary step before the data is used as input for model development, since several machine learning algorithms are sensitive to the scale of the input values¹,². In linear regression models, using multiple features with large differences in magnitude can result in numerical stability issues as the model attempts to balance the scales, and hence lead to suboptimal models. Scaling methods only shift each feature and divide it by a constant, so they do not change the shape of the original distribution, which allows further analysis (e.g., of univariate distributions) on the normalized data. In all of the examples that follow, we let x denote a vector or array of continuous values with n observations.

PROS

Preparing data for model building
Keeping original data distribution

CONS

Not masking data enough, reversible

Standardization Scaling

A typical method is standardization scaling, which rescales the feature values so that they have a mean of 0 and a variance of 1. To implement this scaling, the mean of x is subtracted from each individual value, and the result is then divided by the standard deviation of x.
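
A minimal Python sketch of standardization follows. The source does not say whether the population or the sample standard deviation is meant; the population form (`statistics.pstdev`) is assumed here, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev


def standardize(x):
    """Subtract the mean of x, then divide by its (population) standard deviation."""
    mu, sigma = mean(x), pstdev(x)
    return [(v - mu) / sigma for v in x]


z = standardize([2, 4, 4, 4, 5, 5, 7, 9])   # mean 5, standard deviation 2
print(z)
```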

Mean Scaling

Mean scaling centers the feature values at zero. Once again, the mean is subtracted from each value, and the result is then divided by the range of x.
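
As a brief illustration, mean scaling can be written as follows; the function name `mean_scale` is hypothetical, and the range is taken to be the maximum minus the minimum of x.

```python
def mean_scale(x):
    """Center x at zero, then divide by the range (max - min)."""
    mu = sum(x) / len(x)
    span = max(x) - min(x)
    return [(v - mu) / span for v in x]


scaled = mean_scale([10, 20, 30, 40])   # mean 25, range 30
print(scaled)
```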

Min-Max Scaling

The Min-Max scaling method scales the data such that all values fall between 0 and 1. Specifically, the minimum value is subtracted from each value, and the result is divided by the range. The minimum value in the original dataset maps to 0, while the maximum value maps to 1.
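
The same step, sketched in Python under the assumption that x contains at least two distinct values (otherwise the range is zero); `min_max_scale` is a hypothetical name.

```python
def min_max_scale(x):
    """Map the minimum of x to 0 and the maximum to 1."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


print(min_max_scale([23, 57, 40]))
```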

¹ Kuhn, M. and Johnson, K., 2019. Feature engineering and selection: A practical approach for predictive models. CRC Press.

² http://www.feat.engineering/numeric-one-to-one.html

Max-Abs Scaling

Max-Abs scaling behaves very similarly to the Min-Max scaling method. All values are divided by the maximum absolute value in x. For example, if the maximum value in x is 5 and the minimum value is -8, then all values will be divided by 8 since the maximum absolute value of this feature is 8. The resulting range depends on whether negative values are present: the rescaled data is bounded within [-1, 1] if the feature takes both negative and positive values, and within [0, 1] if only positive values are present.
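
The example with a maximum of 5 and a minimum of -8 can be checked with a short sketch (`max_abs_scale` is a hypothetical name; x is assumed to contain at least one non-zero value):

```python
def max_abs_scale(x):
    """Divide every value by the largest absolute value in x."""
    m = max(abs(v) for v in x)
    return [v / m for v in x]


print(max_abs_scale([5, -8, 2]))   # every value is divided by 8
```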

Robust Scaling

Robust scaling is based on percentiles and is therefore more robust when outliers exist. The median is subtracted from each value, and the result is then divided by the interquartile range (the range between the 1st quartile and the 3rd quartile). This method uses the median instead of the mean because the mean is highly susceptible to outliers, so in certain cases the median gives better results. Compared to the previous methods, data scaled using this robust approach will exhibit a larger range of values. Furthermore, the outliers will still be present in the rescaled data.
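
A small sketch of robust scaling using only the Python standard library. Quartile conventions differ between tools; the "inclusive" method of `statistics.quantiles` is an assumption made here, and `robust_scale` is a hypothetical name.

```python
from statistics import median, quantiles


def robust_scale(x):
    """Subtract the median, then divide by the interquartile range (Q3 - Q1)."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    med = median(x)
    return [(v - med) / (q3 - q1) for v in x]


# The outlier (100) is still present after scaling and stretches the output range.
print(robust_scale([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```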

Scaling Methods in the Data Pipeline

All the scaling methods discussed above require the computation of statistical measures from the original dataset, such as the mean, median, standard deviation, maximum and minimum. These values must remain unchanged across runs of the normalization, since the normalized data will be used to build the machine learning model in the next step. If additional data is introduced, using the statistical measures of the whole dataset causes a problem, because these values then change as well.

One possible solution is to specify the statistical measures of features in advance. The data owner must provide these measures for each feature to be normalized. The measures could be stored in the configuration file with other feature attributes, so that the normalization can be completed automatically with pre-defined functions set up in the platform. For a specific parameter, the data owner provides the upper and lower control values, as well as the industry standard value. The normalization then subtracts the industry standard value from each feature value and divides the result by the difference between the upper and lower control values.

Another solution is to group the data by year and take the statistical measures from the previous year's data. If measures for features cannot be obtained from data owners, we can use statistical measures from historical data to normalize current data, a method known as imputation. For example, we can use the robust scaling method with historical statistical measures. First, we group the data by year and compute the quartile values for each year, as shown in Table 3. To normalize the data in 2016, we then use the quartile values of 2015. All values of P₁ in 2016 have 5 subtracted and are then divided by 7 (i.e., 10-3, as the quartile values in 2015 are 3, 5, and 10). Likewise, all values of P₂ in 2017 have 48 subtracted and are then divided by 26 (i.e., 50-24, as the quartile values in 2016 are 24, 48, and 50). This procedure is also illustrated in Figure 3.

Table 3: Scaling methods example (all data herein is synthetic)

Year	P₁ Quartiles	P₂ Quartiles	P₃ Quartiles
2015	[3, 5, 10]	[22, 45, 50]	[0.23, 0.45, 0.83]
2016	[4, 5, 10]	[24, 48, 50]	[0.24, 0.45, 0.82]
2017	[3, 4, 9]	[23, 47, 52]	[0.22, 0.47, 0.83]
2018	[3, 5, 10]	[24, 44, 51]	[0.23, 0.45, 0.82]

Figure 3: Using statistical measures from last year to normalize input data.
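
The previous-year procedure can be sketched as a lookup into the quartiles of Table 3. The dictionary layout and the function name `scale_with_previous_year` are assumptions made for illustration; only the 2015 and 2016 rows for P₁ and P₂ are copied from the table.

```python
# (Q1, median, Q3) per year and parameter, copied from Table 3
QUARTILES = {
    2015: {"P1": (3, 5, 10), "P2": (22, 45, 50)},
    2016: {"P1": (4, 5, 10), "P2": (24, 48, 50)},
}


def scale_with_previous_year(value, param, year, quartiles=QUARTILES):
    """Robust-scale a value using the previous year's median and interquartile range."""
    q1, med, q3 = quartiles[year - 1][param]
    return (value - med) / (q3 - q1)


print(scale_with_previous_year(12, "P1", 2016))   # (12 - 5) / (10 - 3)
print(scale_with_previous_year(61, "P2", 2017))   # (61 - 48) / (50 - 24)
```

Because the previous year's measures are fixed, new data arriving in the current year does not change values that were already normalized.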

Implementing Scaling – Data Owner’s Input in the Configuration File

As discussed in the previous section, the data owner can specify the measures for the features in the configuration file through Athinia’s dedicated platform. Table 4 illustrates the configuration file and corresponding data if the normalization removes the standard value from feature values and then rescales them to the controlled range. Figure 4 demonstrates this procedure in action on a synthetic dataset.

Table 4: Scaling configuration file and corresponding data (all data herein is synthetic).

Name	Param_2
Description	xxxxxx
Category	Patterned
Type	int
Unit	None
Min	23
Max	57
Normalization Method	Scaling
Standard Value	35
Lower Control Value	27
Upper Control Value	48

Internal value	Normalized Value
32	-0.14286
34	-0.04762
29	-0.28571
48	0.619048
26	-0.42857
37	0.095238
23	-0.57143
48	0.619048
51	0.761905
27	-0.38095

Figure 4: Scatterplots of internal values and normalized values for Param_2 (all data presented herein is synthetic).
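
The normalized values in Table 4 follow from the configuration by subtracting the standard value and dividing by the span between the control values. A minimal sketch, in which the dictionary keys and the function name `scale_by_control_limits` are hypothetical stand-ins for the configuration file fields:

```python
CONFIG = {  # values copied from Table 4; key names are illustrative only
    "name": "Param_2",
    "standard_value": 35,
    "lower_control_value": 27,
    "upper_control_value": 48,
}


def scale_by_control_limits(value, config):
    """(value - standard value) / (upper control value - lower control value)."""
    span = config["upper_control_value"] - config["lower_control_value"]
    return (value - config["standard_value"]) / span


for internal in (32, 48, 23):
    print(internal, round(scale_by_control_limits(internal, CONFIG), 6))
```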

Variable Transformation

Skewed data is often a challenge when developing machine learning models. Transforming variables can improve model performance, as many machine learning models perform better when the variables are normally distributed. Some common variable transformation techniques that make the variable distribution more normal (or Gaussian) are discussed below.

PROS

Better model performance
Less input from data owners

CONS

Not masking data enough, reversible
Some transformations require positive values

Logarithmic Transformation

When feature values are all positive and highly skewed, applying a logarithmic transformation can help reduce the skewness and ideally make the resulting distribution more Gaussian-like. Each value of x is replaced with log(x). This is often used in finance, as the raw returns of a given stock or portfolio typically follow a skewed (log-normal) distribution. The logarithmic transform helps in the design of machine learning models, but the original data is also easy to recover by applying the inverse (exponential) function.
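
The reversibility noted above is easy to demonstrate. In this sketch the natural logarithm is assumed (the text does not name a base), and both function names are hypothetical.

```python
import math


def log_transform(x):
    """Replace each strictly positive value with its natural logarithm."""
    if min(x) <= 0:
        raise ValueError("the logarithmic transformation requires positive values")
    return [math.log(v) for v in x]


def inverse_log_transform(y):
    """Recover the original values, which shows the transform alone masks little."""
    return [math.exp(v) for v in y]


x = [1.0, 10.0, 100.0, 1000.0]          # highly skewed, all positive
y = log_transform(x)
recovered = inverse_log_transform(y)
print(y, recovered)
```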

Power Transformation

The commonly used logarithmic transformation is a specific example of the family of power transforms. In addition, the Box-Cox (6) and Yeo-Johnson (7) transformations are also helpful for transforming feature values (y) into more Gaussian-like distributions. Unlike the logarithmic and Box-Cox transformations, which only take positive values as input, the Yeo-Johnson transformation also allows negative values, as shown in Equations (6) and (7), respectively. Both transformations have a power parameter (λ), which can be chosen to address different data distributions and can be estimated from the underlying data. The application of these power transforms in Foundry is described with an example dataset below.
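
For reference, the standard textbook forms of the two transforms are given here, with λ denoting the power parameter and y the feature value. These are the commonly published definitions and are assumed to correspond to the Box-Cox and Yeo-Johnson equations referenced in the text.

```latex
% Box-Cox transformation (defined for y > 0)
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\ln y, & \lambda = 0
\end{cases}

% Yeo-Johnson transformation (defined for all real y)
\psi(\lambda, y) =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0 \\[6pt]
\ln(y + 1), & \lambda = 0,\ y \geq 0 \\[6pt]
-\dfrac{(-y + 1)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0 \\[6pt]
-\ln(-y + 1), & \lambda = 2,\ y < 0
\end{cases}
```

With λ = 1, the Yeo-Johnson transform leaves the data unchanged, and the Box-Cox transform only shifts it by one.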

Application of Transformations in Skewed Data

Data transformations are particularly useful when an input variable's distribution is skewed and does not follow a symmetric Gaussian distribution. The advantage of transforming an input dataset with a power transform is that it changes the underlying distribution, typically resulting in improved accuracy for machine learning algorithms and regressions. Linear regression models rely on the assumption that the residual errors are completely random and follow a normal distribution, so it is important to prepare input data that allows accurate predictions from the given model. Furthermore, normally distributed input data has been shown to improve predictions because of better numerical stability and optimization.

In Figure 5, a notional dataset whose distribution is binned into ten bins is shown in blue. Standard scaling is applied (subtracting the mean and dividing by the standard deviation) to obtain the scaled dataset shown in red; the shape of the distribution is unchanged, and only the values and magnitudes shift. The probability plot on the right shows how the observed standardized values compare to a perfectly normal Gaussian curve (the red line). The correlation coefficient in this example is 0.95, lower than the ideal case because of the outliers at the tails and the skew. However, if a power transform is applied, here the Yeo-Johnson transform, the transformed distribution resembles a Gaussian distribution (yellow distribution curve in the bottom left panel). In addition, the probability plot shows that the transformed values closely coincide with the theoretical quantiles of a perfect Gaussian, further supported by the improved correlation coefficient of 0.99.

This example shows how applying power transforms can, in certain cases, completely alter the input variable distribution, simultaneously acting as a normalization and obfuscation technique that helps maintain the security and data privacy of the underlying variable.
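
To make the transform concrete, here is a minimal Python sketch of the Yeo-Johnson transform for a fixed power parameter. It follows the commonly published definition of the transform; the function name `yeo_johnson` is hypothetical, and estimating the power parameter from data (as a library or the platform would do) is deliberately left out.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y for a fixed power parameter lam."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)
        return ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -((1 - y) ** (2 - lam) - 1) / (2 - lam)


# lam < 1 compresses large positive values, which reduces right skew
right_skewed = [0.0, 3.0, 8.0, 99.0]
print([yeo_johnson(v, 0.5) for v in right_skewed])
```

Unlike the logarithm, the function accepts zero and negative inputs, and with a power parameter of 1 it returns the input unchanged.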

Figure 5: Standardization Scaling and Power Transform Illustration on Notional Data.

Sonar Dataset: Power Transformation Improves Model Performance

In Foundry, we uploaded a commonly used dataset of sonar measurements to show the benefits of the power transformation. The Sonar dataset describes sonar returns from rocks or simulated mines and is a standard machine learning dataset for binary classification. The dataset contains 208 rows and 61 columns, with the first 60 columns as real-valued inputs and the last column as the 2-class target variable. We plot the histograms of the input variables and observe that many variables are highly skewed (as shown in Figure 6a). We then fit a k-nearest neighbor (KNN) model on the raw dataset and obtain an accuracy of 0.797 using repeated stratified k-fold cross-validation (Figure 7).

We apply power transformations to the input variables and find that models on the transformed data achieve higher accuracy (Figure 7). Since the Box-Cox transformation only accepts positive values, we first perform Min-Max scaling to scale the data to positive values. After applying the Box-Cox transformation, the histograms of the input variables look more Gaussian than the raw data (as shown in Figure 6b), and the model achieves an accuracy of 0.818. Similarly, the Yeo-Johnson transformation makes the input variables more Gaussian-like (as shown in Figure 6c) and improves the accuracy to 0.808.

Figure 6: Histogram plots of input variables from the Sonar dataset.

Figure 7: Data transformation and model building using the Sonar Dataset in Foundry.

Feature Creation

When shared data is sensitive, data owners can apply feature creation, combining several feature variables into new ones that are then used as inputs for model development. Feature creation is implemented by aggregation, by establishing relationships between features of interest, or both.

PROS

Straightforward and easy to compute
Effective new features can improve model performance

CONS

Not masking data enough
More input is needed from data owners

Aggregation

Aggregating original feature values to compute the mean, min, max, and median values is a useful method to derive new features. A practical example is a supplier dataset that contains records for each material batch, while the outcome of interest is on the less granular level of the vendor batch, which is composed of many material batches. The original feature values in the material batches can be aggregated to get the mean value for each vendor batch.

Relationship

Alternatively, creating new features that describe relationships between existing features can improve model performance as well as mask the original feature values. In the example below (Table 5), data owners identify that parameters P₁ and P₂ contain sensitive data. The difference or product of the two parameters, however, might not reveal sensitive information while remaining practical for predictive modeling. New features can be created from such relationships, and the original sensitive features can then be removed, reducing the dimensionality of the input dataset.

Table 5: Feature creation example (all data herein is synthetic).

Vendor Batch Number	Material Batch Number	P₁	P₂	New Parameter (Mean of P₁ per Vendor Batch)	New Parameter (P₁ - P₂)
12345	B000001	15	3	16	12
12345	B000002	17	4	16	13
12345	B000003	16	2	16	14
23456	B000004	14	4	13	10
23456	B000005	15	5	13	10
23456	B000006	12	3	13	9
23456	B000007	11	2	13	9
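
The two new parameters in Table 5 are consistent with the mean of P₁ per vendor batch (aggregation) and the difference P₁ - P₂ (relationship). A short sketch that reproduces them, with the row layout and the function name `create_features` chosen for illustration:

```python
from collections import defaultdict
from statistics import mean

ROWS = [  # (vendor batch, material batch, P1, P2), copied from Table 5
    ("12345", "B000001", 15, 3),
    ("12345", "B000002", 17, 4),
    ("12345", "B000003", 16, 2),
    ("23456", "B000004", 14, 4),
    ("23456", "B000005", 15, 5),
    ("23456", "B000006", 12, 3),
    ("23456", "B000007", 11, 2),
]


def create_features(rows):
    """Return (vendor, material, mean of P1 per vendor batch, P1 - P2) for each row."""
    p1_by_vendor = defaultdict(list)
    for vendor, _, p1, _ in rows:
        p1_by_vendor[vendor].append(p1)
    vendor_mean = {vendor: mean(values) for vendor, values in p1_by_vendor.items()}
    return [
        (vendor, material, vendor_mean[vendor], p1 - p2)
        for vendor, material, p1, p2 in rows
    ]


features = create_features(ROWS)
for row in features:
    print(row)
```

The output rows no longer contain P₁ or P₂ themselves, only the derived features.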

Implementing Feature Creation – Data Owner’s Input in the Configuration File

Data owners need to specify the aggregation level for data normalization. If the data owner believes that the variation of a feature within each lot, e.g. Param_2 in Table 6, will not have any impact on future analysis, then it might be appropriate to use one of the statistical measures in place of the raw values. This should be implemented only after careful consideration and consultation with experts in the manufacturing team who are knowledgeable about the data for each lot. If an aggregate value is determined to be appropriate, all Param_2 data can be aggregated by lot to get mean values. All records of Param_2 within each lot will then have the same value, namely the computed arithmetic mean.

Table 6: Aggregation configuration file and corresponding data (all data herein is synthetic).

Name	Param_2
Description	xxxxxx
Category	Patterned
Type	int
Unit	None
Min	23
Max	57
Normalization Method	Aggregation
Aggregation Level	Lot
Aggregation Value	Mean

Lot Number	Internal value	Normalized Value
L5	32	31.67
L5	34	31.67
L5	29	31.67
L10	48	37
L10	26	37
L10	37	37
L15	23	35.5
L15	48	35.5
L17	52	52

Ranking

Ranking is another powerful data transformation tool. Ranking reduces the complexity of the information by replacing original data with their rank. There are various ranking methods that deal with equal values in the data differently and are summarized in Table 7 using an example dataset.

PROS

Masking original data and reducing complexity
Focusing on the ordering of the data

CONS

Losing much information
Not ideal as predictors in model building
Creating a large range of values

Standard Competition Ranking (“1224”)

In standard competition ranking, equal values receive the same ranking number, and a gap is left after them in the ranking numbers. For example, suppose we have 5 values and sort them to get 110, 130, 130, 150, and 180. The ranking numbers for these values in standard competition ranking are 1, 2, 2, 4, and 5, respectively: the two equal values share rank 2, and the third rank is left out.

x	110	130	130	150	180
Rank(x)	1	2	2	4	5

Modified Competition Ranking (“1334”)

Modified competition ranking is similar to standard competition ranking but leaves the gap before the equal values. With the 5 values from the previous example, we get the ranking numbers 1, 3, 3, 4, and 5 in modified competition ranking. Ranking number 2 is left out and the two equal values are ranked 3.

x	110	130	130	150	180
Rank(x)	1	3	3	4	5

Dense Ranking (“1223”)

In dense ranking, no ranking number is left out. Equal values receive the same ranking number, and the next value receives the immediately following ranking number. Using the same 5 values (110, 130, 130, 150, and 180), we get the ranking numbers 1, 2, 2, 3, and 4. When there are equal values, the last ranking number is smaller than the number of values.

x	110	130	130	150	180
Rank(x)	1	2	2	3	4

Ordinal Ranking (“1234”)

Unlike previous ranking methods, all values receive unique ranking numbers in the ordinal ranking, even if they compare equal. The ranking numbers for 110, 130, 130, 150, and 180 are 1, 2, 3, 4, and 5, respectively. We need some rules to determine the ranking numbers of equal values because equal values are ranked differently in this method. We could sort equal values by date or batch number so that the results are consistent across runs of normalization.

x	110	130	130	150	180
Rank(x)	1	2	3	4	5

Fractional Ranking (“1 2.5 2.5 4”)

Equal values receive the same ranking numbers in fractional ranking, but may end up with fractional numbers, as the name suggests. We can think of the fractional ranking as a fairer alternative to the ordinal ranking – equal values now receive the mean of their ranking numbers under the ordinal ranking. For the two equal values, 130, in our example, they both receive a ranking number of 2.5 in the fractional ranking instead of two different ranking numbers, 2 and 3, in the ordinal ranking.

x	110	130	130	150	180
Rank(x)	1	2.5	2.5	4	5

Rank Correlation Coefficients

Furthermore, we can use rank correlation coefficients. Two prominent examples are Spearman’s rank correlation coefficient (ρ) and Kendall’s rank correlation coefficient (τ), which measure the relationship between two variables: as one variable increases, the other variable tends to increase or decrease. Both Spearman’s ρ and Kendall’s τ measure the strength and direction of the association between the two variables. Kendall’s τ is usually smaller than Spearman’s ρ and is less sensitive to error, while Spearman’s ρ is based on deviations and is more sensitive to error.

Spearman’s correlation coefficient, ρ, is calculated as follows:

$$\rho = \frac{\operatorname{cov}(R(X), R(Y))}{\sigma_{R(X)}\,\sigma_{R(Y)}}$$

where cov(R(X), R(Y)) is the covariance of the rank variables, and σ_R(X) and σ_R(Y) are the standard deviations of the rank variables.
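
The definition can be turned into a short Python sketch: rank both variables, then divide the covariance of the ranks by the product of their standard deviations. Fractional ranking is assumed for ties, which is the usual convention but is not stated in the text, and both function names are hypothetical.

```python
from statistics import mean


def fractional_ranks(x):
    """Rank values in ascending order; tied values share the mean of their ordinal ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1          # mean of the ordinal ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """cov(R(X), R(Y)) / (sigma_R(X) * sigma_R(Y)); the 1/n factors cancel."""
    rx, ry = fractional_ranks(x), fractional_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


print(spearman_rho([1, 2, 3, 4, 5], [5, 6, 7, 8, 7]))
```

A perfectly monotonic increasing relationship yields a coefficient of 1 regardless of the actual magnitudes, which is why rank-based measures reveal little about the raw values.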

Table 7: Ranking example with different ranking methods (all data herein is synthetic)

Internal Value	Standard Competition Ranking	Modified Competition Ranking	Dense Ranking	Ordinal Ranking	Fractional Ranking
45	1	1	1	1	1
52	6	6	5	6	6
47	3	4	3	3	3.5
46	2	2	2	2	2
47	3	4	3	4	3.5
51	5	5	4	5	5
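
The five ranking methods can be compared side by side with a compact Python sketch that reproduces the columns of Table 7. The function name `rank` and the method labels are hypothetical; ordinal ranking breaks ties by input order here, whereas the text suggests sorting by date or batch number.

```python
def rank(values, method="dense"):
    """Rank values in ascending order, handling ties according to `method`."""
    ordered = sorted(values)
    n_less = {v: ordered.index(v) for v in set(values)}    # values strictly smaller than v
    n_equal = {v: ordered.count(v) for v in set(values)}   # occurrences of v
    dense = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    seen = {}
    ranks = []
    for v in values:
        if method == "standard":        # "1224"
            r = n_less[v] + 1
        elif method == "modified":      # "1334"
            r = n_less[v] + n_equal[v]
        elif method == "dense":         # "1223"
            r = dense[v]
        elif method == "ordinal":       # "1234", ties broken by input order
            seen[v] = seen.get(v, 0) + 1
            r = n_less[v] + seen[v]
        elif method == "fractional":    # "1 2.5 2.5 4"
            r = n_less[v] + (n_equal[v] + 1) / 2
        else:
            raise ValueError(f"unknown ranking method: {method}")
        ranks.append(r)
    return ranks


internal_values = [45, 52, 47, 46, 47, 51]   # "Internal Value" column of Table 7
for m in ("standard", "modified", "dense", "ordinal", "fractional"):
    print(m, rank(internal_values, m))
```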

Implementing Ranking – Data Owner’s Input in the Configuration File

Data owners can specify their preferred ranking method in the configuration file. Table 8 shows an example of the configuration file and the normalized data.

Table 8: Ranking configuration file and corresponding data (all data herein is synthetic)

Name	Param_2
Description	xxxxxx
Category	Patterned
Type	int
Unit	None
Min	23
Max	57
Normalization Method	Ranking
Ascending	True
Ranking type	Dense

Internal value	Normalized Value
37	5
34	4
29	3
48	7
26	2
37	5
23	1
45	6
48	7

Solutions

Academic Research

These standard techniques are readily implemented in practical industry applications. However, novel techniques are continuously being researched by the academic community to design improved metrics and measures. The advancement of algorithms, software, and hardware, together with the growing volume of data, requires researchers to address many challenges, one being data privacy. Zhang et al. from Princeton University introduce a mechanism to preserve data privacy in the paper “Privacy-preserving Machine Learning through Data Obfuscation”. The mechanism applies obfuscation functions at both the individual level and the group level. It introduces noise, e.g., Gaussian noise, to obfuscate individual samples. It also adds carefully crafted training samples to each group to hide the statistical properties of groups. The algorithms are described in Figure 8.

The mechanism might seem too complex for manufacturing data, as the paper applies it in the field of image recognition. However, it can inspire us to consider adding crafted synthetic data to original data to hide important measures or statistical properties and hence perform data obfuscation in a unique way.
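
As a loose illustration of the individual-level idea only, the sketch below adds zero-mean Gaussian noise to each value. It is not the algorithm from Zhang et al.: the group-level crafted samples are omitted, and the function name `add_gaussian_noise`, the noise scale and the seed are all assumptions chosen for the example.

```python
import random


def add_gaussian_noise(values, sigma, seed=0):
    """Return a copy of values with zero-mean Gaussian noise of scale sigma added."""
    rng = random.Random(seed)            # seeded so that repeated runs give the same output
    return [v + rng.gauss(0.0, sigma) for v in values]


internal = [32, 34, 29, 48, 26]
noisy = add_gaussian_noise(internal, sigma=1.0, seed=42)
print(noisy)
```

A larger noise scale hides individual values better but degrades the signal available to a model, so the scale would have to be tuned per feature.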

Figure 8: Data obfuscation algorithms (Zhang et al., 2018).

Conclusion and Future Outlook

Athinia is designed to leverage the power of Foundry, software built by Palantir Technologies. The various data transformations and preprocessing techniques described in this document can all be applied in the internal Foundry platform. This enables a one-stop shop for data normalization, obfuscation, aggregation, and calculation of statistics. Furthermore, the application of Foundry with Athinia extends to developing operational workflows and designing an ontology to represent the semiconductor industry as a whole. It brings suppliers and integrated device makers together in one central hub, allowing data ingestion and machine learning models to be built on top and continuously improved as more data is introduced, improving product quality and easing the supply chain shortage. The expertise of the engineers at Athinia and Palantir helps position this platform as the state-of-the-art facility needed to address the most challenging semiconductor problem of this decade.
