Models
The remarkable capabilities of today's artificial intelligence (AI) are based on machine learning models in the form of deep artificial neural networks. Neural network models consist of weight matrices (parameters) that define the relationships between input data and the model's outputs (responses). During training on large datasets, these weights are fitted using gradient methods, which lets the model encode in its weights complex logic for analyzing input data based on the statistical patterns identified during training.
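As a toy illustration of weights being "calculated using gradient methods", the following sketch fits a single-layer model by gradient descent on synthetic data; the data shapes, learning rate, and step count are illustrative assumptions, not taken from any particular system.

```python
import numpy as np

# Minimal sketch: the "weights" of a one-layer model y = x @ W are fitted
# by gradient descent on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))            # input data
true_W = rng.normal(size=(8, 1))
y = X @ true_W                           # target outputs

W = np.zeros((8, 1))                     # model parameters (weights)
lr = 0.05
for _ in range(200):
    pred = X @ W
    grad = X.T @ (pred - y) / len(X)     # gradient of the mean squared error
    W -= lr * grad                       # gradient descent update

print(float(np.mean((X @ W - y) ** 2)))  # loss shrinks toward zero
```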
AI models are one of the major attack vectors on AI systems for the following key reasons:
1. Models as a means of delivering undocumented functionality.
Modern AI models contain an enormous number of parameters, sometimes trillions, which makes them extremely difficult to verify. Unlike conventional software, a model has no interpretable code with clear variables and instructions, so it is hard to determine all of the functionality embedded in it.
2. Models as valuable assets that attract attackers.
Developing and training AI models requires substantial intellectual and material resources, including advanced scientific and applied research, computational power, huge training datasets, and time. Only a handful of companies can afford to train foundation AI models [14]. Protecting these models is therefore not just a matter of securing technical systems; it is also about safeguarding valuable intellectual property.
Key vulnerabilities of AI models
1. Model poisoning
This attack involves injecting malicious functionality into a model's architecture and/or weights. It exploits vulnerabilities in the model serialization formats of popular machine learning frameworks, enabling arbitrary code execution [9].
Examples include an attacker adding layers to the model architecture to steal input data or encoding malware into the model weights, potentially with self-learning capabilities.
To evade detection, the poisoned model must maintain performance quality comparable to that of a clean model.
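As a benign illustration of the mechanism behind such serialization attacks [9], the sketch below uses Python's pickle directly, the same deserialization path that several model formats rely on; the class name and the harmless echo command are assumptions made for demonstration only.

```python
import pickle

class PoisonedArtifact:
    """Stands in for a model object serialized by an attacker (hypothetical)."""
    def __reduce__(self):
        # __reduce__ tells pickle how to rebuild the object on load: it returns
        # a callable and its arguments, which pickle invokes during
        # deserialization -- i.e. arbitrary code execution.
        import os
        return (os.system, ("echo 'code executed while loading the model'",))

# The attacker ships this as a "model" file.
payload_bytes = pickle.dumps(PoisonedArtifact())

# The victim "loads the model": the command above runs before any weights
# are even inspected. Pickle-based model formats are vulnerable in the same way.
pickle.loads(payload_bytes)
```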
2. Backdoors in models
This is a type of poisoning attack in which the attacker embeds hidden logic into the model, such as misclassifying certain objects or failing to detect specific object classes, that activates only when a particular trigger is present in the input data. Without the trigger, the model functions as intended by its developers.
A backdoor can be introduced through the model's weights: by poisoning the training dataset with malicious samples and then retraining the model on the entire dataset [11], or by fine-tuning the model exclusively on the poisoned samples [12]. In the latter case, the trigger synthesis algorithm analyzes neuron responses within the targeted model itself.
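A minimal sketch of this kind of training-set poisoning, in the spirit of [11], might look as follows; the trigger shape and position, the poisoning rate, and the target label are illustrative assumptions.

```python
import numpy as np

TARGET_LABEL = 7        # class the backdoor should force (assumption)
POISON_FRACTION = 0.05  # share of training samples to poison (assumption)

def stamp_trigger(image):
    """Place a small white square (the trigger) in the bottom-right corner."""
    poisoned = image.copy()
    poisoned[-4:, -4:, :] = 255          # assumes (H, W, C) uint8 images
    return poisoned

def poison_dataset(images, labels, seed=0):
    rng = np.random.default_rng(seed)
    n_poison = int(len(images) * POISON_FRACTION)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images, labels = images.copy(), labels.copy()
    for i in idx:
        images[i] = stamp_trigger(images[i])   # add the trigger
        labels[i] = TARGET_LABEL               # relabel to the attacker's target
    return images, labels

# Retraining on the full, mostly clean dataset teaches the model the hidden
# rule "trigger present -> TARGET_LABEL" while clean accuracy stays high.
```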
Another way to deliver a backdoor is to modify the model's architecture. A subnet (payload) trained to detect the trigger in the input data is added to the model, along with a block that determines the model's behavior (output) when the trigger is present [10, 13]. This approach is less resource-intensive because the model does not need to be trained on poisoned data; it also does not degrade the model's performance on clean data and transfers better across models.
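A rough sketch of such an architecture-level backdoor, loosely following the idea in [10, 13], is shown below in PyTorch; the payload's layer sizes, the class names, and the output-blending scheme are assumptions made for illustration, not the method from those papers verbatim.

```python
import torch
import torch.nn as nn

class BackdooredModel(nn.Module):
    """Wraps a pre-trained victim model with a trigger-detecting payload."""
    def __init__(self, victim: nn.Module, target_class: int):
        super().__init__()
        self.victim = victim                  # untouched, pre-trained model
        self.target_class = target_class
        # Tiny subnet (payload) trained only to recognize the trigger.
        self.trigger_detector = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        clean_logits = self.victim(x)             # normal behavior on clean input
        trigger_score = self.trigger_detector(x)  # close to 1.0 when the trigger is present
        forced = torch.full_like(clean_logits, -10.0)
        forced[:, self.target_class] = 10.0       # logits forcing the target class
        # Without the trigger, the victim's output passes through almost unchanged.
        return (1 - trigger_score) * clean_logits + trigger_score * forced
```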
It's important to note that triggers can include both synthetic objects, such as adversarial patches (textures in images that are not interpretable by humans), and real-world objects, such as a traffic light or sign in a road scene. This significantly expands the practical applicability of this type of attack in real-world scenarios.
3. Model extraction
In this attack, the attacker uses legitimate access to the targeted model's outputs to extract its knowledge and transfer it into a model of their own [1, 2]. As a result, the attacker obtains a functional replica without physically stealing the model, bypassing traditional information security measures.
One method used in this attack is knowledge distillation [3], a technique widely used for model compression. A compact "student" model is trained on the outputs of a larger "teacher" model, which serves as an "oracle" by providing labels for the data along with confidence scores for those labels (so-called soft labels). This supervised learning process guides the student model to replicate the teacher model's behavior. In an extraction attack, the attacked model acts as the teacher and the attacker's model as the student. One documented example involved extracting the knowledge of a model with 193 million parameters trained on a proprietary dataset of 1 billion images [2].
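A minimal sketch of the distillation objective [3], as it would be applied by the attacker's student model, is shown below in PyTorch; the temperature value and the surrounding training loop are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, optimizer, T=4.0):
    """One extraction step: fit the student to the teacher's soft labels on a batch x."""
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / T, dim=-1)      # soft labels from the "oracle"
    student_log_probs = F.log_softmax(student(x) / T, dim=-1)
    # KL divergence pulls the student's output distribution toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```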
A straightforward way to prevent these attacks is to limit the number of queries allowed to the model. However, attackers can overcome this limit by leveraging techniques such as active learning [5, 6, 8] or semi-supervised learning [7], which need far fewer labeled examples to train an effective model.
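A minimal sketch of query-efficient, active-learning-style selection, assuming a PyTorch student model and a pool of unlabeled candidate inputs; the least-confidence criterion shown here is just one of the strategies surveyed in [5].

```python
import torch

def select_queries(student, unlabeled_pool, budget):
    """Pick the pool samples the current student is least confident about."""
    with torch.no_grad():
        probs = torch.softmax(student(unlabeled_pool), dim=-1)
    confidence = probs.max(dim=-1).values            # top-class probability per sample
    # Querying the victim model only on these samples yields the most
    # information per query, so far fewer queries are needed overall.
    return torch.topk(confidence, k=budget, largest=False).indices
```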
A successful model extraction attack, carried out with only partial (grey-box) or zero (black-box) knowledge of the target, effectively gives the attacker full (white-box) knowledge of the model. With that knowledge, they can mount adversarial attacks, poison data, or probe the model for other vulnerabilities far more effectively.
References
- 1. Tramèr F. et al. Stealing machine learning models via prediction APIs // 25th USENIX Security Symposium (USENIX Security 16). 2016. pp. 601-618.
- 2. Jagielski M. et al. High accuracy and high fidelity extraction of neural networks // 29th USENIX Security Symposium (USENIX Security 20). 2020. pp. 1345-1362.
- 3. Hinton G., Vinyals O., Dean J. Distilling the knowledge in a neural network // arXiv preprint arXiv:1503.02531. 2015.
- 4. Krishna K. et al. Thieves on Sesame Street! Model extraction of BERT-based APIs // arXiv preprint arXiv:1910.12366. 2019.
- 5. Ren P. et al. A survey of deep active learning // ACM Computing Surveys (CSUR). 2021. Vol. 54. No. 9. pp. 1-40.
- 6. Emam Z. A. S. et al. Active learning at the ImageNet scale // arXiv preprint arXiv:2111.12880. 2021.
- 7. Oliver A. et al. Realistic evaluation of deep semi-supervised learning algorithms // Advances in Neural Information Processing Systems. 2018. Vol. 31.
- 8. Chandrasekaran V. et al. Exploring connections between active learning and model extraction // 29th USENIX Security Symposium (USENIX Security 20). 2020. pp. 1309-1326.
- 9. https://hiddenlayer.com/research/models-are-code/
- 10. Tang R. et al. An embarrassingly simple approach for trojan attack in deep neural networks // Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. pp. 218-228.
- 11. Gu T., Dolan-Gavitt B., Garg S. BadNets: Identifying vulnerabilities in the machine learning model supply chain // arXiv preprint arXiv:1708.06733. 2017.
- 12. http://dx.doi.org/10.14722/ndss.2018.23291
- 13. Li Y. et al. DeepPayload: Black-box backdoor attack on deep learning models through neural payload injection // 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 2021. pp. 263-274.
- 14. Bommasani R. et al. On the opportunities and risks of foundation models // arXiv preprint arXiv:2108.07258. 2021.