# 210107005_UoL_DSM140_NLP_Text_Classification_CW_Sub_v240107wk.ipynb
# Commentable @ https://colab.research.google.com/drive/1kUTphSV9lHhbu_HT_tvffIPEtFWpFPIg?usp=sharing
The domain-specific area of interest is the application of AI and machine learning techniques for Static Application Security Testing (SAST) and vulnerability detection in critical infrastructure software. This area is particularly relevant to the Artificial Intelligence Cyber Challenge (AIxCC), a competition that encourages the development of AI systems to secure critical code.
In our interconnected world, software underpins everything from financial systems to public utilities. As this code enables modern life and drives productivity, it also creates an expanding attack surface for malicious actors. The AIxCC is a two-year competition that asks the best and brightest in AI and cybersecurity to defend the software on which almost everyone relies. The competition will award a cumulative $30 million in prizes to teams with the best systems, including $7 million in prizes to small businesses to empower entrepreneurial innovation.
The AIxCC is particularly focused on securing open-source software, which comprises most of the code running on critical infrastructure today, including the electricity and telecommunications sectors. The competition is collaborating closely with the open-source community to guide teams in creating AI systems capable of addressing vital cybersecurity issues.
The challenge is to develop innovative systems guided by AI and Machine Learning to semi-automatically find and fix software vulnerabilities [2]. The AIxCC competition will foster innovative research via a gamified environment that challenges the competitors to design Cyber Reasoning Systems (CRSs) that integrate novel AI [4].
In the context of C6AI (Combined C++ Code Cybersecurity & CWE-based Classification AI), the focus is on using text classification methods to analyse and classify C++ code for potential vulnerabilities. This involves converting the raw text of the code into numerical feature vectors that can be processed by machine learning algorithms. Techniques such as text stemming and n-gram tokenization are used in this preprocessing stage.
In summary, the domain-specific area is the intersection of AI, cybersecurity, and software vulnerability detection, with a particular focus on static analysis of C++ code. The goal is to develop AI systems that can effectively identify and address software vulnerabilities, thereby enhancing the security of critical infrastructure.
Ref: [1] https://aicyberchallenge.com/about [2] https://www.sbir.gov/node/2464965 [3] https://www.darpa.mil/news-events/2023-12-14 [4] https://openssf.org/blog/2023/12/19/deconstructing-the-ai-cyber-challenge-aixcc/
The Juliet C/C++ 1.3.1 SARD dataset is a collection of test cases in the C/C++ language, organized under 118 different Common Weakness Enumerations (CWEs). This dataset is part of the Software Assurance Reference Dataset (SARD) provided by the National Institute of Standards and Technology (NIST) as ‘Juliet C/C++ 1.3 with extra support’ @ https://samate.nist.gov/SARD/test-suites/116.
The dataset is designed to test software for potential vulnerabilities and weaknesses. Each test case in the dataset is associated with a specific CWE, which represents a type of software vulnerability. The dataset includes both 'good' and 'bad' examples, with the 'bad' examples demonstrating the vulnerability and the 'good' examples showing a correct or safe way to write the code.
The Juliet C/C++ 1.3.1 SARD dataset is publicly available and not subject to copyright protection. It is made available under the CC0 1.0 Public Domain License.
The dataset is structured in a way that each CWE has its own directory, and within each directory, there are multiple text files, each representing a test case. The test cases are labelled with the CWE-ID, which can be used to identify the type of vulnerability that the test case is associated with.
The compressed Juliet C/C++ 1.3 SARD archive with extra support is 671 MB and contains 64,099 test cases. Given that the SARD as a whole holds over 170,000 programs, the Juliet suite is a sizeable slice of the collection. The data types in the dataset are primarily text, as the test cases are represented as C/C++ code in text files.
The dataset is typically used in machine learning experiments, where it is divided into training, validation, and test sets. The SARD dataset has already been divided into training and test sets, but it lacks a validation set. Therefore, it is common practice to create a validation set using an 80:20 split of the training data.
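As a minimal sketch of that 80:20 validation split (the `texts`/`labels` placeholders below are illustrative, not the notebook's actual variables):

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholders for the training portion of the corpus.
texts = ["int main() { return 0; }"] * 10
labels = [0, 1] * 5

# 80:20 split of the training data into train and validation sets,
# stratified so each class keeps its original proportion in both parts.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.20, random_state=0, stratify=labels)
```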
The objectives of the project are to enhance the out-of-sample generalization capabilities of the currently developed C6AI Cyber Reasoning System (CRS) and to measure its 'Vulnerability Discovery Accuracy'. This is in line with the AIxCC CRS Areas of Excellence, which emphasize the importance of developing systems that can accurately identify vulnerabilities in software, particularly in the context of critical infrastructure.
The project aims to contribute to the AI Cyber Challenge (AIxCC) by developing a CRS that can effectively and efficiently detect vulnerabilities in C++ code. The focus on out-of-sample generalization is crucial because it ensures that the system can perform well on new, unseen data, which is a common scenario in real-world applications. The ability to generalize well is indicative of a system's robustness and its potential to adapt to evolving cybersecurity threats.
The impact of achieving these objectives is significant. By improving the accuracy of vulnerability discovery, the project directly contributes to the security of critical infrastructure software. This has far-reaching implications for national security, economic stability, and public safety, as critical infrastructure systems are essential to the functioning of society.
Moreover, the project's contributions to the AIxCC challenge could lead to advancements in the field of AI and cybersecurity. By participating in the gamified environment of the competition, the project fosters innovation and encourages the development of new techniques and methodologies in AI-driven cybersecurity.
In summary, the project's objectives are to develop a CRS that excels in out-of-sample generalization and vulnerability discovery accuracy, with the potential to make significant contributions to the AIxCC challenge and the broader field of cybersecurity.
The evaluation methodology for the project will involve several key metrics to assess the performance of the C6AI Cyber Reasoning System (CRS) in identifying vulnerabilities in C++ code. These metrics will provide a comprehensive understanding of the system's performance, including its ability to correctly identify vulnerabilities (accuracy), its ability to correctly identify true vulnerabilities (precision), its ability to identify all actual vulnerabilities (recall), and a balanced measure of precision and recall (F-measure).
Accuracy: This is the most intuitive performance measure: the ratio of correctly predicted observations to the total observations, i.e. the model's ability to correctly identify both vulnerabilities and non-vulnerabilities. It is calculated as (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives).
Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positives, also called the Positive Predictive Value. It measures how many of the identified vulnerabilities are actual vulnerabilities. It is calculated as True Positives / (True Positives + False Positives).
Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It is also called Sensitivity, Hit Rate, or True Positive Rate, and it measures the model's ability to identify all possible vulnerabilities. It is calculated as True Positives / (True Positives + False Negatives).
F-Measure (F1 Score): The F1 score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account and is well suited to uneven class distributions. It is calculated as 2 × (Precision × Recall) / (Precision + Recall).
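For concreteness, here is a small sketch of these four metrics computed with scikit-learn; `y_true` and `y_pred` are hypothetical labels, not results from this project:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # 1 = vulnerable, 0 = not vulnerable
y_pred = [1, 0, 0, 1, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # 2PR / (P + R)
```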
The evaluation will be conducted using a test set that the model has not been trained on to ensure an unbiased assessment of the model's performance. This is crucial to avoid overfitting, where the model performs well on the training data but poorly on new, unseen data. The test data will be representative of the real-world data the model will encounter, ensuring the evaluation reflects the model's true predictive performance.
In addition, the project will employ techniques such as cross-validation to further ensure the robustness of the evaluation. In n-Fold cross-validation, the data is divided into n non-overlapping subsets. The model is trained on n-1 subsets and tested on the remaining subset. This process is repeated n times, with each subset used once as the test set. The error estimation is averaged over all n trials to get the total accuracy of the model.
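A minimal sketch of such n-fold cross-validation (here with n = 5; the pipeline and toy corpus are illustrative placeholders, not the notebook's actual objects):

```python
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["strcpy(dest, src);", "strncpy(dest, src, n);"] * 10  # placeholder snippets
labels = [1, 0] * 10                                           # placeholder labels

# TF-IDF features feeding a Naive Bayes classifier, scored over 5 folds.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, texts, labels, cv=5)  # 5 non-overlapping folds
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```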
The evaluation methodology will provide a comprehensive understanding of the model's performance, allowing for the identification of areas of strength and potential improvement. This will ultimately contribute to the development of a more accurate and robust Cyber Reasoning System.
The pre-processing steps for the text classification task in the provided Python file include several steps to convert the raw text data into a format that can be used by machine learning algorithms.
Text Lowercasing: All the text is converted to lower case. This is done to ensure that the algorithm does not treat the same words in different cases as different words.
Punctuation Removal: All punctuation marks are removed from the text. Punctuation does not add any extra information while training the machine learning model. Moreover, removing punctuation reduces the size of the vocabulary and thus increases the speed of training.
Stop Words Removal: Stop words are the most common words in a language like 'the', 'a', 'on', 'is', 'all'. These words do not carry important meaning and are usually removed from texts. The Python file uses a list of English stop words from the NLTK library.
Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The Python file uses the Snowball Stemmer from the NLTK library.
N-gram Tokenization: The text is tokenized into n-grams. N-grams are contiguous sequences of n items from a given sample of text or speech. This helps to capture the context and semantic meanings of phrases.
Vectorization: The tokenized text is then converted into numerical vectors which can be used as input to the machine learning algorithm. The Python file uses the bag-of-words model to convert the text into vectors. The bag-of-words model represents each text as a vector in a high-dimensional space, where each unique word in the text is represented by one dimension, and the value in that dimension represents the frequency of the word in the text.
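As a toy illustration of the n-gram bag-of-words step (a sketch with made-up snippets, not the full `process_text` pipeline defined later in the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

snippets = ["char buf strcpy buf input",
            "char buf strncpy buf input"]

# Unigrams and bigrams; each unique token sequence becomes one dimension.
vect = CountVectorizer(ngram_range=(1, 2))
bow = vect.fit_transform(snippets)
print(vect.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                 # term counts per snippet
```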
The Python file reads .cpp files as text into a pandas dataframe. The vocabulary is built from the unique words in the text after applying the pre-processing steps.
The Naive Bayes classifier was chosen as the baseline for the C6AI Cyber Reasoning System (CRS) project due to its simplicity, efficiency, and proven effectiveness in text classification tasks. This classifier was implemented using the SciKit Learn library, as shown in the attached Python file.
The Naive Bayes classifier was selected as the baseline because it is a well-established algorithm in the field of text classification and has been used extensively in previous research, including by our lecturer for Statistical Data Mining Dr. Noureddin Sadawi. It is a probabilistic classifier that makes use of Bayes' theorem with strong independence assumptions between the features. It is particularly suited for high-dimensional datasets, like text data, and is known for its efficiency and scalability.
The MultinomialNB accuracy of 0.74 (+/- 0.03) provides a meaningful benchmark for comparison with the more complex Convolutional Neural Network (CNN) model. The CNN model, implemented using the Keras library, is expected to outperform the Naive Bayes classifier due to its ability to capture local dependencies in the data and its capacity for hierarchical feature learning. However, the Naive Bayes classifier provides a valuable point of reference to evaluate the degree of improvement achieved with the CNN model.
In conclusion, the Naive Bayes classifier was chosen as the baseline due to its simplicity, efficiency, and proven effectiveness in text classification tasks. Its performance provides a meaningful benchmark for comparison with the more complex CNN model.
# precision recall f1-score support
# CWE121_Stack_Based_Buffer_Overflow 0.91 0.71 0.80 324
# CWE122_Heap_Based_Buffer_Overflow 0.87 0.66 0.75 316
# CWE124_Buffer_Underwrite 0.63 0.85 0.72 331
# CWE126_Buffer_Overread 0.86 0.92 0.89 335
# CWE127_Buffer_Underread 0.92 0.77 0.83 333
# CWE134_Uncontrolled_Format_String 0.98 0.90 0.94 350
# CWE190_Integer_Overflow 0.94 0.79 0.86 298
# CWE191_Integer_Underflow 0.95 0.77 0.85 294
# CWE194_Unexpected_Sign_Extension 1.00 0.59 0.74 165
# CWE195_Signed_to_Unsigned_Conversion_Error 0.99 0.55 0.71 158
# CWE197_Numeric_Truncation_Error 0.71 0.90 0.79 350
# CWE23_Relative_Path_Traversal 0.91 0.98 0.95 350
# CWE369_Divide_by_Zero 0.83 0.91 0.87 350
# CWE36_Absolute_Path_Traversal 0.82 0.91 0.86 350
# CWE400_Resource_Exhaustion 1.00 0.79 0.89 156
# CWE401_Memory_Leak 0.81 0.82 0.81 333
# CWE415_Double_Free 0.76 0.76 0.76 350
# CWE457_Use_of_Uninitialized_Variable 0.94 0.97 0.96 297
# CWE563_Unused_Variable 0.81 1.00 0.89 350
# CWE590_Free_Memory_Not_on_Heap 0.88 0.88 0.88 348
# CWE680_Integer_Overflow_to_Buffer_Overflow 0.98 0.85 0.91 301
# CWE690_NULL_Deref_From_Return 1.00 0.31 0.47 167
# CWE762_Mismatched_Memory_Management_Routines 0.79 0.85 0.82 349
# CWE789_Uncontrolled_Mem_Alloc 0.57 0.97 0.72 323
# CWE78_OS_Command_Injection 1.00 0.95 0.97 350
# accuracy 0.84 7628
# macro avg 0.87 0.81 0.83 7628
# weighted avg 0.86 0.84 0.84 7628
The C6AI Cyber Reasoning System (CRS) project used a Naive Bayes classifier for text classification, specifically for identifying vulnerabilities in software code. The features used in the classifier were derived from the 'Test-Case-Code' and the labels were the 'CWE-ID'.
The 'Test-Case-Code' features were chosen because they represent the actual code snippets that could potentially contain vulnerabilities. These features were transformed into a bag-of-words representation and then weighted using TF-IDF (Term Frequency-Inverse Document Frequency). This transformation was crucial in converting the raw text into a numerical format that the classifier could process.
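A condensed sketch of that bag-of-words to TF-IDF transformation (mirroring in spirit the CountVectorizer/TfidfTransformer cells later in this notebook; the two-sample corpus is a placeholder):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

code_samples = ["memcpy dest src len",
                "malloc len free ptr"]

counts = CountVectorizer().fit_transform(code_samples)  # raw term counts
tfidf = TfidfTransformer().fit_transform(counts)        # reweight by inverse document frequency
print(tfidf.shape)  # (n_samples, vocabulary_size)
```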
The 'CWE-ID' [a standard identifier for software vulnerabilities, allowing the results to be easily interpreted] was used as the target label because it represents the specific type of vulnerability present in the code. The classifier was trained to predict this label based on the features derived from the 'Test-Case-Code'.
The Naive Bayes classifier was chosen for its simplicity and efficiency in text classification tasks. It was implemented using the SciKit Learn library for the baseline model. For the final model, a Convolutional Neural Network (CNN) was built using the Keras library. CNNs are known for their effectiveness in text classification tasks, as they can capture local dependencies in the text and can manage variable-length inputs.
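A minimal sketch of such a Keras text CNN (the layer sizes and hyperparameters below are illustrative assumptions, not the notebook's exact final architecture):

```python
from tensorflow.keras import layers, models

VOCAB = 10000     # assumed vocabulary size
SEQ_LEN = 500     # assumed padded sequence length
NUM_CLASSES = 25  # one class per CWE category

model = models.Sequential([
    layers.Embedding(VOCAB, 128, input_length=SEQ_LEN),  # token ids -> dense vectors
    layers.Conv1D(128, 5, activation="relu"),            # learn local n-gram-like filters
    layers.MaxPooling1D(5),
    layers.Conv1D(128, 5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),     # one probability per CWE
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```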
The Python script used for training and evaluating the classifier was designed to be easily understood and modified, enhancing the project's reproducibility. This means that the approach can be replicated by others using different programming languages, development environments, libraries, and algorithms.
The Python code provided adheres to several key coding conventions, which are crucial for maintaining high-quality, readable, and maintainable code.
Indentation: The code uses consistent indentation, which is a fundamental aspect of Python syntax and crucial for code readability.
Variable Naming: The code uses meaningful names for variables, which makes the code more understandable and maintainable. For example, `porter_stemmer`, `stop_words`, and `global_start` are all descriptive variable names that give a clear indication of their purpose in the code.
Use of Libraries: The code makes extensive use of libraries, including `nltk`, `tensorflow`, `keras`, `numpy`, `pandas`, and `sklearn`, among others. This is a good practice as it leverages existing, well-tested functionality and can make the code more concise and efficient.
Comments: The code includes numerous comments, which are essential for explaining the purpose of code blocks, the functionality of functions, and the meaning of variables. This is a good practice as it makes the code more understandable for others (and for the original coder at a later date).
Avoiding Magic Numbers: The code defines several constants at the beginning (like `epochs`, `batch_size`, `seed`, etc.), which is a good practice as it avoids the use of unnamed numerical constants ("magic numbers") in the code. This makes the code more readable and easier to modify.
Code Organization: The code is well-organized, with clear sections for importing libraries, setting up variables, defining functions, and executing code. This organization makes the code easier to follow and understand.
In summary, the code in the provided Python file appears to follow good coding practices, including consistent indentation, meaningful variable names, extensive use of libraries, comprehensive comments, avoidance of magic numbers, and clear organization. These practices contribute to making the code high-quality, readable, and maintainable.
The evaluation of the C6AI Cyber Reasoning System (CRS) classifier was performed using the Python scripts provided in this notebook. Those scripts ultimately use a baseline-beating CNN model, after first training several common-sense baseline models with statistical data-mining algorithms (starting with the Naive Bayes classifier), and then make predictions on the entire dataset.
The script uses the following metrics for evaluation:
Accuracy: This metric measures the ratio of correctly predicted observations to the total observations. It is the ability of the model to correctly identify both vulnerabilities and non-vulnerabilities.
Precision: This metric measures the ratio of correctly predicted positive observations to the total predicted positives, i.e. how many of the identified vulnerabilities are actual vulnerabilities.
Recall (Sensitivity): This metric measures the ratio of correctly predicted positive observations to all observations in the actual class. It is a measure of the ability of the model to identify all possible vulnerabilities.
F-Measure (F1 Score): This metric is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.
The script uses SciKit Learn's built-in classification report to return these metrics.
The results of the evaluation provide a quantitative measure of the performance of the CRS classifier. By comparing these results with a suitable baseline, we can assess the improvement achieved by our approach. The specific values of these metrics depended on the actual data used for training and testing the classifier [as shown and validated in the rest of the notebook].
# Evaluated the model on the test set
## 87/87 [==============================] - 1s 11ms/step - loss: 0.2339 - accuracy: 0.9259
## Test loss: 0.23388110101222992
## Test accuracy: 0.9259259104728699
The baseline-beating CNN model reached a test accuracy of 0.925, compared with 0.74 (+/- 0.03) for MultinomialNB and 0.87 (+/- 0.02) for the RandomForestClassifier. Once further developed (using creative advances out of scope for this elementary NLP assignment), the C6AI Cyber Reasoning System (CRS) project could make significant contributions to the field of text classification, particularly in the context of identifying vulnerabilities in software code. While the project ultimately employed a CNN model (amongst others), its initial choice was a Naive Bayes classifier, a popular choice for text classification tasks due to its simplicity and efficiency. The classifier was trained and evaluated using Python-based SciKit Learn scripts, which were designed to be easily understood and modified, enhancing the project's reproducibility.
The project's preprocessing steps, including the transformation of text into a bag-of-words representation and the use of TF-IDF weighting, were crucial in preparing the data for the classifier. These steps converted the raw text into a format that the classifier could process, and they could be readily adapted for other text classification tasks in different domains.
The CRS classifier demonstrated robust performance across several evaluation metrics, including accuracy, precision, recall, and F1 score. These metrics provide a comprehensive assessment of the classifier's performance, considering both its ability to correctly identify vulnerabilities and its ability to avoid false positives and false negatives.
The project's approach is highly transferable to other domain-specific areas that involve text classification. The preprocessing steps and the Naive Bayes classifier can be applied to any text data, provided that the data is labelled for supervised learning. Furthermore, the Python script can be easily modified to accommodate different data sources, classification algorithms, or evaluation metrics.
The project's approach can also be replicated using different programming languages, development environments, libraries, and algorithms. The key steps of the approach, including text preprocessing, classifier training, and performance evaluation, are common tasks in machine learning and natural language processing, and they can be implemented in many programming languages that support these tasks, such as R, Java, or C++. Similarly, different development environments or libraries (such as the new KerasNLP or the JAX/PyTorch-backed Keras 3.0 API) can provide the necessary functionality for these tasks.
While the Naive Bayes classifier baseline was effective, the more complex CNN model achieves higher performance on related tasks and datasets. However, such a model also has drawbacks, such as increased computational cost and the risk of overfitting. Therefore, the choice of classifier should be guided by the specific requirements and constraints of each task.
CONTENT_PATH = '/content/'
# !cd $CONTENT_PATH && ls
!ls
sample_data
!rm -r sample_data
# import codecs,collections,csv,glob,io,itertools,json,logging,nltk,pathlib,\
# pickle,pprint,pytest,re,requests,shutil,string,sys,unicodedata,warnings,zipfile
import codecs,collections,csv,glob,io,itertools,json,logging,nltk,os,pathlib,\
pickle,pprint,pytest,re,requests,shutil,string,sys,time,unicodedata,warnings,zipfile
# import time
## from time import time
global_start = time.time()
## TensorFlow backend only supports string inputs
# import os
"KERAS_BACKEND"] = "tensorflow"
os.environ[import keras
from keras import layers
!pip install -q "tensorflow-text" # ==2.13.*"
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.2/5.2 MB 9.8 MB/s eta 0:00:00
# !pip install -q "tensorflow-text" # ==2.13.*"
import tensorflow_text as tf_text
# import keras
import tensorflow as tf
import tensorflow.data as tf_data
import tensorflow_datasets as tfds
# from keras import layers
from tensorflow.keras import Model
# from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import concatenate
from tensorflow.keras.utils import plot_model
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()
import sklearn.feature_extraction.text
import sklearn.metrics
from matplotlib import pyplot as plt
from pandas.core.frame import DataFrame
from numpy.testing import assert_array_equal
from google.colab import files
from tempfile import NamedTemporaryFile
from tqdm.notebook import tqdm
from typing import KeysView
from os import listdir
from os.path import isfile, join
from operator import itemgetter
from optparse import OptionParser
from collections import Counter
from collections import defaultdict
from collections import namedtuple
from collections import OrderedDict
from scipy.cluster.hierarchy import dendrogram
from scipy.sparse import csr_matrix
from scipy.special import logit
from scipy.stats.distributions import uniform
from gensim.models import word2vec
from gensim.models import Word2Vec
from gensim.models.phrases import Phraser
from gensim.models.phrases import Phrases
from keras import initializers, regularizers, constraints, optimizers, layers
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from nltk import ngrams
# from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.base import RegressorMixin
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import dump_svmlight_file
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array
from sklearn.utils import check_X_y
from sklearn.utils import shuffle
from sklearn.utils.extmath import density
from sklearn.utils.extmath import log_logistic
from sklearn.utils.multiclass import unique_labels
## Importing basic python libraries
%matplotlib inline
from __future__ import print_function
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
True
epochs = 15 #30 #10 #2 #10
new_num_labels = 25 #4
batch_size = 32
seed = 0 # 42
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 250

# Global values.
WORDS_SIZE = 10000
INPUT_SIZE = 500
NUM_CLASSES = new_num_labels #5 # 2 # NUM_CLASSES=2
MODEL_NUM = 0
EPOCHS = epochs #15 #10

# Preprocessing params.
PRETRAINING_BATCH_SIZE = 128
FINETUNING_BATCH_SIZE = 32
SEQ_LENGTH = 128
MASK_RATE = 0.25
PREDICTIONS_PER_SEQ = 32

# Model params.
NUM_LAYERS = 3
MODEL_DIM = 256
INTERMEDIATE_DIM = 512
NUM_HEADS = 4
DROPOUT = 0.1
NORM_EPSILON = 1e-5

# Training params.
PRETRAINING_LEARNING_RATE = 5e-4
PRETRAINING_EPOCHS = 8
FINETUNING_LEARNING_RATE = 5e-5
FINETUNING_EPOCHS = 3
# Generate random seed
#myrand=np.random.randint(1, 99999 + 1)
rand = seed # 1234 # 71926
np.random.seed(rand)
tf.random.set_seed(rand)
print("Random seed is:", rand)
Random seed is: 0
print("Tensorlfow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
## Tensorlfow version: 2.13.1
## Eager mode: True
## GPU is NOT AVAILABLE
## Tensorlfow version: 2.15.0
## Eager mode: True
## T4 GPU is available
## High RAM
WARNING:tensorflow:From <ipython-input-16-b8ab0bb2f411>:3: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
Tensorlfow version: 2.15.0
Eager mode: True
GPU is NOT AVAILABLE
# Baseline/Common-sense MODELs
porter_stemmer = PorterStemmer()
# stemmer = PorterStemmer()
stemmer = SnowballStemmer("english", ignore_stopwords=True)

def n_vect(text_list, n=3):
    n_vect = []
    for item in ngrams(text_list, n):
        n_vect.append(' '.join(item))
    return n_vect
def process_text(text, n=1):
    ## Check if text is a string
    if not isinstance(text, str):
        text = str(text)

    ## Initiate the tokenised text as an empty list
    tokenised = []
    word_lemmatize = WordNetLemmatizer()
    stop_words = stopwords.words('english')

    #1. Convert text to lower case and remove all punctuation
    text = text.lower()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)

    #2. Remove all stopwords
    removed_stop_words = [word for word in nopunc.split() if word.lower() not in stop_words]

    #3. Apply Stemming
    stemming = [stemmer.stem(word) for word in removed_stop_words]

    #4. Apply Ngram Tokenisation
    tokenised = n_vect(stemming, n)

    #5. Remove non-UTF-8 characters
    tokenised = [word.encode("utf-8", "ignore").decode("utf-8") for word in tokenised]

    # # [OPTIONAL] alternative stemming/lemmatisation, kept for reference:
    # tokenised = [porter_stemmer.stem(word) for word in nltk.word_tokenize(text)]
    # tokenised = [word_lemmatize.lemmatize(word) for word in tokenised if word not in stop_words]
    # vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(n, n+1))
    # vect.fit(tokenised)
    # vect.get_feature_names()

    #6. Return the tokenised text as a list
    return tokenised # .strip()

## Testing on a simple string
process_text("Here we're testing the process_text function, results are as follows:", n=3) # , n=1
['test processtext function',
'processtext function result',
'function result follow']
"""
## Download the SARD cpp_8750_files data
"""
# !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
{"type":"string"}
# %%time
data_url = 'https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='cache_dir',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
## CPU times: user 366 µs, sys: 0 ns, total: 366 µs
## Wall time: 348 µs
Downloading data from https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
895869/895869 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'),
PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz')]
# Load text files with categories as subfolder names.
# data = datasets.load_files('20news-bydate-test')
data = datasets.load_files('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted')
## cd /tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
# !pwd ## /content
!ls -lh /tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
total 464K
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE121_Stack_Based_Buffer_Overflow
drwxr-xr-x 2 root root 16K Jan 2 08:51 CWE122_Heap_Based_Buffer_Overflow
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE124_Buffer_Underwrite
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE126_Buffer_Overread
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE127_Buffer_Underread
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE134_Uncontrolled_Format_String
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE190_Integer_Overflow
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE191_Integer_Underflow
drwxr-xr-x 2 root root 12K Jan 2 04:10 CWE194_Unexpected_Sign_Extension
drwxr-xr-x 2 root root 12K Jan 2 04:10 CWE195_Signed_to_Unsigned_Conversion_Error
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE197_Numeric_Truncation_Error
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE23_Relative_Path_Traversal
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE369_Divide_by_Zero
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE36_Absolute_Path_Traversal
drwxr-xr-x 2 root root 12K Jan 2 04:10 CWE400_Resource_Exhaustion
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE401_Memory_Leak
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE415_Double_Free
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE457_Use_of_Uninitialized_Variable
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE563_Unused_Variable
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE590_Free_Memory_Not_on_Heap
drwxr-xr-x 2 root root 16K Jan 2 04:10 CWE680_Integer_Overflow_to_Buffer_Overflow
drwxr-xr-x 2 root root 12K Jan 2 04:10 CWE690_NULL_Deref_From_Return
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE762_Mismatched_Memory_Management_Routines
drwxr-xr-x 2 root root 20K Jan 2 04:10 CWE789_Uncontrolled_Mem_Alloc
drwxr-xr-x 2 root root 24K Jan 2 04:10 CWE78_OS_Command_Injection
data.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
# data.target
print(len(data.target)) ## 7628
print(data.target) ## [11 3 18 ... 5 8 8]
data_targets_list = list(np.unique(data.target))
print(len(data_targets_list)) ## 25 ## number of unique targets
print(data_targets_list) ## list of unique targets
## [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
7628
[11 3 18 ... 5 8 8]
25
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
data.target_names[13]
'CWE36_Absolute_Path_Traversal'
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
'CWE122_Heap_Based_Buffer_Overflow',
'CWE124_Buffer_Underwrite',
'CWE126_Buffer_Overread',
'CWE127_Buffer_Underread',
'CWE134_Uncontrolled_Format_String',
'CWE190_Integer_Overflow',
'CWE191_Integer_Underflow',
'CWE194_Unexpected_Sign_Extension',
'CWE195_Signed_to_Unsigned_Conversion_Error',
'CWE197_Numeric_Truncation_Error',
'CWE23_Relative_Path_Traversal',
'CWE369_Divide_by_Zero',
'CWE36_Absolute_Path_Traversal',
'CWE400_Resource_Exhaustion',
'CWE401_Memory_Leak',
'CWE415_Double_Free',
'CWE457_Use_of_Uninitialized_Variable',
'CWE563_Unused_Variable',
'CWE590_Free_Memory_Not_on_Heap',
'CWE680_Integer_Overflow_to_Buffer_Overflow',
'CWE690_NULL_Deref_From_Return',
'CWE762_Mismatched_Memory_Management_Routines',
'CWE789_Uncontrolled_Mem_Alloc',
'CWE78_OS_Command_Injection']
data_target_tuples = []
for text, category, category_names in zip(data['data'], data['target'], data['target_names']):
    decoded = text.decode("cp1252")
    one_line = str.join(" ", decoded.splitlines())
    data_target_tuples.append((one_line, category, category_names))
len(data_target_tuples) ## 25
25
data_tuples = []
for text, category in zip(data['data'], data['target']):
    decoded = text.decode("cp1252")
    one_line = str.join(" ", decoded.splitlines())
    data_tuples.append((one_line, category))
# len(tuple_list) ## 10639 ## 10650
len(data_tuples) ## 7628
7628
df = pd.DataFrame(data_tuples, columns=['Text','Category'])
# data=df
df
| | Text | Category |
|---|---|---|
0 | #include "std_testcase.h" #ifdef _WIN32 #d... | 11 |
1 | #include "std_testcase.h" #include <wchar.... | 3 |
2 | #ifndef OMITGOOD #include "std_testcase.h" ... | 18 |
3 | #include "std_testcase.h" #include "environ... | 24 |
4 | #include "std_testcase.h" #include <wchar.... | 4 |
... | ... | ... |
7623 | #include "std_testcase.h" #include <wchar.... | 16 |
7624 | #include "std_testcase.h" #ifdef _WIN32 #d... | 11 |
7625 | #include "std_testcase.h" #ifndef _WIN32 #... | 5 |
7626 | #include "std_testcase.h" #include <vector>... | 8 |
7627 | #include "std_testcase.h" #include <map> u... | 8 |
7628 rows × 2 columns
# dtype: object
df.dtypes # data.shape ## (10649, 4)
Text object
Category int64
dtype: object
# df[['Text',"Category"]].describe()
df.describe()
| | Category |
|---|---|
count | 7628.000000 |
mean | 11.998820 |
std | 7.343498 |
min | 0.000000 |
25% | 5.000000 |
50% | 12.000000 |
75% | 18.000000 |
max | 24.000000 |
# df[['Text',"Category"]].value_counts()
print(df.value_counts())
# Text Category
# 0 10 1
# 5036 10 1
# 5048 8 1
# 5047 9 1
# 5046 8 1
# ..
# 2531 2 1
# 2530 3 1
# 2529 3 1
# 2528 4 1
# 7565 5 1
# Length: 7628, dtype: int64
Text Category
#include "std_testcase.h" #define CHAR_ARRAY_SIZE 8 namespace fgets_33 { #ifndef OMITBAD void bad() { short data; short &dataRef = data; data = -1; { char inputBuffer[CHAR_ARRAY_SIZE] = ""; if (fgets(inputBuffer, CHAR_ARRAY_SIZE, stdin) != NULL) { data = (short)atoi(inputBuffer); } else { printLine("fgets() failed."); } } { short data = dataRef; { char charData = (char)data; printHexCharLine(charData); } } } #endif #ifndef OMITGOOD static void goodG2B() { short data; short &dataRef = data; data = -1; data = CHAR_MAX-5; { short data = dataRef; { char charData = (char)data; printHexCharLine(charData); } } } void good() { goodG2B(); } #endif } #ifdef INCLUDEMAIN using namespace fgets_33; int main(int argc, char * argv[]) { srand( (unsigned)time(NULL) ); #ifndef OMITGOOD printLine("Calling good()..."); good(); printLine("Finished good()"); #endif #ifndef OMITBAD printLine("Calling bad()..."); bad(); printLine("Finished bad()"); #endif return 0; } #endif 10 1
#include "std_testcase.h" #include <map> using namespace std; namespace rand_to_char_74 { #ifndef OMITBAD void badSink(map<int, int> dataMap); void bad() { int data; map<int, int> dataMap; data = -1; data = RAND32(); dataMap[0] = data; dataMap[1] = data; dataMap[2] = data; badSink(dataMap); } #endif #ifndef OMITGOOD void goodG2BSink(map<int, int> dataMap); static void goodG2B() { int data; map<int, int> dataMap; data = -1; data = CHAR_MAX-5; dataMap[0] = data; dataMap[1] = data; dataMap[2] = data; goodG2BSink(dataMap); } void good() { goodG2B(); } #endif } #ifdef INCLUDEMAIN using namespace rand_to_char_74; int main(int argc, char * argv[]) { srand( (unsigned)time(NULL) ); #ifndef OMITGOOD printLine("Calling good()..."); good(); printLine("Finished good()"); #endif #ifndef OMITBAD printLine("Calling bad()..."); bad(); printLine("Finished bad()"); #endif return 0; } #endif 10 1
#include "std_testcase.h" #include <map> using namespace std; namespace socket_strncpy_74 { #ifndef OMITBAD void badSink(map<int, short> dataMap) { short data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { strncpy(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif #ifndef OMITGOOD void goodG2BSink(map<int, short> dataMap) { short data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { strncpy(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif } 8 1
#include "std_testcase.h" #include <map> using namespace std; namespace socket_strncpy_74 { #ifndef OMITBAD void badSink(map<int, int> dataMap) { int data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { strncpy(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif #ifndef OMITGOOD void goodG2BSink(map<int, int> dataMap) { int data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { strncpy(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif } 9 1
#include "std_testcase.h" #include <map> using namespace std; namespace socket_memmove_74 { #ifndef OMITBAD void badSink(map<int, short> dataMap) { short data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { memmove(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif #ifndef OMITGOOD void goodG2BSink(map<int, short> dataMap) { short data = dataMap[2]; { char source[100]; char dest[100] = ""; memset(source, 'A', 100-1); source[100-1] = '\0'; if (data < 100) { memmove(dest, source, data); dest[data] = '\0'; } printLine(dest); } } #endif } 8 1
..
#include "std_testcase.h" #include <wchar.h> namespace wchar_t_memmove_53 { #ifndef OMITBAD void badSink_c(wchar_t * data); void badSink_b(wchar_t * data) { badSink_c(data); } #endif #ifndef OMITGOOD void goodG2BSink_c(wchar_t * data); void goodG2BSink_b(wchar_t * data) { goodG2BSink_c(data); } #endif } 2 1
#include "std_testcase.h" #include <wchar.h> namespace wchar_t_memmove_52 { #ifndef OMITBAD void badSink_b(wchar_t * data); void bad() { wchar_t * data; data = NULL; data = new wchar_t[50]; wmemset(data, L'A', 50-1); data[50-1] = L'\0'; badSink_b(data); } #endif #ifndef OMITGOOD void goodG2BSink_b(wchar_t * data); static void goodG2B() { wchar_t * data; data = NULL; data = new wchar_t[100]; wmemset(data, L'A', 100-1); data[100-1] = L'\0'; goodG2BSink_b(data); } void good() { goodG2B(); } #endif } #ifdef INCLUDEMAIN using namespace wchar_t_memmove_52; int main(int argc, char * argv[]) { srand( (unsigned)time(NULL) ); #ifndef OMITGOOD printLine("Calling good()..."); good(); printLine("Finished good()"); #endif #ifndef OMITBAD printLine("Calling bad()..."); bad(); printLine("Finished bad()"); #endif return 0; } #endif 3 1
#include "std_testcase.h" #include <wchar.h> namespace wchar_t_memmove_52 { #ifndef OMITBAD void badSink_c(wchar_t * data) { { wchar_t dest[100]; wmemset(dest, L'C', 100-1); dest[100-1] = L'\0'; memmove(dest, data, wcslen(dest)*sizeof(wchar_t)); dest[100-1] = L'\0'; printWLine(dest); delete [] data; } } #endif #ifndef OMITGOOD void goodG2BSink_c(wchar_t * data) { { wchar_t dest[100]; wmemset(dest, L'C', 100-1); dest[100-1] = L'\0'; memmove(dest, data, wcslen(dest)*sizeof(wchar_t)); dest[100-1] = L'\0'; printWLine(dest); delete [] data; } } #endif } 3 1
#include "std_testcase.h" #include <wchar.h> namespace wchar_t_memmove_52 { #ifndef OMITBAD void badSink_c(wchar_t * data) { { wchar_t dest[100]; wmemset(dest, L'C', 100-1); dest[100-1] = L'\0'; memmove(dest, data, 100*sizeof(wchar_t)); dest[100-1] = L'\0'; printWLine(dest); } } #endif #ifndef OMITGOOD void goodG2BSink_c(wchar_t * data) { { wchar_t dest[100]; wmemset(dest, L'C', 100-1); dest[100-1] = L'\0'; memmove(dest, data, 100*sizeof(wchar_t)); dest[100-1] = L'\0'; printWLine(dest); } } #endif } 4 1
#include <vector> #include "std_testcase.h" #ifndef _WIN32 #include <wchar.h> #endif using namespace std; namespace t_console_w32_vsnprintf_72 { #ifndef OMITBAD void badSink(vector<wchar_t *> dataVector); void bad() { wchar_t * data; vector<wchar_t *> dataVector; wchar_t dataBuffer[100] = L""; data = dataBuffer; { size_t dataLen = wcslen(data); if (100-dataLen > 1) { if (fgetws(data+dataLen, (int)(100-dataLen), stdin) != NULL) { dataLen = wcslen(data); if (dataLen > 0 && data[dataLen-1] == L'\n') { data[dataLen-1] = L'\0'; } } else { printLine("fgetws() failed"); data[dataLen] = L'\0'; } } } dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); badSink(dataVector); } #endif #ifndef OMITGOOD void goodG2BSink(vector<wchar_t *> dataVector); static void goodG2B() { wchar_t * data; vector<wchar_t *> dataVector; wchar_t dataBuffer[100] = L""; data = dataBuffer; wcscpy(data, L"fixedstringtest"); dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); goodG2BSink(dataVector); } void goodB2GSink(vector<wchar_t *> dataVector); static void goodB2G() { wchar_t * data; vector<wchar_t *> dataVector; wchar_t dataBuffer[100] = L""; data = dataBuffer; { size_t dataLen = wcslen(data); if (100-dataLen > 1) { if (fgetws(data+dataLen, (int)(100-dataLen), stdin) != NULL) { dataLen = wcslen(data); if (dataLen > 0 && data[dataLen-1] == L'\n') { data[dataLen-1] = L'\0'; } } else { printLine("fgetws() failed"); data[dataLen] = L'\0'; } } } dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); dataVector.insert(dataVector.end(), 1, data); goodB2GSink(dataVector); } void good() { goodG2B(); goodB2G(); } #endif } #ifdef INCLUDEMAIN using namespace t_console_w32_vsnprintf_72; int main(int argc, char * argv[]) { srand( (unsigned)time(NULL) ); #ifndef OMITGOOD printLine("Calling good()..."); good(); printLine("Finished good()"); #endif #ifndef OMITBAD printLine("Calling bad()..."); bad(); printLine("Finished bad()"); #endif return 0; } #endif 5 1
Length: 7628, dtype: int64
"Category"].value_counts(5) df[
11 0.045884
16 0.045884
12 0.045884
5 0.045884
10 0.045884
13 0.045884
24 0.045884
18 0.045884
22 0.045752
19 0.045621
3 0.043917
4 0.043655
15 0.043655
2 0.043393
0 0.042475
23 0.042344
1 0.041426
20 0.039460
6 0.039067
17 0.038936
7 0.038542
21 0.021893
8 0.021631
9 0.020713
14 0.020451
Name: Category, dtype: float64
df.groupby('Category').describe()
| Category | count | unique | top | freq |
|---|---|---|---|---|
0 | 324 | 324 | #include "std_testcase.h" #include <map> u... | 1 |
1 | 316 | 316 | #include "std_testcase.h" #ifndef _WIN32 #... | 1 |
2 | 331 | 331 | #ifndef OMITGOOD #include "std_testcase.h" ... | 1 |
3 | 335 | 335 | #include "std_testcase.h" #include <wchar.... | 1 |
4 | 333 | 333 | #include "std_testcase.h" #include <wchar.... | 1 |
5 | 350 | 350 | #ifndef OMITBAD #include "std_testcase.h" #... | 1 |
6 | 298 | 298 | #include "std_testcase.h" #include "max_mul... | 1 |
7 | 294 | 294 | #include "std_testcase.h" #include <map> u... | 1 |
8 | 165 | 165 | #include "std_testcase.h" namespace socket... | 1 |
9 | 158 | 158 | #include "std_testcase.h" #include <list> ... | 1 |
10 | 350 | 350 | #include "std_testcase.h" #include <vector>... | 1 |
11 | 350 | 350 | #include "std_testcase.h" #ifdef _WIN32 #d... | 1 |
12 | 350 | 350 | #include "std_testcase.h" #include <vector>... | 1 |
13 | 350 | 350 | #include "std_testcase.h" #ifndef _WIN32 #... | 1 |
14 | 156 | 156 | #ifndef OMITGOOD #include "std_testcase.h" ... | 1 |
15 | 333 | 333 | #include "std_testcase.h" #ifndef _WIN32 #... | 1 |
16 | 350 | 350 | #include "std_testcase.h" #include <wchar.... | 1 |
17 | 297 | 297 | #include "std_testcase.h" namespace int_ar... | 1 |
18 | 350 | 350 | #ifndef OMITGOOD #include "std_testcase.h" ... | 1 |
19 | 348 | 348 | #include "std_testcase.h" #include <wchar.... | 1 |
20 | 301 | 301 | #ifndef OMITGOOD #include "std_testcase.h" ... | 1 |
21 | 167 | 167 | #ifndef OMITBAD #include "std_testcase.h" #... | 1 |
22 | 349 | 349 | #include "std_testcase.h" namespace int_ma... | 1 |
23 | 323 | 323 | #include "std_testcase.h" #ifndef _WIN32 #... | 1 |
24 | 350 | 350 | #include "std_testcase.h" #include "environ... | 1 |
df.head()
| | Text | Category |
|---|---|---|
0 | #include "std_testcase.h" #ifdef _WIN32 #d... | 11 |
1 | #include "std_testcase.h" #include <wchar.... | 3 |
2 | #ifndef OMITGOOD #include "std_testcase.h" ... | 18 |
3 | #include "std_testcase.h" #include "environ... | 24 |
4 | #include "std_testcase.h" #include <wchar.... | 4 |
df.shape
(7628, 2)
df.describe()
| | Category |
|---|---|
count | 7628.000000 |
mean | 11.998820 |
std | 7.343498 |
min | 0.000000 |
25% | 5.000000 |
50% | 12.000000 |
75% | 18.000000 |
max | 24.000000 |
cat_df = df['Category']
cat_df.head()
0 11
1 3
2 18
3 24
4 4
Name: Category, dtype: int64
=len(df[['Text',"Category"]])
data_len=df[['Text',"Category"]].count()
data_count data_count
Text 7628
Category 7628
dtype: int64
data_missing = data_len - data_count
data_missing
Text 0
Category 0
dtype: int64
data_drop_na = df.dropna()
data_drop_na.to_csv('data_drop_na.csv')
data_drop_na.count()
Text 7628
Category 7628
dtype: int64
data_123f = data_drop_na.groupby(["Category"])['Text'].nunique()
data_123f.count()
data_123f
Category
0 324
1 316
2 331
3 335
4 333
5 350
6 298
7 294
8 165
9 158
10 350
11 350
12 350
13 350
14 156
15 333
16 350
17 297
18 350
19 348
20 301
21 167
22 349
23 323
24 350
Name: Text, dtype: int64
g = data_drop_na.groupby('Category')
data_L1_filtered = g.filter(lambda x: len(x) > 9) # pandas 0.13.1
data_L1_filtered.count()
Text 7628
Category 7628
dtype: int64
data_filtered = data_L1_filtered
def process_text_series(series, n=1):
    return series.apply(lambda x: process_text(x, n))

data_filtered['Text'] = process_text_series(data_filtered['Text'], n=1)
data_filtered['Text']
# 0 [includ, stdtestcaseh, ifdef, win32, defin, ba...
# 1 [includ, stdtestcaseh, includ, wcharh, namespa...
# 2 [ifndef, omitgood, includ, stdtestcaseh, inclu...
# 3 [includ, stdtestcaseh, includ, environmentw32s...
# 4 [includ, stdtestcaseh, includ, wcharh, namespa...
# ...
# 7623 [includ, stdtestcaseh, includ, wcharh, namespa...
# 7624 [includ, stdtestcaseh, ifdef, win32, defin, ba...
# 7625 [includ, stdtestcaseh, ifndef, win32, includ, ...
# 7626 [includ, stdtestcaseh, includ, vector, defin, ...
# 7627 [includ, stdtestcaseh, includ, map, use, names...
# Name: Text, Length: 7628, dtype: object
0 [includ, stdtestcaseh, ifdef, win32, defin, ba...
1 [includ, stdtestcaseh, includ, wcharh, namespa...
2 [ifndef, omitgood, includ, stdtestcaseh, inclu...
3 [includ, stdtestcaseh, includ, environmentw32s...
4 [includ, stdtestcaseh, includ, wcharh, namespa...
...
7623 [includ, stdtestcaseh, includ, wcharh, namespa...
7624 [includ, stdtestcaseh, ifdef, win32, defin, ba...
7625 [includ, stdtestcaseh, ifndef, win32, includ, ...
7626 [includ, stdtestcaseh, includ, vector, defin, ...
7627 [includ, stdtestcaseh, includ, map, use, names...
Name: Text, Length: 7628, dtype: object
data_filtered ## not df ## nor data
| | Text | Category |
|---|---|---|
0 | [includ, stdtestcaseh, ifdef, win32, defin, ba... | 11 |
1 | [includ, stdtestcaseh, includ, wcharh, namespa... | 3 |
2 | [ifndef, omitgood, includ, stdtestcaseh, inclu... | 18 |
3 | [includ, stdtestcaseh, includ, environmentw32s... | 24 |
4 | [includ, stdtestcaseh, includ, wcharh, namespa... | 4 |
... | ... | ... |
7623 | [includ, stdtestcaseh, includ, wcharh, namespa... | 16 |
7624 | [includ, stdtestcaseh, ifdef, win32, defin, ba... | 11 |
7625 | [includ, stdtestcaseh, ifndef, win32, includ, ... | 5 |
7626 | [includ, stdtestcaseh, includ, vector, defin, ... | 8 |
7627 | [includ, stdtestcaseh, includ, map, use, names... | 8 |
7628 rows × 2 columns
df_target = data_filtered['Category']
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
'CWE122_Heap_Based_Buffer_Overflow',
'CWE124_Buffer_Underwrite',
'CWE126_Buffer_Overread',
'CWE127_Buffer_Underread',
'CWE134_Uncontrolled_Format_String',
'CWE190_Integer_Overflow',
'CWE191_Integer_Underflow',
'CWE194_Unexpected_Sign_Extension',
'CWE195_Signed_to_Unsigned_Conversion_Error',
'CWE197_Numeric_Truncation_Error',
'CWE23_Relative_Path_Traversal',
'CWE369_Divide_by_Zero',
'CWE36_Absolute_Path_Traversal',
'CWE400_Resource_Exhaustion',
'CWE401_Memory_Leak',
'CWE415_Double_Free',
'CWE457_Use_of_Uninitialized_Variable',
'CWE563_Unused_Variable',
'CWE590_Free_Memory_Not_on_Heap',
'CWE680_Integer_Overflow_to_Buffer_Overflow',
'CWE690_NULL_Deref_From_Return',
'CWE762_Mismatched_Memory_Management_Routines',
'CWE789_Uncontrolled_Mem_Alloc',
'CWE78_OS_Command_Injection']
# df['Text'].head(5).apply(process_text)
data_filtered['Text'].head(5).apply(process_text)
0 [includ, stdtestcaseh, ifdef, win32, defin, ba...
1 [includ, stdtestcaseh, includ, wcharh, namespa...
2 [ifndef, omitgood, includ, stdtestcaseh, inclu...
3 [includ, stdtestcaseh, includ, environmentw32s...
4 [includ, stdtestcaseh, includ, wcharh, namespa...
Name: Text, dtype: object
%%time
# bow_transformer = CountVectorizer(analyzer=process_text).fit(df['Text'])
bow_transformer = CountVectorizer(analyzer=process_text).fit(data_filtered['Text'])
print(len(bow_transformer.vocabulary_)) ## 12766
## CPU times: user 26.5 s, sys: 84.2 ms, total: 26.6 s
## Wall time: 36 s
12766
CPU times: user 14.7 s, sys: 249 ms, total: 14.9 s
Wall time: 15 s
print(bow_transformer.get_feature_names_out()[114])
print(bow_transformer.get_feature_names_out()[783])
## alloca100sizeofint64t
## arraystructtwointsstruct54
alloca100sizeofint64t
arraystructtwointsstruct54
## ID of a term
# bow_transformer.vocabulary_['chihuahua']
bow_transformer.vocabulary_['arraystructtwointsstruct54'] ## 783
783
%%time
# text_bow = bow_transformer.transform(df['Text'])
text_bow = bow_transformer.transform(data_filtered['Text'])
## CPU times: user 21.5 s, sys: 70.8 ms, total: 21.6 s
## Wall time: 21.8 s
CPU times: user 14.6 s, sys: 238 ms, total: 14.8 s
Wall time: 14.9 s
print('Shape of Sparse Matrix: ', text_bow.shape)
print('Amount of Non-Zero occurences: ', text_bow.nnz)
Shape of Sparse Matrix: (7628, 12766)
Amount of Non-Zero occurences: 316364
text_bow.shape[0] * text_bow.shape[1]
97379048
sparsity = (100.0 * text_bow.nnz / (text_bow.shape[0] * text_bow.shape[1]))
print('sparsity: {}'.format(sparsity)) ## sparsity: 0.3248789205661571
sparsity: 0.3248789205661571
with open('bow_transformer.pk', 'wb') as bow:
pickle.dump(bow_transformer, bow)
%%time
tfidf_transformer = TfidfTransformer().fit(text_bow)
# ## CPU times: user 5.72 ms, sys: 941 µs, total: 6.66 ms
# ## Wall time: 13.2 ms
CPU times: user 3.43 ms, sys: 26 µs, total: 3.46 ms
Wall time: 3.47 ms
# %%time
text_tfidf = tfidf_transformer.transform(text_bow)
print(text_tfidf.shape) ## (7628, 12766)
## CPU times: user 25 ms, sys: 3.03 ms, total: 28.1 ms
## Wall time: 27.7 ms
(7628, 12766)
with open('tfidf_transformer.pk', 'wb') as bow:
pickle.dump(tfidf_transformer, bow)
# %%time
# detect_model = MultinomialNB().fit(text_tfidf, df['Category'])
detect_model = MultinomialNB().fit(text_tfidf, df_target)
## CPU times: user 26.8 ms, sys: 7.02 ms, total: 33.9 ms
## Wall time: 35.4 ms
# %%time
all_predictions = detect_model.predict(text_tfidf)
## CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
## Wall time: 11.9 ms

target_names = data['target_names']
# print(classification_report(df['Category'], all_predictions, target_names=data['target_names']))
print(classification_report(df_target, all_predictions, target_names=target_names))
precision recall f1-score support
CWE121_Stack_Based_Buffer_Overflow 0.91 0.70 0.79 324
CWE122_Heap_Based_Buffer_Overflow 0.87 0.66 0.75 316
CWE124_Buffer_Underwrite 0.63 0.85 0.72 331
CWE126_Buffer_Overread 0.86 0.92 0.89 335
CWE127_Buffer_Underread 0.92 0.77 0.83 333
CWE134_Uncontrolled_Format_String 0.98 0.90 0.94 350
CWE190_Integer_Overflow 0.94 0.79 0.86 298
CWE191_Integer_Underflow 0.95 0.77 0.85 294
CWE194_Unexpected_Sign_Extension 1.00 0.58 0.73 165
CWE195_Signed_to_Unsigned_Conversion_Error 0.97 0.55 0.70 158
CWE197_Numeric_Truncation_Error 0.70 0.90 0.79 350
CWE23_Relative_Path_Traversal 0.91 0.98 0.95 350
CWE369_Divide_by_Zero 0.85 0.92 0.88 350
CWE36_Absolute_Path_Traversal 0.83 0.91 0.87 350
CWE400_Resource_Exhaustion 1.00 0.79 0.89 156
CWE401_Memory_Leak 0.81 0.82 0.81 333
CWE415_Double_Free 0.75 0.76 0.76 350
CWE457_Use_of_Uninitialized_Variable 0.94 0.97 0.96 297
CWE563_Unused_Variable 0.81 1.00 0.90 350
CWE590_Free_Memory_Not_on_Heap 0.88 0.88 0.88 348
CWE680_Integer_Overflow_to_Buffer_Overflow 0.99 0.85 0.91 301
CWE690_NULL_Deref_From_Return 1.00 0.31 0.47 167
CWE762_Mismatched_Memory_Management_Routines 0.79 0.85 0.82 349
CWE789_Uncontrolled_Mem_Alloc 0.57 0.97 0.72 323
CWE78_OS_Command_Injection 1.00 0.95 0.97 350
accuracy 0.84 7628
macro avg 0.87 0.81 0.83 7628
weighted avg 0.86 0.84 0.84 7628
# precision recall f1-score support
# CWE121_Stack_Based_Buffer_Overflow 0.91 0.71 0.80 324
# CWE122_Heap_Based_Buffer_Overflow 0.87 0.66 0.75 316
# CWE124_Buffer_Underwrite 0.63 0.85 0.72 331
# CWE126_Buffer_Overread 0.86 0.92 0.89 335
# CWE127_Buffer_Underread 0.92 0.77 0.83 333
# CWE134_Uncontrolled_Format_String 0.98 0.90 0.94 350
# CWE190_Integer_Overflow 0.94 0.79 0.86 298
# CWE191_Integer_Underflow 0.95 0.77 0.85 294
# CWE194_Unexpected_Sign_Extension 1.00 0.59 0.74 165
# CWE195_Signed_to_Unsigned_Conversion_Error 0.99 0.55 0.71 158
# CWE197_Numeric_Truncation_Error 0.71 0.90 0.79 350
# CWE23_Relative_Path_Traversal 0.91 0.98 0.95 350
# CWE369_Divide_by_Zero 0.83 0.91 0.87 350
# CWE36_Absolute_Path_Traversal 0.82 0.91 0.86 350
# CWE400_Resource_Exhaustion 1.00 0.79 0.89 156
# CWE401_Memory_Leak 0.81 0.82 0.81 333
# CWE415_Double_Free 0.76 0.76 0.76 350
# CWE457_Use_of_Uninitialized_Variable 0.94 0.97 0.96 297
# CWE563_Unused_Variable 0.81 1.00 0.89 350
# CWE590_Free_Memory_Not_on_Heap 0.88 0.88 0.88 348
# CWE680_Integer_Overflow_to_Buffer_Overflow 0.98 0.85 0.91 301
# CWE690_NULL_Deref_From_Return 1.00 0.31 0.47 167
# CWE762_Mismatched_Memory_Management_Routines 0.79 0.85 0.82 349
# CWE789_Uncontrolled_Mem_Alloc 0.57 0.97 0.72 323
# CWE78_OS_Command_Injection 1.00 0.95 0.97 350
# accuracy 0.84 7628
# macro avg 0.87 0.81 0.83 7628
# weighted avg 0.86 0.84 0.84 7628
# %%time
clf = MultinomialNB()
# scores = cross_val_score(clf, text_tfidf, df['Category'], cv=8)
scores = cross_val_score(clf, text_tfidf, df_target, cv=8)
#scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.74 (+/- 0.03)
# CPU times: user 156 ms, sys: 8.02 ms, total: 164 ms
# Wall time: 167 ms
Accuracy: 0.74 (+/- 0.03)
# from sklearn.ensemble import RandomForestClassifier
%%time
clf = RandomForestClassifier()
scores = cross_val_score(clf, text_tfidf, df_target, cv=8)
#scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.87 (+/- 0.02) ## Accuracy: 0.87 (+/- 0.01)
## CPU times: user 2min 7s, sys: 287 ms, total: 2min 7s
## Wall time: 2min 10s
Accuracy: 0.87 (+/- 0.02)
CPU times: user 2min 2s, sys: 437 ms, total: 2min 2s
Wall time: 2min 4s
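"""
The commented-out Pipeline in the next cell mentions LinearSVC(C=10); as a hedged side
experiment (not part of the original runs), the same 8-fold cross-validation can be
applied to it on the identical TF-IDF features:
"""
from sklearn.svm import LinearSVC

## Hypothetical comparison; C=10 is taken from the commented-out Pipeline below
svc_scores = cross_val_score(LinearSVC(C=10), text_tfidf, df_target, cv=8)
print("Accuracy: %0.2f (+/- %0.2f)" % (svc_scores.mean(), svc_scores.std() * 2))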
start = time.time()
#classifier = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))),('classifier', LinearSVC(C=10))])
## train_test_split
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df['Category'], test_size=0.2, random_state=0) ## 11
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df_target, test_size=0.2, random_state=0) ## 11
# %%time
clf = MultinomialNB()
clf.fit(X_train, y_train)
end = time.time()
# print("Accuracy: " + str(clf.score(X_test, y_test))) #+ ", Time duration: " + str(end - start))
print("Accuracy: " + str(clf.score(X_test, y_test)) + ", Time duration: " + str(end - start))
## Accuracy: 0.7175622542595019, Time duration: 0.029235124588012695
## CPU times: user 25.7 ms, sys: 0 ns, total: 25.7 ms
## Wall time: 31.1 ms
Accuracy: 0.7326343381389253, Time duration: 0.038088321685791016
## Confusion Matrix
y_pred = clf.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)
# Plot confusion_matrix
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(conf_mat, annot=True, cmap="Set3", fmt="d",
            xticklabels=data['target_names'], yticklabels=data['target_names'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
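"""
The heatmap figure itself is not embedded in this export. A quick numeric view of the
same confusion matrix (a sketch, assuming conf_mat and data['target_names'] from the
cells above) is the per-class recall along its diagonal:
"""
## Per-class recall = correct predictions / actual class size (row sums of conf_mat)
per_class_recall = conf_mat.diagonal() / conf_mat.sum(axis=1)
for name, recall in zip(data['target_names'], per_class_recall):
    print(f"{name}: {recall:.2f}")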
# %%time
## train a NB classifer on the entire data
# nb_model = MultinomialNB().fit(text_tfidf, df['Category'])
nb_model = MultinomialNB().fit(text_tfidf, df_target)
with open('nb_model.pk', 'wb') as nb:
pickle.dump(nb_model, nb)
## CPU times: user 26.5 ms, sys: 4.99 ms, total: 31.5 ms
## Wall time: 33.1 ms
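"""
Side note (a hedged alternative, not the approach used in this notebook): the three
pickles above (bow_transformer.pk, tfidf_transformer.pk, nb_model.pk) must be reloaded
and applied in the right order at prediction time; the whole chain could instead be
persisted as a single scikit-learn Pipeline:
"""
from sklearn.pipeline import Pipeline

## One artefact instead of three; process_text must be importable when unpickling
nb_pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=process_text)),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
]).fit(data_filtered['Text'], df_target)

with open('nb_pipeline.pk', 'wb') as f:
    pickle.dump(nb_pipeline, f)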
data_filtered.columns ## Index(['Category', 'Test-Case-Code'], dtype='object')
Index(['Text', 'Category'], dtype='object')
# data_filtered['Category']
df_target
0 11
1 3
2 18
3 24
4 4
..
7623 16
7624 11
7625 5
7626 8
7627 8
Name: Category, Length: 7628, dtype: int64
data_filtered
Text | Category | |
---|---|---|
0 | [includ, stdtestcaseh, ifdef, win32, defin, ba... | 11 |
1 | [includ, stdtestcaseh, includ, wcharh, namespa... | 3 |
2 | [ifndef, omitgood, includ, stdtestcaseh, inclu... | 18 |
3 | [includ, stdtestcaseh, includ, environmentw32s... | 24 |
4 | [includ, stdtestcaseh, includ, wcharh, namespa... | 4 |
... | ... | ... |
7623 | [includ, stdtestcaseh, includ, wcharh, namespa... | 16 |
7624 | [includ, stdtestcaseh, ifdef, win32, defin, ba... | 11 |
7625 | [includ, stdtestcaseh, ifndef, win32, includ, ... | 5 |
7626 | [includ, stdtestcaseh, includ, vector, defin, ... | 8 |
7627 | [includ, stdtestcaseh, includ, map, use, names... | 8 |
7628 rows × 2 columns
# %%time
# Specify the indices
# indices = [0,1,2]
indices = [3,4,5,6]
# indices = [7,8,9]
# Load the transformers and the model
= pickle.load(open("bow_transformer.pk", "rb"))
bow_transf = pickle.load(open("tfidf_transformer.pk", "rb"))
tfidf_transf with open('nb_model.pk', 'rb') as nb:
= pickle.load(nb)
model
# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(df_target.unique())}
# Loop through the specified indices
for in_sample_top_10_cwe_index in indices:
    test_text = data_filtered.iloc[in_sample_top_10_cwe_index,1]
    test_bow = bow_transf.transform([test_text])
    test_tfidf = tfidf_transf.transform(test_bow)
    foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
    print('-----------------------------------------')
    print(f"The predicted CWE-ID is: {foresights_id}")
    # Use the mapping to get the CWE-ID
    if foresights_id in label_to_cwe:
        predicted_cwe_id = label_to_cwe[foresights_id]
        print(f"The predicted CWE-ID is: {predicted_cwe_id}")
        # Check if the model's prediction matches the actual CWE-ID
        actual_cwe_id = data_filtered.iloc[in_sample_top_10_cwe_index]['Category']
        is_correct = (actual_cwe_id == predicted_cwe_id)
        print(f"Is the prediction correct? {is_correct}")
    else:
        print(f"The predicted label {foresights_id} is not in the training data.")
## CPU times: user 52.5 ms, sys: 4.17 ms, total: 56.7 ms
## Wall time: 103 ms
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 1
The predicted CWE-ID is: 3
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
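"""
Caveat on the mapping above: label_to_cwe enumerates df_target.unique() in order of
first appearance, which is not the encoding the model was trained on, so the two
printed "predicted CWE-ID" values disagree. Judging by the commented reference run
below, the integer Category codes appear to index directly into data.target_names;
under that assumption (unverified against the original label encoder), a more direct
decoding would be:
"""
## Hedged sketch: decode the last predicted integer label straight to a CWE name
predicted_cwe_name = data['target_names'][foresights_id]
print(f"The predicted CWE name is: {predicted_cwe_name}")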
# -----------------------------------------
# The predicted CWE-ID is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 4
# The predicted CWE-ID is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 5
# The predicted CWE-ID is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 0
# The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
data_filtered.iloc[6,1]
4
data_filtered.iloc[6,0]
# print(data_filtered.iloc[6,0])
['ifndef',
'omitgood',
'includ',
'stdtestcaseh',
'includ',
'charloop83h',
'namespac',
'charloop83',
'charloop83goodg2bcharloop83goodg2bchar',
'datacopi',
'data',
'datacopi',
'char',
'databuff',
'char',
'malloc100sizeofchar',
'memsetdatabuff',
'1001',
'databuffer1001',
'0',
'data',
'databuff',
'charloop83goodg2bcharloop83goodg2b',
'sizet',
'char',
'dest100',
'memsetdest',
'c',
'1001',
'dest1001',
'0',
'0',
'100',
'desti',
'datai',
'dest1001',
'0',
'printlinedest',
'endif']
## print out the actual predicted CWE-ID instead of array([3])
test_text = data_filtered.iloc[6,1]
# this is how you reload and use the BoW transformer
= pickle.load(open("bow_transformer.pk", "rb"))
bow_transf = bow_transf.transform([test_text])
test_bow
# this is how you reload and use the TF-IDF transformer
# remember it is applied to the result of bow_transformer
= pickle.load(open("tfidf_transformer.pk", "rb"))
tfidf_transf = tfidf_transf.transform(test_bow)
test_tfidf
# here we reload the saved NaiveBayes model and use it to predict the class of our test text
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# model.predict(test_tfidf) ## array([3]) ## array([13])
foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
print(f"The predicted CWE-ID is: {foresights_id}")
# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(df_target.unique())}
# Use the mapping to get the CWE-ID
if foresights_id in label_to_cwe:
    predicted_cwe_id = label_to_cwe[foresights_id]
    print(f"The predicted CWE-ID is: {predicted_cwe_id}")
else:
    print(f"The predicted label {foresights_id} is not in the training data.")
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
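"""
Caveat: data_filtered.iloc[6,1] selects column 1, which is the integer Category (it
displays 4 above), not the source text; the tokenised code lives in column 0. A hedged
sketch of the same prediction run on the Text column instead (reusing the transformers
and model already loaded above):
"""
test_text = data_filtered.iloc[6, 0] # column 0 is the tokenised 'Text'
test_tfidf = tfidf_transf.transform(bow_transf.transform([test_text]))
print(f"The predicted CWE-ID is: {model.predict(test_tfidf)[0]}")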
!wget https://raw.githubusercontent.com/c6ai/temp/main/sard.zip
--2024-01-07 10:35:35-- https://raw.githubusercontent.com/c6ai/temp/main/sard.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2947907 (2.8M) [application/zip]
Saving to: ‘sard.zip’
sard.zip 100%[===================>] 2.81M --.-KB/s in 0.02s
2024-01-07 10:35:36 (170 MB/s) - ‘sard.zip’ saved [2947907/2947907]
out_of_sample_df = pd.read_csv('sard.zip', compression='zip')
print(out_of_sample_df.head())
code CWE-Type DataType
0 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
1 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
2 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
3 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
4 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
out_of_sample_df ## 52802 rows × 3 columns
code | CWE-Type | DataType | |
---|---|---|---|
0 | \n \n \n #include "IncludeMarker"\n \n #includ... | CWE114 | SARD |
1 | \n \n \n #include "IncludeMarker"\n \n #includ... | CWE114 | SARD |
2 | \n \n \n #include "IncludeMarker"\n \n #includ... | CWE114 | SARD |
3 | \n \n \n #include "IncludeMarker"\n \n #includ... | CWE114 | SARD |
4 | \n \n \n #include "IncludeMarker"\n \n #includ... | CWE114 | SARD |
... | ... | ... | ... |
52797 | \n \n \n #include "IncludeMarker"\n \n #define... | CWE688 | SARD |
52798 | \n \n \n #include "IncludeMarker"\n \n #define... | CWE688 | SARD |
52799 | \n \n \n #include "IncludeMarker"\n \n #define... | CWE688 | SARD |
52800 | \n \n \n #include "IncludeMarker"\n \n #define... | CWE688 | SARD |
52801 | \n \n \n #include "IncludeMarker"\n \n #define... | CWE688 | SARD |
52802 rows × 3 columns
out_of_sample_df.columns ## Index(['code', 'CWE-Type', 'DataType'], dtype='object')
Index(['code', 'CWE-Type', 'DataType'], dtype='object')
cwe_list = list(out_of_sample_df['CWE-Type'].unique())
print(len(cwe_list)) ## 109
print(cwe_list)
109
['CWE114', 'CWE121', 'CWE122', 'CWE123', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE15', 'CWE176', 'CWE188', 'CWE190', 'CWE191', 'CWE194', 'CWE195', 'CWE196', 'CWE197', 'CWE222', 'CWE223', 'CWE226', 'CWE242', 'CWE244', 'CWE247', 'CWE252', 'CWE253', 'CWE256', 'CWE259', 'CWE272', 'CWE273', 'CWE284', 'CWE319', 'CWE321', 'CWE325', 'CWE327', 'CWE328', 'CWE338', 'CWE364', 'CWE366', 'CWE367', 'CWE369', 'CWE377', 'CWE390', 'CWE391', 'CWE398', 'CWE400', 'CWE401', 'CWE404', 'CWE415', 'CWE416', 'CWE426', 'CWE427', 'CWE457', 'CWE459', 'CWE464', 'CWE467', 'CWE468', 'CWE469', 'CWE475', 'CWE476', 'CWE478', 'CWE479', 'CWE480', 'CWE481', 'CWE482', 'CWE483', 'CWE484', 'CWE506', 'CWE510', 'CWE511', 'CWE526', 'CWE534', 'CWE535', 'CWE546', 'CWE561', 'CWE562', 'CWE563', 'CWE570', 'CWE571', 'CWE587', 'CWE588', 'CWE590', 'CWE591', 'CWE605', 'CWE606', 'CWE615', 'CWE617', 'CWE620', 'CWE665', 'CWE666', 'CWE667', 'CWE674', 'CWE675', 'CWE680', 'CWE681', 'CWE685', 'CWE690', 'CWE758', 'CWE761', 'CWE773', 'CWE775', 'CWE780', 'CWE785', 'CWE789', 'CWE78', 'CWE832', 'CWE835', 'CWE843', 'CWE90', 'CWE688']
len(data.target_names)
25
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
'CWE122_Heap_Based_Buffer_Overflow',
'CWE124_Buffer_Underwrite',
'CWE126_Buffer_Overread',
'CWE127_Buffer_Underread',
'CWE134_Uncontrolled_Format_String',
'CWE190_Integer_Overflow',
'CWE191_Integer_Underflow',
'CWE194_Unexpected_Sign_Extension',
'CWE195_Signed_to_Unsigned_Conversion_Error',
'CWE197_Numeric_Truncation_Error',
'CWE23_Relative_Path_Traversal',
'CWE369_Divide_by_Zero',
'CWE36_Absolute_Path_Traversal',
'CWE400_Resource_Exhaustion',
'CWE401_Memory_Leak',
'CWE415_Double_Free',
'CWE457_Use_of_Uninitialized_Variable',
'CWE563_Unused_Variable',
'CWE590_Free_Memory_Not_on_Heap',
'CWE680_Integer_Overflow_to_Buffer_Overflow',
'CWE690_NULL_Deref_From_Return',
'CWE762_Mismatched_Memory_Management_Routines',
'CWE789_Uncontrolled_Mem_Alloc',
'CWE78_OS_Command_Injection']
# Assuming data.target_names is a list of target names
target_names = data.target_names
# Create an empty list to store the top 22/25 CWE-Type
top_22_cwe_list = []
# Iterate over the unique CWE-Type list
for cwe_type in cwe_list:
    # Check if the CWE-Type matches the beginning part of any target name
    for target in target_names:
        if target.startswith(cwe_type):
            top_22_cwe_list.append(cwe_type)
            break
    # Stop adding to the list after the top 22 ##!25
    if len(top_22_cwe_list) == 22:
        break
print(len(top_22_cwe_list)) ## 22
print(top_22_cwe_list)
## ['CWE121', 'CWE122', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE190', 'CWE191','CWE194', 'CWE195', 'CWE197',
## 'CWE369', 'CWE400', 'CWE401', 'CWE415', 'CWE457', 'CWE563', 'CWE590', 'CWE680', 'CWE690', 'CWE789', 'CWE78']
22
['CWE121', 'CWE122', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE190', 'CWE191', 'CWE194', 'CWE195', 'CWE197', 'CWE369', 'CWE400', 'CWE401', 'CWE415', 'CWE457', 'CWE563', 'CWE590', 'CWE680', 'CWE690', 'CWE789', 'CWE78']
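"""
Equivalently (a hedged sketch), the same overlap can be computed with a set of exact
CWE prefixes taken from the target names, avoiding the nested startswith loop:
"""
## Each target name starts with its CWE id followed by '_', e.g. 'CWE121_Stack_...'
prefixes = {name.split('_')[0] for name in target_names}
overlap = [cwe for cwe in cwe_list if cwe in prefixes]
print(len(overlap), overlap) ## expected to reproduce the 22 CWEs above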
## make a top_22_cwe_testcase_samples_df with one sample each for the 1st 'code' with 'CWE-Type' that matches op_22_cwe_list
# Assuming out_of_sample_df is the DataFrame and top_22_cwe_list is the list
top_22_cwe_testcase_samples_df = pd.DataFrame()

for cwe in top_22_cwe_list:
    sample_df = out_of_sample_df[out_of_sample_df['CWE-Type'] == cwe].head(1)
    top_22_cwe_testcase_samples_df = pd.concat([top_22_cwe_testcase_samples_df, sample_df])
print(len(top_22_cwe_testcase_samples_df)) ## 22
print(top_22_cwe_testcase_samples_df)
22
code CWE-Type DataType
648 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE121 SARD
6351 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE122 SARD
10044 \n \n \n #include "IncludeMarker"\n \n #includ... CWE124 SARD
11868 \n \n \n #include "IncludeMarker"\n \n #includ... CWE126 SARD
13200 \n \n \n #include "IncludeMarker"\n \n #includ... CWE127 SARD
15024 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE134 SARD
18408 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE190 SARD
23268 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE191 SARD
26994 \n \n \n #include "IncludeMarker"\n \n #ifdef ... CWE194 SARD
28301 \n \n \n #include "IncludeMarker"\n \n #ifdef ... CWE195 SARD
29627 \n \n \n #include "IncludeMarker"\n \n #ifdef ... CWE197 SARD
33468 \n \n \n #include "IncludeMarker"\n \n #includ... CWE369 SARD
34891 \n \n \n #include "IncludeMarker"\n \n #ifdef ... CWE400 SARD
35701 \n \n \n #include "IncludeMarker"\n \n #includ... CWE401 SARD
37325 \n \n \n #include "IncludeMarker"\n \n #includ... CWE415 SARD
38555 \n \n \n #include "IncludeMarker"\n \n #includ... CWE457 SARD
40395 \n \n #include "IncludeMarker"\n \n #ifndef OM... CWE563 SARD
40860 \n \n \n #include "IncludeMarker"\n \n #includ... CWE590 SARD
43314 \n \n \n #include "IncludeMarker"\n \n #ifdef ... CWE680 SARD
43713 \n \n \n #include "IncludeMarker"\n \n #includ... CWE690 SARD
46126 \n \n \n #include "IncludeMarker"\n \n #ifndef... CWE789 SARD
46666 \n \n \n #include "IncludeMarker"\n \n #includ... CWE78 SARD
top_22_cwe_testcase_samples_df.iloc[0, 1] ## CWE121
'CWE121'
# print(out_of_sample_df.iloc[0, 0])
cwe_testcase_sample = top_22_cwe_testcase_samples_df.iloc[0, 0]
# print(cwe_testcase_sample)
!wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_clean_files_top_10_cwe_omitted.tar.gz
# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !gdown 1YQHdd457W4NjuTvJYiucKUr8pRwbGulj
--2024-01-07 10:35:38-- https://raw.githubusercontent.com/c6ai/temp/main/cpp_clean_files_top_10_cwe_omitted.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14922 (15K) [application/octet-stream]
Saving to: ‘cpp_clean_files_top_10_cwe_omitted.tar.gz’
cpp_clean 0%[ ] 0 --.-KB/s
cpp_clean_files_top 100%[===================>] 14.57K --.-KB/s in 0s
2024-01-07 10:35:38 (93.2 MB/s) - ‘cpp_clean_files_top_10_cwe_omitted.tar.gz’ saved [14922/14922]
# !cp -r cpp_clean_omitted_cwe_folders cpp_clean_files_top_10_cwe_omitted
# !tar -czf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted
!tar -xzf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted
# Initialize an empty list to store the data
data = []

# Specify the root folder
root_folder = 'cpp_clean_files_top_10_cwe_omitted'
# Traverse the directory
for subdir, dirs, files in os.walk(root_folder):
    for file in files:
        # Check if the file is a .cpp file
        if file.endswith('.cpp'):
            # Get the CWE-ID from the subfolder name
            cwe_id = os.path.basename(subdir)

            # Open the file and read its content
            with open(os.path.join(subdir, file), 'r') as f:
                file_content = f.read()

            # Append the data to the list
            data.append([cwe_id, file_content])
            # We only need the first .cpp file from each subfolder
            break
# Create a DataFrame
in_sample_top_10_cwe_df = pd.DataFrame(data, columns=['CWE-ID', 'Test-Case-Code'])
# Print the DataFrame
print(in_sample_top_10_cwe_df)
CWE-ID \
0 CWE122_Heap_Based_Buffer_Overflow
1 CWE401_Memory_Leak
2 CWE36_Absolute_Path_Traversal
3 CWE127_Buffer_Underread
4 CWE23_Relative_Path_Traversal
5 CWE121_Stack_Based_Buffer_Overflow
6 CWE415_Double_Free
7 CWE762_Mismatched_Memory_Management_Routines
8 CWE590_Free_Memory_Not_on_Heap
9 CWE134_Uncontrolled_Format_String
Test-Case-Code
0 \n\n\n#include "std_testcase.h"\n\n#include <w...
1 \n\n#ifndef OMITBAD\n\n#include "std_testcase....
2 \n\n\n#include "std_testcase.h"\n\n#ifndef _WI...
3 \n\n\n#include "std_testcase.h"\n\n#include <w...
4 \n\n\n#include "std_testcase.h"\n\n#ifdef _WIN...
5 \n\n\n#include "std_testcase.h"\n#include <lis...
6 \n\n#ifndef OMITBAD\n\n#include "std_testcase....
7 \n\n\n#include "std_testcase.h"\n#include <lis...
8 \n\n\n#include "std_testcase.h"\n\n#include <w...
9 \n\n#ifndef OMITGOOD\n\n#include "std_testcase...
in_sample_top_10_cwe_df.columns # .iloc[0,0] ## Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
# in_sample_top_10_cwe_df.iloc[0,1]
print(in_sample_top_10_cwe_df.iloc[0,1])
#include "std_testcase.h"
#include <wchar.h>
static int staticReturnsTrue()
{
return 1;
}
static int staticReturnsFalse()
{
return 0;
}
namespace CWE805_char_memcpy_08
{
#ifndef OMITBAD
void bad()
{
char * data;
data = NULL;
if(staticReturnsTrue())
{
data = new char[50];
data[0] = '\0';
}
{
char source[100];
memset(source, 'C', 100-1);
source[100-1] = '\0';
memcpy(data, source, 100*sizeof(char));
data[100-1] = '\0';
printLine(data);
delete [] data;
}
}
#endif
#ifndef OMITGOOD
static void goodG2B1()
{
char * data;
data = NULL;
if(staticReturnsFalse())
{
printLine("Benign, fixed string");
}
else
{
data = new char[100];
data[0] = '\0';
}
{
char source[100];
memset(source, 'C', 100-1);
source[100-1] = '\0';
memcpy(data, source, 100*sizeof(char));
data[100-1] = '\0';
printLine(data);
delete [] data;
}
}
static void goodG2B2()
{
char * data;
data = NULL;
if(staticReturnsTrue())
{
data = new char[100];
data[0] = '\0';
}
{
char source[100];
memset(source, 'C', 100-1);
source[100-1] = '\0';
memcpy(data, source, 100*sizeof(char));
data[100-1] = '\0';
printLine(data);
delete [] data;
}
}
void good()
{
goodG2B1();
goodG2B2();
}
#endif
}
#ifdef INCLUDEMAIN
using namespace CWE805_char_memcpy_08;
int main(int argc, char * argv[])
{
srand( (unsigned)time(NULL) );
#ifndef OMITGOOD
printLine("Calling good()...");
good();
printLine("Finished good()");
#endif
#ifndef OMITBAD
printLine("Calling bad()...");
bad();
printLine("Finished bad()");
#endif
return 0;
}
#endif
in_sample_top_10_cwe_df.iloc[0,0] ## CWE122_Heap_Based_Buffer_Overflow
'CWE122_Heap_Based_Buffer_Overflow'
print(in_sample_top_10_cwe_df.shape) # (10, 2)
(10, 2)
in_sample_top_10_cwe_df.columns ## Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
in_sample_top_10_cwe_df['CWE-ID']
0 CWE122_Heap_Based_Buffer_Overflow
1 CWE401_Memory_Leak
2 CWE36_Absolute_Path_Traversal
3 CWE127_Buffer_Underread
4 CWE23_Relative_Path_Traversal
5 CWE121_Stack_Based_Buffer_Overflow
6 CWE415_Double_Free
7 CWE762_Mismatched_Memory_Management_Routines
8 CWE590_Free_Memory_Not_on_Heap
9 CWE134_Uncontrolled_Format_String
Name: CWE-ID, dtype: object
in_sample_top_10_cwe_df
CWE-ID | Test-Case-Code | |
---|---|---|
0 | CWE122_Heap_Based_Buffer_Overflow | \n\n\n#include "std_testcase.h"\n\n#include <w... |
1 | CWE401_Memory_Leak | \n\n#ifndef OMITBAD\n\n#include "std_testcase.... |
2 | CWE36_Absolute_Path_Traversal | \n\n\n#include "std_testcase.h"\n\n#ifndef _WI... |
3 | CWE127_Buffer_Underread | \n\n\n#include "std_testcase.h"\n\n#include <w... |
4 | CWE23_Relative_Path_Traversal | \n\n\n#include "std_testcase.h"\n\n#ifdef _WIN... |
5 | CWE121_Stack_Based_Buffer_Overflow | \n\n\n#include "std_testcase.h"\n#include <lis... |
6 | CWE415_Double_Free | \n\n#ifndef OMITBAD\n\n#include "std_testcase.... |
7 | CWE762_Mismatched_Memory_Management_Routines | \n\n\n#include "std_testcase.h"\n#include <lis... |
8 | CWE590_Free_Memory_Not_on_Heap | \n\n\n#include "std_testcase.h"\n\n#include <w... |
9 | CWE134_Uncontrolled_Format_String | \n\n#ifndef OMITGOOD\n\n#include "std_testcase... |
# %%time
# Specify the indices
# indices = [0,1,2]
indices = [3,4,5,6]
# indices = [7,8,9]
# Load the transformers and the model
= pickle.load(open("bow_transformer.pk", "rb"))
bow_transf = pickle.load(open("tfidf_transformer.pk", "rb"))
tfidf_transf with open('nb_model.pk', 'rb') as nb:
= pickle.load(nb)
model
# Create a mapping from class labels to CWE-IDs
= {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}
label_to_cwe
# Loop through the specified indices
for in_sample_top_10_cwe_index in indices:
    test_text = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index,1]
    test_bow = bow_transf.transform([test_text])
    test_tfidf = tfidf_transf.transform(test_bow)
    foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
    print('-----------------------------------------')
    print(f"The predicted CWE-ID is: {foresights_id}")
    # Use the mapping to get the CWE-ID
    if foresights_id in label_to_cwe:
        predicted_cwe_id = label_to_cwe[foresights_id]
        print(f"The predicted CWE-ID is: {predicted_cwe_id}")
        # Check if the model's prediction matches the actual CWE-ID
        actual_cwe_id = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index]['CWE-ID']
        is_correct = (actual_cwe_id == predicted_cwe_id)
        print(f"Is the prediction correct? {is_correct}")
    else:
        print(f"The predicted label {foresights_id} is not in the training data.")
## CPU times: user 52.5 ms, sys: 4.17 ms, total: 56.7 ms
## Wall time: 103 ms
-----------------------------------------
The predicted CWE-ID is: 4
The predicted CWE-ID is: CWE23_Relative_Path_Traversal
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 11
The predicted label 11 is not in the training data.
-----------------------------------------
The predicted CWE-ID is: 0
The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 16
The predicted label 16 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 4
# The predicted CWE-ID is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 5
# The predicted CWE-ID is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 0
# The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
# # Load the transformers and the model
# bow_transf = pickle.load(open("bow_transformer.pk", "rb"))
# tfidf_transf = pickle.load(open("tfidf_transformer.pk", "rb"))
# with open('nb_model.pk', 'rb') as nb:
# model = pickle.load(nb)
# # Create a mapping from class labels to CWE-IDs
# label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}
# # Loop through all 10 samples
# for in_sample_top_10_cwe_index in range(10):
# test_text = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index,1]
# test_bow = bow_transf.transform([test_text])
# test_tfidf = tfidf_transf.transform(test_bow)
# foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
# print('-----------------------------------------')
# print(f"The predicted CWE-ID index is: {foresights_id}")
# # Use the mapping to get the CWE-ID
# if foresights_id in label_to_cwe:
# predicted_cwe_id = label_to_cwe[foresights_id]
# print(f"The predicted CWE-ID label is: {predicted_cwe_id}")
# # Check if the model's prediction matches the actual CWE-ID
# actual_cwe_id = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index]['CWE-ID']
# is_correct = (actual_cwe_id == predicted_cwe_id)
# print(f"Is the prediction correct? {is_correct}")
# else:
# print(f"The predicted label {foresights_id} is not in the training data.")
# -----------------------------------------
# The predicted CWE-ID index is: 3
# The predicted CWE-ID label is: CWE415_Double_Free
# Is the prediction correct? False
# -----------------------------------------
# The predicted CWE-ID index is: 11
# The predicted label 11 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 4
# The predicted CWE-ID label is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID index is: 5
# The predicted CWE-ID label is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID index is: 0
# The predicted CWE-ID label is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
# -----------------------------------------
# The predicted CWE-ID index is: 11
# The predicted label 11 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 22
# The predicted label 22 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 15
# The predicted label 15 is not in the training data.
in_sample_top_10_cwe_df.iloc[6,1]
{"type":"string"}
in_sample_top_10_cwe_df.iloc[6,0]
'CWE415_Double_Free'
## print out the actual predicted CWE-ID instead of array([3])
test_text = in_sample_top_10_cwe_df.iloc[6,1]
# this is how you reload and use the BoW transformer
= pickle.load(open("bow_transformer.pk", "rb"))
bow_transf = bow_transf.transform([test_text])
test_bow
# this is how you reload and use the TF-IDF transformer
# remember it is applied to the result of bow_transformer
= pickle.load(open("tfidf_transformer.pk", "rb"))
tfidf_transf = tfidf_transf.transform(test_bow)
test_tfidf
# here we reload the saved NaiveBayes model and use it to predict the class of our test text
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# model.predict(test_tfidf) ## array([3]) ## array([13])
foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
print(f"The predicted CWE-ID is: {foresights_id}")
# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}
# Use the mapping to get the CWE-ID
if foresights_id in label_to_cwe:
    predicted_cwe_id = label_to_cwe[foresights_id]
    print(f"The predicted CWE-ID is: {predicted_cwe_id}")
else:
    print(f"The predicted label {foresights_id} is not in the training data.")
The predicted CWE-ID is: 16
The predicted label 16 is not in the training data.
# text_tfidf = tfidf_transformer.transform(text_bow)
# print(text_tfidf.shape) ## (7628, 12766)
# # Train non-leaking samples
# # text_tfidf.shape ## (7628, 12766)
from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df['Category'], test_size=0.2, random_state=0) ## 11
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df_target, test_size=0.2, random_state=0) ## 11
%%time
# text_tfidf[2:4,1000:2050].toarray()
# text_tfidf['Category'].toarray()
# test_bow = bow_transformer.transform([input_text])
test_bow = bow_transformer.transform(df['Text'])
test_data = tfidf_transformer.transform(test_bow)
# model.predict(test_data)
test_data.shape ## (7628, 12766)
## CPU times: user 16.8 s, sys: 36.6 ms, total: 16.8 s
## Wall time: 16.8 s
CPU times: user 15.1 s, sys: 257 ms, total: 15.4 s
Wall time: 15.5 s
(7628, 12766)
# X_text=data_filtered.iloc[:,0] # subset of only [Description] raw text
X_text = text_tfidf ## OK!
X_text
<7628x12766 sparse matrix of type '<class 'numpy.float64'>'
with 316364 stored elements in Compressed Sparse Row format>
# y_category=data_filtered.iloc[:,1:4] # subset of only [Category, Level_2, Level_3] Category cols
y_category = df_target # df_cat # data['Category'] ## dataset.target
y_category
0 11
1 3
2 18
3 24
4 4
..
7623 16
7624 11
7625 5
7626 8
7627 8
Name: Category, Length: 7628, dtype: int64
X = text_tfidf # data['Text'] ## dataset.data #! ## ValueError: could not convert string to float: 'peopl allar ...'
# y = data['Category'] # df_cat # data['Category'] ## dataset.target
y = df_target # df_cat # data['Category'] ## dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)
## Train/Test split
# train,test=train_test_split(data_drop_na,test_size=0.25,random_state=0)
# X_tr,X_te,y_tr,y_te=train_test_split(X_text,y_category,test_size=0.2,random_state=0) # test_size=0.25
X_tr, X_te, y_tr, y_te = train_test_split(X_text, y_category, test_size=0.2, random_state=0) # test_size=0.25
## train.shape ## (7977, 4) ## (7986, 16506)
# y_tr.shape ## (7970, 3) ## (7977, 3)
y_te.shape ## (1526, 1) ## (2657, 3) ## (2660, 3)
(1526,)
"""
## Setup
"""
## !pip install -q "tensorflow-text" # ==2.13.*"
# import tensorflow_text as tf_text
# ## TensorFlow backend only supports string inputs
# os.environ["KERAS_BACKEND"] = "tensorflow"
# import keras
# from keras import layers
# import tensorflow as tf
## IMPORTS & Params at the top
{"type":"string"}
"""
## Download the SARD cpp_8750_files data
"""
# !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
!wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# data_path = keras.utils.get_file(
# "sard.zip",
# "https://raw.githubusercontent.com/c6ai/temp/main/sard.zip",
# untar=True,
# )
# data_path = keras.utils.get_file(
# "cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz",
# "https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz",
# untar=True,
# )
--2024-01-07 10:35:54-- https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 895869 (875K) [application/octet-stream]
Saving to: ‘cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz’
cpp_clean 0%[ ] 0 --.-KB/s
cpp_cleaner_8750_fi 100%[===================>] 874.87K --.-KB/s in 0.006s
2024-01-07 10:35:54 (140 MB/s) - ‘cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz’ saved [895869/895869]
## extract or un-tar (unzip):
!tar -xzf cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
!ls -lh # /content/
## total 880K
## drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
## -rw-r--r-- 1 root root 875K Jan 2 09:29 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
total 20M
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 15K Jan 7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
# !ls -lh /root/.keras/datasets/
# ## -rw-r--r-- 1 root root 2.3K Jan 2 08:00 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# data_path ## /root/.keras/datasets/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
data_path = 'cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'
# !ls -lh /root/.keras/datasets/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
"""
## Let's take a look at the data
"""
data_dir = pathlib.Path(data_path).parent / "cpp_cleaner_8750_files_each_350_top_25_cwe_omitted"
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "CWE122_Heap_Based_Buffer_Overflow")
print("Number of files in CWE122_Heap_Based_Buffer_Overflow:", len(fnames))
print("Some example filenames:", fnames[:5])
# Number of directories: 25
# Directory names:
## ['CWE369_Divide_by_Zero', 'CWE762_Mismatched_Memory_Management_Routines',
## 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE36_Absolute_Path_Traversal',
## 'CWE590_Free_Memory_Not_on_Heap', 'CWE690_NULL_Deref_From_Return', 'CWE789_Uncontrolled_Mem_Alloc',
## 'CWE197_Numeric_Truncation_Error', 'CWE401_Memory_Leak', 'CWE400_Resource_Exhaustion',
## 'CWE127_Buffer_Underread', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE191_Integer_Underflow',
## 'CWE122_Heap_Based_Buffer_Overflow', 'CWE190_Integer_Overflow', 'CWE563_Unused_Variable',
## 'CWE194_Unexpected_Sign_Extension', 'CWE23_Relative_Path_Traversal', 'CWE457_Use_of_Uninitialized_Variable',
## 'CWE415_Double_Free', 'CWE126_Buffer_Overread', 'CWE134_Uncontrolled_Format_String',
## 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE78_OS_Command_Injection', 'CWE124_Buffer_Underwrite']
# Number of files in CWE122_Heap_Based_Buffer_Overflow: 316
# Some example filenames:
## ['t_memcpy_72a.cpp', 'loop_61b.cpp', 'dest_char_cat_34.cpp', 'ncpy_54b.cpp', 'ncpy_43.cpp']
Number of directories: 25
Directory names: ['CWE124_Buffer_Underwrite', 'CWE122_Heap_Based_Buffer_Overflow', 'CWE194_Unexpected_Sign_Extension', 'CWE401_Memory_Leak', 'CWE126_Buffer_Overread', 'CWE36_Absolute_Path_Traversal', 'CWE457_Use_of_Uninitialized_Variable', 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE197_Numeric_Truncation_Error', 'CWE78_OS_Command_Injection', 'CWE563_Unused_Variable', 'CWE127_Buffer_Underread', 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE23_Relative_Path_Traversal', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE369_Divide_by_Zero', 'CWE190_Integer_Overflow', 'CWE415_Double_Free', 'CWE762_Mismatched_Memory_Management_Routines', 'CWE690_NULL_Deref_From_Return', 'CWE590_Free_Memory_Not_on_Heap', 'CWE191_Integer_Underflow', 'CWE400_Resource_Exhaustion', 'CWE134_Uncontrolled_Format_String', 'CWE789_Uncontrolled_Mem_Alloc']
Number of files in CWE122_Heap_Based_Buffer_Overflow: 316
Some example filenames: ['socket_82_goodG2B.cpp', 'cpy_82a.cpp', 't_memcpy_43.cpp', 't_cpy_63b.cpp', 'snprintf_84_goodG2B.cpp']
"""
Here's an example of what one file contains:
"""
## /content/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted/CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.cpp
# print(open(data_dir / "CWE122_Heap_Based_Buffer_Overflow" / "83_goodG2B.cpp").read())
"""
As you can see, the raw SARD files contain header lines and identifiers that leak the
file's category, either explicitly (the CWE ID literally appears in file names and
contents) or implicitly. Let's confirm these leaking markers have been stripped from
the cleaned dataset:
"""
{"type":"string"}
# Initialize the count
count = 0

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'):
    for filename in files:
        # Count the occurrences of 'CWE' in the file name
        count += len(re.findall(r'CWE', filename))
        # Get the file path
        file_path = os.path.join(dirpath, filename)
        with open(file_path, 'r') as file:
            try:
                file_data = file.read()
                # Count the occurrences of 'CWE' in the file data
                count += len(re.findall(r'CWE', file_data))
            except UnicodeDecodeError:
                # Skip files that can't be decoded
                pass

print(count)
## 1052 ## clean
## 0 ## cleaner
0
# Get a list of all file paths and their corresponding labels
root_folder_path = 'cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'
file_paths = []
labels = []
for dirpath, dirnames, filenames in os.walk(root_folder_path):
    for filename in filenames:
        file_paths.append(os.path.join(dirpath, filename))
        # Use the name of the parent directory as the label
        labels.append(os.path.basename(dirpath))
# Split the file paths into training and testing sets
file_paths_train, file_paths_test, labels_train, labels_test = train_test_split(file_paths, labels, test_size=0.2, stratify=labels)
# Function to copy files to a new directory structure
def copy_files(file_paths, labels, dest_folder):
    for file_path, label in zip(file_paths, labels):
        dest_dir = os.path.join(dest_folder, label)
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        shutil.copy(file_path, dest_dir)

# Copy the files to the 'train' and 'test' directories
copy_files(file_paths_train, labels_train, 'train')
copy_files(file_paths_test, labels_test, 'test')
!find train -type f | cut -d/ -f2 | sort | uniq -c
259 CWE121_Stack_Based_Buffer_Overflow
253 CWE122_Heap_Based_Buffer_Overflow
265 CWE124_Buffer_Underwrite
268 CWE126_Buffer_Overread
266 CWE127_Buffer_Underread
280 CWE134_Uncontrolled_Format_String
238 CWE190_Integer_Overflow
235 CWE191_Integer_Underflow
132 CWE194_Unexpected_Sign_Extension
127 CWE195_Signed_to_Unsigned_Conversion_Error
280 CWE197_Numeric_Truncation_Error
280 CWE23_Relative_Path_Traversal
280 CWE369_Divide_by_Zero
280 CWE36_Absolute_Path_Traversal
125 CWE400_Resource_Exhaustion
266 CWE401_Memory_Leak
280 CWE415_Double_Free
238 CWE457_Use_of_Uninitialized_Variable
280 CWE563_Unused_Variable
278 CWE590_Free_Memory_Not_on_Heap
241 CWE680_Integer_Overflow_to_Buffer_Overflow
134 CWE690_NULL_Deref_From_Return
279 CWE762_Mismatched_Memory_Management_Routines
258 CWE789_Uncontrolled_Mem_Alloc
280 CWE78_OS_Command_Injection
def count_files_in_subfolders(root_folder):
    for dirpath, dirnames, filenames in os.walk(root_folder):
        print(f"There are {len(filenames)} files in the '{os.path.relpath(dirpath, root_folder)}' subfolder.")

# print("In the 'train' directory:")
# count_files_in_subfolders('train')

print("\nIn the 'test' directory:")
count_files_in_subfolders('test')
In the 'test' directory:
There are 0 files in the '.' subfolder.
There are 66 files in the 'CWE124_Buffer_Underwrite' subfolder.
There are 63 files in the 'CWE122_Heap_Based_Buffer_Overflow' subfolder.
There are 33 files in the 'CWE194_Unexpected_Sign_Extension' subfolder.
There are 67 files in the 'CWE401_Memory_Leak' subfolder.
There are 67 files in the 'CWE126_Buffer_Overread' subfolder.
There are 70 files in the 'CWE36_Absolute_Path_Traversal' subfolder.
There are 59 files in the 'CWE457_Use_of_Uninitialized_Variable' subfolder.
There are 31 files in the 'CWE195_Signed_to_Unsigned_Conversion_Error' subfolder.
There are 70 files in the 'CWE197_Numeric_Truncation_Error' subfolder.
There are 70 files in the 'CWE78_OS_Command_Injection' subfolder.
There are 70 files in the 'CWE563_Unused_Variable' subfolder.
There are 67 files in the 'CWE127_Buffer_Underread' subfolder.
There are 60 files in the 'CWE680_Integer_Overflow_to_Buffer_Overflow' subfolder.
There are 70 files in the 'CWE23_Relative_Path_Traversal' subfolder.
There are 65 files in the 'CWE121_Stack_Based_Buffer_Overflow' subfolder.
There are 70 files in the 'CWE369_Divide_by_Zero' subfolder.
There are 60 files in the 'CWE190_Integer_Overflow' subfolder.
There are 70 files in the 'CWE415_Double_Free' subfolder.
There are 70 files in the 'CWE762_Mismatched_Memory_Management_Routines' subfolder.
There are 33 files in the 'CWE690_NULL_Deref_From_Return' subfolder.
There are 70 files in the 'CWE590_Free_Memory_Not_on_Heap' subfolder.
There are 59 files in the 'CWE191_Integer_Underflow' subfolder.
There are 31 files in the 'CWE400_Resource_Exhaustion' subfolder.
There are 70 files in the 'CWE134_Uncontrolled_Format_String' subfolder.
There are 65 files in the 'CWE789_Uncontrolled_Mem_Alloc' subfolder.
def count_files_in_folder(root_folder):
    total_files = 0
    for dirpath, dirnames, filenames in os.walk(root_folder):
        total_files += len(filenames)
    print(f"There are {total_files} files in the '{root_folder}' directory.")

print("In the 'train' directory:")
count_files_in_folder('train')

print("\nIn the 'test' directory:")
count_files_in_folder('test')
## In the 'train' directory:
### There are 6102 files in the 'train' directory.
## In the 'test' directory:
### There are 2781 files in the 'test' directory.
In the 'train' directory:
There are 6102 files in the 'train' directory.
In the 'test' directory:
There are 1526 files in the 'test' directory.
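"""
To sanity-check the stratified 80/20 split, the per-class counts in labels_train and
labels_test can be compared directly (a sketch using collections.Counter on the lists
built above):
"""
from collections import Counter

train_counts = Counter(labels_train)
test_counts = Counter(labels_test)
for label in sorted(train_counts):
    total = train_counts[label] + test_counts[label]
    print(f"{label}: {train_counts[label]} train / {test_counts[label]} test "
          f"({test_counts[label] / total:.0%} held out)")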
## cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# !tar -czf cpp_ready_8750_files_each_350_top_25_cwe_omitted.tar.gz cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
!tar -czf cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz train test
# !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
# keras.utils.text_dataset_from_directory(
# directory,
# labels="inferred",
# label_mode="int",
# class_names=None,
# batch_size=32,
# max_length=None,
# shuffle=True,
# seed=None,
# validation_split=None,
# subset=None,
# follow_links=False,
# )
"""
We download a dataset of programming TestCases from SARD. Each TestCase is labeled
with exactly one tag (CWE-ID#121-***, CWE-ID#122-***, etc.).
"""
!mv cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
## !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
# !gdown 1j1nY2qlLnA_Iap0_ug8ZAQuDgKX1QIJB
## !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/*.tar.gz'
data_url = 'https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='cache_dir',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
Downloading data from https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
899225/899225 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz'),
PosixPath('/tmp/.keras/test'),
PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'),
PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz'),
PosixPath('/tmp/.keras/train')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/CWE124_Buffer_Underwrite'),
PosixPath('/tmp/.keras/train/CWE122_Heap_Based_Buffer_Overflow'),
PosixPath('/tmp/.keras/train/CWE194_Unexpected_Sign_Extension'),
PosixPath('/tmp/.keras/train/CWE401_Memory_Leak'),
PosixPath('/tmp/.keras/train/CWE126_Buffer_Overread'),
PosixPath('/tmp/.keras/train/CWE36_Absolute_Path_Traversal'),
PosixPath('/tmp/.keras/train/CWE457_Use_of_Uninitialized_Variable'),
PosixPath('/tmp/.keras/train/CWE195_Signed_to_Unsigned_Conversion_Error'),
PosixPath('/tmp/.keras/train/CWE197_Numeric_Truncation_Error'),
PosixPath('/tmp/.keras/train/CWE78_OS_Command_Injection'),
PosixPath('/tmp/.keras/train/CWE563_Unused_Variable'),
PosixPath('/tmp/.keras/train/CWE127_Buffer_Underread'),
PosixPath('/tmp/.keras/train/CWE680_Integer_Overflow_to_Buffer_Overflow'),
PosixPath('/tmp/.keras/train/CWE23_Relative_Path_Traversal'),
PosixPath('/tmp/.keras/train/CWE121_Stack_Based_Buffer_Overflow'),
PosixPath('/tmp/.keras/train/CWE369_Divide_by_Zero'),
PosixPath('/tmp/.keras/train/CWE190_Integer_Overflow'),
PosixPath('/tmp/.keras/train/CWE415_Double_Free'),
PosixPath('/tmp/.keras/train/CWE762_Mismatched_Memory_Management_Routines'),
PosixPath('/tmp/.keras/train/CWE690_NULL_Deref_From_Return'),
PosixPath('/tmp/.keras/train/CWE590_Free_Memory_Not_on_Heap'),
PosixPath('/tmp/.keras/train/CWE191_Integer_Underflow'),
PosixPath('/tmp/.keras/train/CWE400_Resource_Exhaustion'),
PosixPath('/tmp/.keras/train/CWE134_Uncontrolled_Format_String'),
PosixPath('/tmp/.keras/train/CWE789_Uncontrolled_Mem_Alloc')]
"""
The train/CWE-ID#121-***, train/CWE-ID#122-***, ... directories contain many text
files, each of which is a SARD TestCase.
"""
## ValueError: No text files found in directory /tmp/.keras/train. Allowed format: .txt
# import glob
def change_file_extension(root_folder, old_ext, new_ext):
    for filename in glob.iglob(root_folder + '**/*' + old_ext, recursive=True):
        base = os.path.splitext(filename)[0]
        os.rename(filename, base + new_ext)

change_file_extension('/tmp/.keras/', '.cpp', '.txt')
# !ls /tmp/.keras/train/CWE122_Heap_Based_Buffer_Overflow
## print(open(data_dir / "CWE122_Heap_Based_Buffer_Overflow" / "83_goodG2B.cpp").read())
# sample_file = train_dir/'python/1755.txt'
# sample_file = train_dir/'CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.cpp'
sample_file = train_dir/'CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.txt'

with open(sample_file) as f:
    print(f.read())
#ifndef OMITGOOD
#include "std_testcase.h"
#include "83.h"
namespace 83
{
83_goodG2B::83_goodG2B(int dataCopy)
{
data = dataCopy;
data = 7;
}
83_goodG2B::~83_goodG2B()
{
{
int i;
int * buffer = new int[10];
for (i = 0; i < 10; i++)
{
buffer[i] = 0;
}
if (data >= 0)
{
buffer[data] = 1;
for(i = 0; i < 10; i++)
{
printIntLine(buffer[i]);
}
}
else
{
printLine("ERROR: Array index is negative.");
}
delete[] buffer;
}
}
}
#endif
"""
The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...CWE-ID#121-***/
......1.txt
......2.txt
...CWE-ID#122-***/
......1.txt
......2.txt
...CWE-ID#1xx-***/
......1.txt
......2.txt
...CWE-ID#1x-***/
......1.txt
......2.txt
"""
batch_size = batch_size ## both values come from the IMPORTS & Params cell at the top
seed = seed
raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)
## Found 6102 files belonging to 25 classes.
## Using 4882 files for training.
Found 6102 files belonging to 25 classes.
Using 4882 files for training.
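"""
Only the training subset is materialised above; the matching validation split would use
the same seed so the 80/20 partition lines up (a sketch mirroring the call above):
"""
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)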
for text_batch, label_batch in raw_train_ds.take(1):
    for i in range(10):
        print("TestCase: ", text_batch.numpy()[i])
        print("Label:", label_batch.numpy()[i])
TestCase: b'\n\n#ifndef OMITGOOD\n\n#include "std_testcase.h"\n#include "file_snprintf_81.h"\n\n#ifdef _WIN32\n#define SNPRINTF _snprintf\n#else\n#define SNPRINTF snprintf\n#endif\n\nnamespace file_snprintf_81\n{\n\nvoid file_snprintf_81_goodB2G::action(char * data) const\n{\n {\n char dest[100] = "";\n \n SNPRINTF(dest, 100-1, "%s", data);\n printLine(dest);\n }\n}\n\n}\n#endif \n'
Label: 5
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#ifdef _WIN32\n#define BASEPATH "c:\\\\temp\\\\"\n#else\n#include <wchar.h>\n#define BASEPATH "/tmp/"\n#endif\n\n#ifdef _WIN32\n#include <winsock2.h>\n#include <windows.h>\n#include <direct.h>\n#pragma comment(lib, "ws2_32") \n#define CLOSE_SOCKET closesocket\n#else \n#include <sys/types.h>\n#include <sys/socket.h>\n#include <netinet/in.h>\n#include <arpa/inet.h>\n#include <unistd.h>\n#define INVALID_SOCKET -1\n#define SOCKET_ERROR -1\n#define CLOSE_SOCKET close\n#define SOCKET int\n#endif\n\n#define TCP_PORT 27015\n#define IP_ADDRESS "127.0.0.1"\n\n\nnamespace connect_socket_w32CreateFile_53\n{\n\n#ifndef OMITBAD\n\n\nvoid badSink_b(char * data);\n\nvoid bad()\n{\n char * data;\n char dataBuffer[FILENAME_MAX] = BASEPATH;\n data = dataBuffer;\n {\n#ifdef _WIN32\n WSADATA wsaData;\n int wsaDataInit = 0;\n#endif\n int recvResult;\n struct sockaddr_in service;\n char *replace;\n SOCKET connectSocket = INVALID_SOCKET;\n size_t dataLen = strlen(data);\n do\n {\n#ifdef _WIN32\n if (WSAStartup(MAKEWORD(2,2), &wsaData) != NO_ERROR)\n {\n break;\n }\n wsaDataInit = 1;\n#endif\n \n connectSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);\n if (connectSocket == INVALID_SOCKET)\n {\n break;\n }\n memset(&service, 0, sizeof(service));\n service.sin_family = AF_INET;\n service.sin_addr.s_addr = inet_addr(IP_ADDRESS);\n service.sin_port = htons(TCP_PORT);\n if (connect(connectSocket, (struct sockaddr*)&service, sizeof(service)) == SOCKET_ERROR)\n {\n break;\n }\n \n \n recvResult = recv(connectSocket, (char *)(data + dataLen), sizeof(char) * (FILENAME_MAX - dataLen - 1), 0);\n if (recvResult == SOCKET_ERROR || recvResult == 0)\n {\n break;\n }\n \n data[dataLen + recvResult / sizeof(char)] = \'\\0\';\n \n replace = strchr(data, \'\\r\');\n if (replace)\n {\n *replace = \'\\0\';\n }\n replace = strchr(data, \'\\n\');\n if (replace)\n {\n *replace = \'\\0\';\n }\n }\n while (0);\n if (connectSocket != INVALID_SOCKET)\n {\n CLOSE_SOCKET(connectSocket);\n }\n#ifdef _WIN32\n if (wsaDataInit)\n {\n WSACleanup();\n }\n#endif\n }\n badSink_b(data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink_b(char * data);\n\n\nstatic void goodG2B()\n{\n char * data;\n char dataBuffer[FILENAME_MAX] = BASEPATH;\n data = dataBuffer;\n \n strcat(data, "file.txt");\n goodG2BSink_b(data);\n}\n\nvoid good()\n{\n goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace connect_socket_w32CreateFile_53; \n\nint main(int argc, char * argv[])\n{\n \n srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n printLine("Calling good()...");\n good();\n printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n printLine("Calling bad()...");\n bad();\n printLine("Finished bad()");\n#endif \n return 0;\n}\n\n#endif\n'
Label: 11
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\nnamespace char_alloca_65\n{\n\n#ifndef OMITBAD\n\n\nvoid badSink(char * data);\n\nvoid bad()\n{\n char * data;\n \n void (*funcPtr) (char *) = badSink;\n data = NULL; \n {\n \n char * dataBuffer = (char *)ALLOCA(sizeof(char));\n *dataBuffer = \'A\';\n data = dataBuffer;\n }\n \n funcPtr(data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink(char * data);\n\nstatic void goodG2B()\n{\n char * data;\n void (*funcPtr) (char *) = goodG2BSink;\n data = NULL; \n {\n \n char * dataBuffer = new char;\n *dataBuffer = \'A\';\n data = dataBuffer;\n }\n funcPtr(data);\n}\n\nvoid good()\n{\n goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace char_alloca_65; \n\nint main(int argc, char * argv[])\n{\n \n srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n printLine("Calling good()...");\n good();\n printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n printLine("Calling bad()...");\n bad();\n printLine("Finished bad()");\n#endif \n return 0;\n}\n\n#endif\n'
Label: 19
TestCase: b'\n\n\n#include "std_testcase.h"\n\nnamespace fscanf_62\n{\n\n#ifndef OMITBAD\n\nvoid badSource(int &data)\n{\n \n fscanf(stdin, "%d", &data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSource(int &data)\n{\n \n data = 20;\n}\n\n#endif \n\n} \n'
Label: 20
TestCase: b'\n\n#ifndef OMITBAD\n\n#include "std_testcase.h"\n#include "listen_socket_square_81.h"\n\n#include <math.h>\n\nnamespace listen_socket_square_81\n{\n\nvoid listen_socket_square_81_bad::action(int data) const\n{\n {\n \n int result = data * data;\n printIntLine(result);\n }\n}\n\n}\n#endif \n'
Label: 6
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\n\nstatic const int STATIC_CONST_FIVE = 5;\n\nnamespace class_placement_new_06\n{\n\n#ifndef OMITBAD\n\nvoid bad()\n{\n TwoIntsClass * data;\n data = NULL; \n if(STATIC_CONST_FIVE==5)\n {\n {\n \n char buffer[sizeof(TwoIntsClass)];\n TwoIntsClass * dataBuffer = new(buffer) TwoIntsClass;\n dataBuffer->intOne = 2;\n dataBuffer->intTwo = 2;\n data = dataBuffer;\n }\n }\n printIntLine(data->intOne);\n \n delete data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2B1()\n{\n TwoIntsClass * data;\n data = NULL; \n if(STATIC_CONST_FIVE!=5)\n {\n \n printLine("Benign, fixed string");\n }\n else\n {\n {\n \n TwoIntsClass * dataBuffer = new TwoIntsClass;\n dataBuffer->intOne = 2;\n dataBuffer->intTwo = 2;\n data = dataBuffer;\n }\n }\n printIntLine(data->intOne);\n \n delete data;\n}\n\n\nstatic void goodG2B2()\n{\n TwoIntsClass * data;\n data = NULL; \n if(STATIC_CONST_FIVE==5)\n {\n {\n \n TwoIntsClass * dataBuffer = new TwoIntsClass;\n dataBuffer->intOne = 2;\n dataBuffer->intTwo = 2;\n data = dataBuffer;\n }\n }\n printIntLine(data->intOne);\n \n delete data;\n}\n\nvoid good()\n{\n goodG2B1();\n goodG2B2();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace class_placement_new_06; \n\nint main(int argc, char * argv[])\n{\n \n srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n printLine("Calling good()...");\n good();\n printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n printLine("Calling bad()...");\n bad();\n printLine("Finished bad()");\n#endif \n return 0;\n}\n\n#endif\n'
Label: 19
TestCase: b'\n\n#ifndef OMITBAD\n\n#include "std_testcase.h"\n#include "array_free_struct_81.h"\n\nnamespace array_free_struct_81\n{\n\nvoid array_free_struct_81_bad::action(twoIntsStruct * data) const\n{\n \n free(data);\n}\n\n}\n#endif \n'
Label: 22
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#ifdef _WIN32\n#define BASEPATH L"c:\\\\temp\\\\"\n#else\n#include <wchar.h>\n#define BASEPATH L"/tmp/"\n#endif\n\n#ifdef _WIN32\n#define FILENAME "C:\\\\temp\\\\file.txt"\n#else\n#define FILENAME "/tmp/file.txt"\n#endif\n\n#ifdef _WIN32\n#define FOPEN _wfopen\n#else\n#define FOPEN fopen\n#endif\n\nnamespace t_file_fopen_45\n{\n\nstatic wchar_t * badData;\nstatic wchar_t * goodG2BData;\n\n#ifndef OMITBAD\n\nstatic void badSink()\n{\n wchar_t * data = badData;\n {\n FILE *pFile = NULL;\n \n pFile = FOPEN(data, L"wb+");\n if (pFile != NULL)\n {\n fclose(pFile);\n }\n }\n}\n\nvoid bad()\n{\n wchar_t * data;\n wchar_t dataBuffer[FILENAME_MAX] = BASEPATH;\n data = dataBuffer;\n {\n \n size_t dataLen = wcslen(data);\n FILE * pFile;\n \n if (FILENAME_MAX-dataLen > 1)\n {\n pFile = fopen(FILENAME, "r");\n if (pFile != NULL)\n {\n \n if (fgetws(data+dataLen, (int)(FILENAME_MAX-dataLen), pFile) == NULL)\n {\n printLine("fgetws() failed");\n \n data[dataLen] = L\'\\0\';\n }\n fclose(pFile);\n }\n }\n }\n badData = data;\n badSink();\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2BSink()\n{\n wchar_t * data = goodG2BData;\n {\n FILE *pFile = NULL;\n \n pFile = FOPEN(data, L"wb+");\n if (pFile != NULL)\n {\n fclose(pFile);\n }\n }\n}\n\nstatic void goodG2B()\n{\n wchar_t * data;\n wchar_t dataBuffer[FILENAME_MAX] = BASEPATH;\n data = dataBuffer;\n \n wcscat(data, L"file.txt");\n goodG2BData = data;\n goodG2BSink();\n}\n\nvoid good()\n{\n goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace t_file_fopen_45; \n\nint main(int argc, char * argv[])\n{\n \n srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n printLine("Calling good()...");\n good();\n printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n printLine("Calling bad()...");\n bad();\n printLine("Finished bad()");\n#endif \n return 0;\n}\n\n#endif\n'
Label: 11
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\nnamespace char_memcpy_22\n{\n\n#ifndef OMITBAD\n\n\nextern int badGlobal;\n\nchar * badSource(char * data)\n{\n if(badGlobal)\n {\n {\n char * dataBuffer = new char[100];\n memset(dataBuffer, \'A\', 100-1);\n dataBuffer[100-1] = \'\\0\';\n \n data = dataBuffer - 8;\n }\n }\n return data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nextern int goodG2B1Global;\nextern int goodG2B2Global;\n\n\nchar * goodG2B1Source(char * data)\n{\n if(goodG2B1Global)\n {\n \n printLine("Benign, fixed string");\n }\n else\n {\n {\n char * dataBuffer = new char[100];\n memset(dataBuffer, \'A\', 100-1);\n dataBuffer[100-1] = \'\\0\';\n \n data = dataBuffer;\n }\n }\n return data;\n}\n\n\nchar * goodG2B2Source(char * data)\n{\n if(goodG2B2Global)\n {\n {\n char * dataBuffer = new char[100];\n memset(dataBuffer, \'A\', 100-1);\n dataBuffer[100-1] = \'\\0\';\n \n data = dataBuffer;\n }\n }\n return data;\n}\n\n#endif \n\n} \n'
Label: 4
TestCase: b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\n\nstatic int staticFive = 5;\n\nnamespace array_int64_t_static_07\n{\n\n#ifndef OMITBAD\n\nvoid bad()\n{\n int64_t * data;\n data = NULL; \n if(staticFive==5)\n {\n {\n \n static int64_t dataBuffer[100];\n {\n size_t i;\n for (i = 0; i < 100; i++)\n {\n dataBuffer[i] = 5LL;\n }\n }\n data = dataBuffer;\n }\n }\n printLongLongLine(data[0]);\n \n delete [] data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2B1()\n{\n int64_t * data;\n data = NULL; \n if(staticFive!=5)\n {\n \n printLine("Benign, fixed string");\n }\n else\n {\n {\n \n int64_t * dataBuffer = new int64_t[100];\n {\n size_t i;\n for (i = 0; i < 100; i++)\n {\n dataBuffer[i] = 5LL;\n }\n }\n data = dataBuffer;\n }\n }\n printLongLongLine(data[0]);\n \n delete [] data;\n}\n\n\nstatic void goodG2B2()\n{\n int64_t * data;\n data = NULL; \n if(staticFive==5)\n {\n {\n \n int64_t * dataBuffer = new int64_t[100];\n {\n size_t i;\n for (i = 0; i < 100; i++)\n {\n dataBuffer[i] = 5LL;\n }\n }\n data = dataBuffer;\n }\n }\n printLongLongLine(data[0]);\n \n delete [] data;\n}\n\nvoid good()\n{\n goodG2B1();\n goodG2B2();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace array_int64_t_static_07; \n\nint main(int argc, char * argv[])\n{\n \n srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n printLine("Calling good()...");\n good();\n printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n printLine("Calling bad()...");\n bad();\n printLine("Finished bad()");\n#endif \n return 0;\n}\n\n#endif\n'
Label: 19
for i, label in enumerate(raw_train_ds.class_names):
print("Label", i, "corresponds to", label)
Label 0 corresponds to CWE121_Stack_Based_Buffer_Overflow
Label 1 corresponds to CWE122_Heap_Based_Buffer_Overflow
Label 2 corresponds to CWE124_Buffer_Underwrite
Label 3 corresponds to CWE126_Buffer_Overread
Label 4 corresponds to CWE127_Buffer_Underread
Label 5 corresponds to CWE134_Uncontrolled_Format_String
Label 6 corresponds to CWE190_Integer_Overflow
Label 7 corresponds to CWE191_Integer_Underflow
Label 8 corresponds to CWE194_Unexpected_Sign_Extension
Label 9 corresponds to CWE195_Signed_to_Unsigned_Conversion_Error
Label 10 corresponds to CWE197_Numeric_Truncation_Error
Label 11 corresponds to CWE23_Relative_Path_Traversal
Label 12 corresponds to CWE369_Divide_by_Zero
Label 13 corresponds to CWE36_Absolute_Path_Traversal
Label 14 corresponds to CWE400_Resource_Exhaustion
Label 15 corresponds to CWE401_Memory_Leak
Label 16 corresponds to CWE415_Double_Free
Label 17 corresponds to CWE457_Use_of_Uninitialized_Variable
Label 18 corresponds to CWE563_Unused_Variable
Label 19 corresponds to CWE590_Free_Memory_Not_on_Heap
Label 20 corresponds to CWE680_Integer_Overflow_to_Buffer_Overflow
Label 21 corresponds to CWE690_NULL_Deref_From_Return
Label 22 corresponds to CWE762_Mismatched_Memory_Management_Routines
Label 23 corresponds to CWE789_Uncontrolled_Mem_Alloc
Label 24 corresponds to CWE78_OS_Command_Injection
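Since the `.cache().prefetch()` wrappers applied below drop the `class_names` attribute, a small helper can freeze the index-to-CWE mapping now for decoding predictions later. This is an added sketch, not part of the original run:

# Added sketch: freeze the label mapping before the datasets are wrapped.
class_names = list(raw_train_ds.class_names)

def decode_label(index):
    # Map an integer label back to its CWE class name.
    return class_names[index]

print(decode_label(11))  # CWE23_Relative_Path_Traversal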
# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
## Found 6102 files belonging to 25 classes.
## Using 1220 files for validation.
Found 6102 files belonging to 25 classes.
Using 1220 files for validation.
test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)
## Found 2781 files belonging to 25 classes.
Found 2781 files belonging to 25 classes.
raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_val_ds = raw_val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_test_ds = raw_test_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
VOCAB_SIZE = VOCAB_SIZE  # value set earlier in the notebook

# binary_vectorize_layer = TextVectorization(
multi_class_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')

MAX_SEQUENCE_LENGTH = MAX_SEQUENCE_LENGTH  # value set earlier in the notebook

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
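As a toy illustration (an added sketch, not from the original run) of the two output modes: 'binary' emits one multi-hot bag-of-words vector per document, while 'int' emits a padded sequence of token indices.

# Added sketch on a tiny two-document corpus.
demo_binary = TextVectorization(max_tokens=8, output_mode='binary')
demo_binary.adapt(["int main", "char buffer"])
print(demo_binary(tf.constant(["int buffer"])))  # one multi-hot row of vocabulary size

demo_int = TextVectorization(max_tokens=8, output_mode='int', output_sequence_length=4)
demo_int.adapt(["int main", "char buffer"])
print(demo_int(tf.constant(["int buffer"])))  # padded row of token indices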
# %%time
# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
multi_class_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
## CPU times: user 1.2 s, sys: 212 ms, total: 1.42 s
## Wall time: 981 ms
text_batch, label_batch = next(iter(raw_train_ds))
first_TestCase, first_label = text_batch[0], label_batch[0]
print("TestCase:", first_TestCase)
print("Label:", first_label)
## TestCase: tf.Tensor(b'\n\n\n#include "std_testcase.h"\n#include <vector>\n\nusing namespace std;\n\nnamespace t_rand_multiply_72 ...
## Label: tf.Tensor(7, shape=(), dtype=int32)
TestCase: tf.Tensor(b'\n\n\n#include "std_testcase.h"\n\n#ifndef _WIN32\n#include <wchar.h>\n#endif\n\n#define ENV_VARIABLE L"ADD"\n\n#ifdef _WIN32\n#define GETENV _wgetenv\n#else\n#define GETENV getenv\n#endif\n\n#ifdef _WIN32\n#define OPEN _wopen\n#define CLOSE _close\n#else\n#include <unistd.h>\n#define OPEN open\n#define CLOSE close\n#endif\n\nnamespace t_environment_open_54\n{\n\n\n\n#ifndef OMITBAD\n\nvoid badSink_e(wchar_t * data)\n{\n {\n int fileDesc;\n \n fileDesc = OPEN(data, O_RDWR|O_CREAT, S_IREAD|S_IWRITE);\n if (fileDesc != -1)\n {\n CLOSE(fileDesc);\n }\n }\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink_e(wchar_t * data)\n{\n {\n int fileDesc;\n \n fileDesc = OPEN(data, O_RDWR|O_CREAT, S_IREAD|S_IWRITE);\n if (fileDesc != -1)\n {\n CLOSE(fileDesc);\n }\n }\n}\n\n#endif \n\n} \n', shape=(), dtype=string)
Label: tf.Tensor(13, shape=(), dtype=int32)
# print("'binary' vectorized TestCase:",
# list(multi_class_vectorize_layer(first_TestCase).numpy()))
plt.plot(multi_class_vectorize_layer(first_TestCase).numpy())
plt.xlim(0, 1000)
## (0.0, 1000.0)
(0.0, 1000.0)
print("'int' vectorized TestCase:",
int_vectorize_layer(first_TestCase).numpy())
'int' vectorized TestCase: [ 5 24 6 26 5 47 3 14 375 534 18 26 14 166
531 30 14 166 166 3 18 26 14 209 562 14 77 77
30 5 93 14 209 209 14 77 77 3 12 2758 6 16
4 807 2 10 178 178 366 365 363 13 178 19 369 3
6 15 4 797 2 10 178 178 366 365 363 13 178 19
369 3 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
# 'int' vectorized TestCase: [ 5 24 5 226 29 12 71 12 1667 6 16 4 787 74
# 69 2 218 299 7 69 120 2 83 465 3 6 15 4
# 851 74 69 2 218 299 7 69 120 2 83 465 4 854
# 74 69 2 218 299 7 13 2 931 69 120 2 83 465
# 30 151 315 91 179 642 287 316 644 3 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 0 0 0 0 0 0 0 0 0 0 0 0]
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))
## 1289 ---> wchartmemcpy81base
## 313 ---> 6
## Vocabulary size: 9203
1289 ---> wchartmemcpy81base
313 ---> 6
Vocabulary size: 9203
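For a quick look at what the tokenizer actually learned (an added sketch): index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, so the most frequent code tokens start at index 2.

# Added sketch: peek at the head of the learned vocabulary.
vocab = int_vectorize_layer.get_vocabulary()
print(vocab[:10])  # '', '[UNK]', then the most frequent code tokens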
# epochs=2 #10
# new_num_labels=25
# binary_model = tf.keras.Sequential([
clf_model = tf.keras.Sequential([
    # binary_vectorize_layer,
    multi_class_vectorize_layer,
    # layers.Dense(4)]) #25
    layers.Dense(new_num_labels)])

clf_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
tf.keras.utils.plot_model(clf_model, show_shapes=True)
# ValueError: This model has not yet been built. Build the model first by calling `build()` or by calling the model on a batch of data.
%%time
bin_history = clf_model.fit(
    raw_train_ds, validation_data=raw_val_ds, epochs=epochs) #10) # epochs)
# print()
# raw_train_ds, validation_data=raw_val_ds, epochs=2) # epochs) #epochs=10)
## CPU times: user 3.55 s, sys: 467 ms, total: 4.01 s
## Wall time: 2.21 s
## CPU times: user 13.7 s, sys: 1.73 s, total: 15.4 s
## Wall time: 16.3 s ## 7.31 s
Epoch 1/15
153/153 [==============================] - 2s 11ms/step - loss: 2.5163 - accuracy: 0.5401 - val_loss: 1.9911 - val_accuracy: 0.7434
Epoch 2/15
153/153 [==============================] - 1s 9ms/step - loss: 1.6466 - accuracy: 0.7911 - val_loss: 1.4668 - val_accuracy: 0.7893
Epoch 3/15
153/153 [==============================] - 1s 9ms/step - loss: 1.2317 - accuracy: 0.8429 - val_loss: 1.1847 - val_accuracy: 0.8139
Epoch 4/15
153/153 [==============================] - 1s 8ms/step - loss: 0.9859 - accuracy: 0.8707 - val_loss: 1.0081 - val_accuracy: 0.8279
Epoch 5/15
153/153 [==============================] - 1s 10ms/step - loss: 0.8208 - accuracy: 0.8959 - val_loss: 0.8869 - val_accuracy: 0.8426
Epoch 6/15
153/153 [==============================] - 2s 11ms/step - loss: 0.7014 - accuracy: 0.9121 - val_loss: 0.7984 - val_accuracy: 0.8492
Epoch 7/15
153/153 [==============================] - 1s 9ms/step - loss: 0.6106 - accuracy: 0.9252 - val_loss: 0.7308 - val_accuracy: 0.8566
Epoch 8/15
153/153 [==============================] - 1s 9ms/step - loss: 0.5390 - accuracy: 0.9353 - val_loss: 0.6774 - val_accuracy: 0.8623
Epoch 9/15
153/153 [==============================] - 1s 8ms/step - loss: 0.4811 - accuracy: 0.9437 - val_loss: 0.6342 - val_accuracy: 0.8680
Epoch 10/15
153/153 [==============================] - 1s 9ms/step - loss: 0.4332 - accuracy: 0.9525 - val_loss: 0.5985 - val_accuracy: 0.8697
Epoch 11/15
153/153 [==============================] - 1s 9ms/step - loss: 0.3929 - accuracy: 0.9588 - val_loss: 0.5685 - val_accuracy: 0.8721
Epoch 12/15
153/153 [==============================] - 2s 13ms/step - loss: 0.3585 - accuracy: 0.9629 - val_loss: 0.5429 - val_accuracy: 0.8713
Epoch 13/15
153/153 [==============================] - 1s 8ms/step - loss: 0.3289 - accuracy: 0.9662 - val_loss: 0.5209 - val_accuracy: 0.8779
Epoch 14/15
153/153 [==============================] - 1s 9ms/step - loss: 0.3030 - accuracy: 0.9685 - val_loss: 0.5017 - val_accuracy: 0.8779
Epoch 15/15
153/153 [==============================] - 1s 8ms/step - loss: 0.2803 - accuracy: 0.9699 - val_loss: 0.4848 - val_accuracy: 0.8811
CPU times: user 25.6 s, sys: 1.04 s, total: 26.7 s
Wall time: 29.7 s
# Epoch 1/10
# 153/153 [==============================] - 1s 5ms/step - loss: 1.2311 - accuracy: 0.8449 - val_loss: 1.1815 - val_accuracy: 0.8270
# Epoch 2/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.9858 - accuracy: 0.8744 - val_loss: 1.0057 - val_accuracy: 0.8377
# Epoch 3/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.8211 - accuracy: 0.8996 - val_loss: 0.8850 - val_accuracy: 0.8434
# Epoch 4/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.7018 - accuracy: 0.9146 - val_loss: 0.7968 - val_accuracy: 0.8516
# Epoch 5/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.6111 - accuracy: 0.9273 - val_loss: 0.7295 - val_accuracy: 0.8566
# Epoch 6/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.5395 - accuracy: 0.9373 - val_loss: 0.6764 - val_accuracy: 0.8631
# Epoch 7/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.4816 - accuracy: 0.9461 - val_loss: 0.6335 - val_accuracy: 0.8656
# Epoch 8/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.4337 - accuracy: 0.9537 - val_loss: 0.5980 - val_accuracy: 0.8689
# Epoch 9/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.3934 - accuracy: 0.9588 - val_loss: 0.5681 - val_accuracy: 0.8738
# Epoch 10/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.3590 - accuracy: 0.9635 - val_loss: 0.5427 - val_accuracy: 0.8762
# CPU times: user 13.7 s, sys: 1.73 s, total: 15.4 s
# Wall time: 7.31 s
def create_model(vocab_size, num_labels, vectorizer=None):
    my_layers = []
    if vectorizer is not None:
        my_layers = [vectorizer]

    my_layers.extend([
        layers.Embedding(vocab_size, 64, mask_zero=True),
        layers.Dropout(0.5),
        layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_labels)
    ])

    model = tf.keras.Sequential(my_layers)
    return model
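A quick shape check (an added sketch, assuming the constants defined earlier in the notebook): without a vectorizer the model consumes already-vectorized integer sequences and emits one logit per label.

# Added sketch: build the layers on a dummy batch and confirm the output shape.
sanity_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=new_num_labels)
dummy_batch = tf.ones((2, MAX_SEQUENCE_LENGTH), dtype=tf.int64)
print(sanity_model(dummy_batch).shape)  # expected: (2, 25)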
# `vocab_size` is `VOCAB_SIZE + 1` since `0` is additionally used for padding.
# int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=4, vectorizer=int_vectorize_layer)
int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=new_num_labels, vectorizer=int_vectorize_layer) #num_labels=4

tf.keras.utils.plot_model(int_model, show_shapes=True)
%%time
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

int_history = int_model.fit(raw_train_ds, validation_data=raw_val_ds, epochs=epochs) # 10 # 15 #30
# int_history = int_model.fit(raw_train_ds, validation_data=raw_val_ds, epochs=10) # 10
## CPU times: user 1min 12s, sys: 6.5 s, total: 1min 18s
## Wall time: 36.9 s ## 19 s
Epoch 1/15
153/153 [==============================] - 7s 38ms/step - loss: 3.0561 - accuracy: 0.2003 - val_loss: 2.6361 - val_accuracy: 0.4566
Epoch 2/15
153/153 [==============================] - 5s 29ms/step - loss: 2.0293 - accuracy: 0.5244 - val_loss: 1.3931 - val_accuracy: 0.6369
Epoch 3/15
153/153 [==============================] - 6s 38ms/step - loss: 1.1298 - accuracy: 0.6940 - val_loss: 0.8160 - val_accuracy: 0.7893
Epoch 4/15
153/153 [==============================] - 4s 29ms/step - loss: 0.7124 - accuracy: 0.8097 - val_loss: 0.5603 - val_accuracy: 0.8484
Epoch 5/15
153/153 [==============================] - 5s 31ms/step - loss: 0.4967 - accuracy: 0.8644 - val_loss: 0.4404 - val_accuracy: 0.8746
Epoch 6/15
153/153 [==============================] - 5s 34ms/step - loss: 0.3827 - accuracy: 0.8978 - val_loss: 0.3835 - val_accuracy: 0.8836
Epoch 7/15
153/153 [==============================] - 6s 37ms/step - loss: 0.3063 - accuracy: 0.9166 - val_loss: 0.3424 - val_accuracy: 0.8910
Epoch 8/15
153/153 [==============================] - 4s 29ms/step - loss: 0.2523 - accuracy: 0.9254 - val_loss: 0.3256 - val_accuracy: 0.8951
Epoch 9/15
153/153 [==============================] - 6s 37ms/step - loss: 0.2111 - accuracy: 0.9385 - val_loss: 0.3080 - val_accuracy: 0.9000
Epoch 10/15
153/153 [==============================] - 5s 34ms/step - loss: 0.1796 - accuracy: 0.9461 - val_loss: 0.2971 - val_accuracy: 0.8992
Epoch 11/15
153/153 [==============================] - 5s 32ms/step - loss: 0.1619 - accuracy: 0.9478 - val_loss: 0.2930 - val_accuracy: 0.8975
Epoch 12/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1423 - accuracy: 0.9562 - val_loss: 0.2880 - val_accuracy: 0.9008
Epoch 13/15
153/153 [==============================] - 6s 37ms/step - loss: 0.1264 - accuracy: 0.9623 - val_loss: 0.2938 - val_accuracy: 0.8992
Epoch 14/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1131 - accuracy: 0.9650 - val_loss: 0.2860 - val_accuracy: 0.9008
Epoch 15/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1054 - accuracy: 0.9676 - val_loss: 0.2886 - val_accuracy: 0.9000
CPU times: user 1min 45s, sys: 3.28 s, total: 1min 48s
Wall time: 1min 37s
# # Epoch 1/10
# # 153/153 [==============================] - 3s 13ms/step - loss: 1.1379 - accuracy: 0.6780 - val_loss: 0.9188 - val_accuracy: 0.7361
# # Epoch 2/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.8141 - accuracy: 0.7638 - val_loss: 0.6918 - val_accuracy: 0.8008
# # Epoch 3/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.5948 - accuracy: 0.8265 - val_loss: 0.5431 - val_accuracy: 0.8467
# # Epoch 4/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.4447 - accuracy: 0.8734 - val_loss: 0.4490 - val_accuracy: 0.8680
# # Epoch 5/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.3474 - accuracy: 0.9007 - val_loss: 0.3991 - val_accuracy: 0.8754
# # Epoch 6/10
# # 153/153 [==============================] - 2s 13ms/step - loss: 0.2838 - accuracy: 0.9177 - val_loss: 0.3596 - val_accuracy: 0.8877
# # Epoch 7/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.2363 - accuracy: 0.9283 - val_loss: 0.3366 - val_accuracy: 0.8893
# # Epoch 8/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.2001 - accuracy: 0.9359 - val_loss: 0.3241 - val_accuracy: 0.8926
# # Epoch 9/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.1706 - accuracy: 0.9469 - val_loss: 0.3140 - val_accuracy: 0.8934
# # Epoch 10/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.1559 - accuracy: 0.9502 - val_loss: 0.3109 - val_accuracy: 0.8959
# # CPU times: user 1min 12s, sys: 6.5 s, total: 1min 18s
# # Wall time: 19 s
# Epoch 1/30
# 153/153 [==============================] - 11s 55ms/step - loss: 3.0357 - accuracy: 0.2030 - val_loss: 2.5830 - val_accuracy: 0.4385
# Epoch 2/30
# 153/153 [==============================] - 6s 39ms/step - loss: 1.9925 - accuracy: 0.5104 - val_loss: 1.3674 - val_accuracy: 0.6861
# Epoch 3/30
# 153/153 [==============================] - 7s 48ms/step - loss: 1.0896 - accuracy: 0.7141 - val_loss: 0.7781 - val_accuracy: 0.7967
# Epoch 4/30
# 153/153 [==============================] - 6s 38ms/step - loss: 0.6872 - accuracy: 0.8081 - val_loss: 0.5537 - val_accuracy: 0.8459
# Epoch 5/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.4933 - accuracy: 0.8587 - val_loss: 0.4498 - val_accuracy: 0.8713
# Epoch 6/30
# 153/153 [==============================] - 6s 41ms/step - loss: 0.3738 - accuracy: 0.8935 - val_loss: 0.3869 - val_accuracy: 0.8877
# Epoch 7/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.2988 - accuracy: 0.9144 - val_loss: 0.3549 - val_accuracy: 0.8902
# Epoch 8/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.2461 - accuracy: 0.9228 - val_loss: 0.3320 - val_accuracy: 0.8959
# Epoch 9/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.2037 - accuracy: 0.9412 - val_loss: 0.3136 - val_accuracy: 0.8975
# Epoch 10/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.1732 - accuracy: 0.9480 - val_loss: 0.3036 - val_accuracy: 0.9016
# Epoch 11/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.1539 - accuracy: 0.9504 - val_loss: 0.3005 - val_accuracy: 0.9000
# Epoch 12/30
# 153/153 [==============================] - 6s 38ms/step - loss: 0.1366 - accuracy: 0.9584 - val_loss: 0.2904 - val_accuracy: 0.9025
# Epoch 13/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.1221 - accuracy: 0.9613 - val_loss: 0.2998 - val_accuracy: 0.9025
# Epoch 14/30
# 153/153 [==============================] - 8s 54ms/step - loss: 0.1096 - accuracy: 0.9625 - val_loss: 0.2899 - val_accuracy: 0.9008
# Epoch 15/30
# 153/153 [==============================] - 7s 45ms/step - loss: 0.0985 - accuracy: 0.9697 - val_loss: 0.2936 - val_accuracy: 0.9025
# Epoch 16/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.0949 - accuracy: 0.9670 - val_loss: 0.2931 - val_accuracy: 0.9008
# Epoch 17/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0841 - accuracy: 0.9713 - val_loss: 0.2982 - val_accuracy: 0.9016
# Epoch 18/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.0769 - accuracy: 0.9744 - val_loss: 0.2974 - val_accuracy: 0.9049
# Epoch 19/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.0714 - accuracy: 0.9773 - val_loss: 0.3041 - val_accuracy: 0.8967
# Epoch 20/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.0670 - accuracy: 0.9783 - val_loss: 0.3038 - val_accuracy: 0.9016
# Epoch 21/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0636 - accuracy: 0.9766 - val_loss: 0.3071 - val_accuracy: 0.9049
# Epoch 22/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.0563 - accuracy: 0.9818 - val_loss: 0.3019 - val_accuracy: 0.9066
# Epoch 23/30
# 153/153 [==============================] - 6s 40ms/step - loss: 0.0523 - accuracy: 0.9836 - val_loss: 0.3086 - val_accuracy: 0.9066
# Epoch 24/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.0530 - accuracy: 0.9814 - val_loss: 0.3137 - val_accuracy: 0.9057
# Epoch 25/30
# 153/153 [==============================] - 7s 44ms/step - loss: 0.0463 - accuracy: 0.9834 - val_loss: 0.3090 - val_accuracy: 0.9041
# Epoch 26/30
# 153/153 [==============================] - 6s 40ms/step - loss: 0.0478 - accuracy: 0.9836 - val_loss: 0.3241 - val_accuracy: 0.9033
# Epoch 27/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.0436 - accuracy: 0.9836 - val_loss: 0.3196 - val_accuracy: 0.9041
# Epoch 28/30
# 153/153 [==============================] - 6s 42ms/step - loss: 0.0439 - accuracy: 0.9853 - val_loss: 0.3269 - val_accuracy: 0.9000
# Epoch 29/30
# 153/153 [==============================] - 8s 49ms/step - loss: 0.0398 - accuracy: 0.9855 - val_loss: 0.3288 - val_accuracy: 0.9033
# Epoch 30/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0373 - accuracy: 0.9861 - val_loss: 0.3301 - val_accuracy: 0.8984
# CPU times: user 4min 28s, sys: 8.96 s, total: 4min 36s
# Wall time: 4min 32s
loss = plt.plot(bin_history.epoch, bin_history.history['loss'], label='bin-loss')
plt.plot(bin_history.epoch, bin_history.history['val_loss'], '--', color=loss[0].get_color(), label='bin-val_loss')

loss = plt.plot(int_history.epoch, int_history.history['loss'], label='int-loss')
plt.plot(int_history.epoch, int_history.history['val_loss'], '--', color=loss[0].get_color(), label='int-val_loss')

plt.legend()
plt.xlabel('Epoch')
plt.ylabel('CE/token')
Text(0, 0.5, 'CE/token')
# Evaluate the model on the test set
int_test_loss, int_test_acc = int_model.evaluate(raw_test_ds)
## 87/87 [==============================] - 1s 11ms/step - loss: 0.2339 - accuracy: 0.9259
87/87 [==============================] - 1s 14ms/step - loss: 0.2118 - accuracy: 0.9259
# Print the test loss and accuracy
print("Test loss:", int_test_loss)
print("Test accuracy:", int_test_acc)
## Test loss: 0.23388110101222992
## Test accuracy: 0.9259259104728699
Test loss: 0.21181465685367584
Test accuracy: 0.9259259104728699
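A single accuracy number hides which classes get confused; a per-class confusion matrix (an added sketch, not run in the original notebook) makes the error structure visible.

# Added sketch: rows are true CWE labels, columns are predicted labels.
import numpy as np

y_true, y_pred = [], []
for texts, labels in raw_test_ds:
    logits = int_model.predict(texts, verbose=0)
    y_true.extend(labels.numpy())
    y_pred.extend(np.argmax(logits, axis=1))

print(tf.math.confusion_matrix(y_true, y_pred, num_classes=25))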
# binary_train_ds = raw_train_ds.map(lambda x,y: (binary_vectorize_layer(x), y))
clf_train_ds = raw_train_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))
clf_val_ds = raw_val_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))
clf_test_ds = raw_test_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))

int_train_ds = raw_train_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_val_ds = raw_val_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_test_ds = raw_test_ds.map(lambda x,y: (int_vectorize_layer(x), y))
clf_model.export('bin.tf')
Saved artifact at 'bin.tf'. The following endpoints are available:
* Endpoint 'serve'
args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None,), dtype=tf.string, name='text_vectorization_input')
Output Type:
TensorSpec(shape=(None, 25), dtype=tf.float32, name=None)
Captures:
139832539029648: TensorSpec(shape=(), dtype=tf.resource, name=None)
139832326372544: TensorSpec(shape=(), dtype=tf.int64, name=None)
139832326361280: TensorSpec(shape=(), dtype=tf.string, name=None)
139832326361632: TensorSpec(shape=(), dtype=tf.int64, name=None)
139832461989792: TensorSpec(shape=(), dtype=tf.resource, name=None)
139832329937216: TensorSpec(shape=(), dtype=tf.resource, name=None)
loaded = tf.saved_model.load('bin.tf')
clf_model.predict(['How do you sort a list?'])
1/1 [==============================] - 0s 145ms/step
array([[ 0.09657298, -0.49373376, 0.07736832, 0.06607725, 0.05221149,
-0.46978113, -0.48686236, -0.55163133, 0.15191314, 0.23564513,
-0.4884494 , -0.6964581 , 0.24588017, -0.74412024, -0.48752603,
0.3062069 , -0.8510176 , -1.0887347 , -0.4476996 , -0.05396813,
-0.66050076, -0.6878822 , -0.61148196, -0.7419138 , -0.70286953]],
dtype=float32)
loaded.serve(tf.constant(['How do you sort a list?'])).numpy()
array([[ 0.09657298, -0.49373376, 0.07736832, 0.06607725, 0.05221149,
-0.46978113, -0.48686236, -0.55163133, 0.15191314, 0.23564513,
-0.4884494 , -0.6964581 , 0.24588017, -0.74412024, -0.48752603,
0.3062069 , -0.8510176 , -1.0887347 , -0.4476996 , -0.05396813,
-0.66050076, -0.6878822 , -0.61148196, -0.7419138 , -0.70286953]],
dtype=float32)
clf_model.export('cpp_top_25_cwe_cnn_model.tf')
Saved artifact at 'cpp_top_25_cwe_cnn_model.tf'. The following endpoints are available:
* Endpoint 'serve'
args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None,), dtype=tf.string, name='text_vectorization_input')
Output Type:
TensorSpec(shape=(None, 25), dtype=tf.float32, name=None)
Captures:
139832539029648: TensorSpec(shape=(), dtype=tf.resource, name=None)
139832326372544: TensorSpec(shape=(), dtype=tf.int64, name=None)
139832326361280: TensorSpec(shape=(), dtype=tf.string, name=None)
139832326361632: TensorSpec(shape=(), dtype=tf.int64, name=None)
139832461989792: TensorSpec(shape=(), dtype=tf.resource, name=None)
139832329937216: TensorSpec(shape=(), dtype=tf.resource, name=None)
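To turn the exported model's raw logits into a readable prediction (an added sketch that assumes the `class_names` list captured earlier), apply a softmax and rank the classes:

# Added sketch: top-3 CWE classes for the sample inspected earlier.
probs = tf.nn.softmax(clf_model.predict(tf.expand_dims(first_TestCase, 0)))
top3 = tf.argsort(probs[0], direction='DESCENDING')[:3]
for i in top3.numpy():
    print(class_names[i], "prob:", float(probs[0][i]))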
## cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
!tar -czf cpp_top_25_cwe_cnn_model.tf.tar.gz cpp_top_25_cwe_cnn_model.tf
!ls -lh
total 21M
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 bin.tf
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
-rw-r--r-- 1 root root 756K Jan 7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 15K Jan 7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r-- 1 root root 878K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 52K Jan 7 10:36 model.png
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 test
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 train
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_top_25_cwe_cnn_model.tf.tar.gz
# !gdown 1j1nY2qlLnA_Iap0_ug8ZAQuDgKX1QIJB
from google.colab import files
# files.download("cpp_ready_8750_files_each_350_top_25_cwe_omitted.tar.gz")
# files.download("cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz")
"cpp_top_25_cwe_cnn_model.tf.tar.gz") files.download(
<IPython.core.display.Javascript object>
<IPython.core.display.Javascript object>
# ## extract or un-tar (unzip):
# !tar -xzf cpp_top_25_cwe_cnn_model.tf.tar.gz
# ## 1st ...
# from google.colab import drive
# drive.mount('/content/drive')
# !mkdir tar_gz_files
# !cp cpp_top_25_cwe_cnn_model.tf.tar.gz tar_gz_files
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
print("Tensorlfow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.test.is_gpu_available() else "NOT AVAILABLE")
## Tensorlfow version: 2.13.1
## Eager mode: True
## GPU is NOT AVAILABLE
## Tensorlfow version: 2.15.0
## Eager mode: True
## T4 GPU is available
## High RAM
Tensorlfow version: 2.15.0
Eager mode: True
GPU is NOT AVAILABLE
%%time
!tar -czf 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz /content/*
## CPU times: user 611 ms, sys: 77.2 ms, total: 688 ms
## Wall time: 1min 25s
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
CPU times: user 20.4 ms, sys: 3.04 ms, total: 23.4 ms
Wall time: 1.51 s
!ls -lh
total 31M
-rw-r--r-- 1 root root 9.9M Jan 7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 bin.tf
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
-rw-r--r-- 1 root root 756K Jan 7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 15K Jan 7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r-- 1 root root 878K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 52K Jan 7 10:36 model.png
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 test
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 train
## import time
## global_start = time.time()
global_end = time.time()
# print("[T4 GPU & High RAM?] Global Time Duration: " + str(global_end - global_start))
print("Global Time Duration: " + str(global_end - global_start))
## [T4 GPU & High RAM] Global Time Duration: 643
## 643s / 60 is approximately 11 minutes
## 8 mins on CPU & High RAM
## Global Time Duration: 401.64417576789856
Global Time Duration: 401.64417576789856
%%time
## > 1GB !
# from google.colab import files
# files.download("1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz")
##
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.58 µs
As inspired by our own Prof. Sadawi's Lecture Notes, Course-Work Template** & related Notebooks, while further extending the official TF2 documentation.
** Sadawi, N. (2021). Nsadawi/Advanced-ML-Projects Jupyter Notebook (Original work published 2020) [dc88adb9c256ae34c381a0d3533a586c5906aac8].
!wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
--2024-01-07 10:38:18-- https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
Resolving samate.nist.gov (samate.nist.gov)... 129.6.13.19, 2610:20:6005:13::19
Connecting to samate.nist.gov (samate.nist.gov)|129.6.13.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 703487355 (671M) [application/zip]
Saving to: ‘2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip’
2022-08-11-juliet-c 100%[===================>] 670.90M 97.6MB/s in 5.5s
2024-01-07 10:38:24 (123 MB/s) - ‘2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip’ saved [703487355/703487355]
%%time
!unzip 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip -d data # /content/data
## CPU times: user 26.5 s, sys: 5.38 s, total: 31.8 s
## Wall time: 4min 3s
# Streaming output truncated to the last 5000 lines.
# inflating: data/97419-v1.0.0/src/testcases/CWE36_Absolute_Path_Traversal/s04/CWE36_Absolute_Path_Traversal__wchar_t_environment_w32CreateFile_10.cpp
# creating: data/247234-v2.0.0/
# inflating: data/247234-v2.0.0/manifest.sarif
# inflating: data/247234-v2.0.0/Dockerfile
# inflating: data/247234-v2.0.0/Makefile
# extracting: data/247234-v2.0.0/.dockerignore
# extracting: data/247234-v2.0.0/.gitignore
# creating: data/247234-v2.0.0/src/
# creating: data/247234-v2.0.0/src/testcasesupport/
# inflating: data/247234-v2.0.0/src/testcasesupport/std_testcase_io.h
# inflating: data/247234-v2.0.0/src/testcasesupport/io.c
# inflating: data/247234-v2.0.0/src/testcasesupport/std_testcase.h
# creating: data/247234-v2.0.0/src/testcases/
# creating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/
# creating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/s06/
# inflating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/s06/CWE78_OS_Command_Injection__wchar_t_console_w32_spawnvp_34.c
# creating: data/111814-v1.0.0/
# ...
# ...
# Open the zip file
with zipfile.ZipFile('2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip', 'r') as zip_ref:
    # Get a list of all file names in the zip file
    all_files = zip_ref.namelist()
    # Calculate 1% of the total number of files
    one_percent_files = int(len(all_files) * 0.01)
    # Get the last 1% of the files
    last_one_percent_files = all_files[-one_percent_files:]
    # Extract only the last 1% of the files
    for file in last_one_percent_files:
        zip_ref.extract(file, '/content/data')

print("Extracted the last 1% of files to /content/data")
Extracted the last 1% of files to /content/data
# Initialize an empty list to store the file details
df_files = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/data'):
    for filename in files:
        # Get the file address
        file_address = os.path.join(dirpath, filename)
        # Append the file name and address to the list
        df_files.append([filename, file_address])

# Convert the list to a dataframe
df_f = pd.DataFrame(df_files, columns=['File_Name', 'File_Address'])

# Save the dataframe to a csv file
df_f.to_csv('files_tree.csv', index=False)
# Initialize an empty list to store the file details
cpp_list = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/data'):
    for filename in files:
        # Only process files with the '.cpp' extension
        if filename.endswith('.cpp'):
            # Get the file address
            file_address = os.path.join(dirpath, filename)
            # Append the file name and address to the list
            cpp_list.append([filename, file_address])

# Convert the list to a dataframe
cpp_df = pd.DataFrame(cpp_list, columns=['File_Name', 'File_Address'])

# Save the dataframe to a csv file
cpp_df.to_csv('cpp_files_tree.csv', index=False)
# cpp_list[0]
cpp_df.shape ## (411, 2) ## (46401, 2) ## (618185, 2)
# cpp_df.head()
(46401, 2)
# df_files.shape
cpp_df.head()
| | File_Name | File_Address |
|---|---|---|
| 0 | CWE36_Absolute_Path_Traversal__wchar_t_connect... | /content/data/96809-v1.0.0/src/testcases/CWE36... |
| 1 | CWE590_Free_Memory_Not_on_Heap__delete_struct_... | /content/data/107819-v1.0.0/src/testcases/CWE5... |
| 2 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 3 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 4 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
# !mkdir /content/cpp_data
!mkdir -p /content/cpp_data
%%time
# !mkdir -p /content/cpp_data
!awk -F ',' '{if (NR!=1) print $2}' cpp_files_tree.csv | xargs -I {} cp {} /content/cpp_data/
## CPU times: user 417 ms, sys: 54.4 ms, total: 472 ms
## Wall time: 1min 10s
CPU times: user 929 ms, sys: 94 ms, total: 1.02 s
Wall time: 2min
# !mv /content/cpp_files_all_raw.zip /content/cpp_files_all_raw_partial.zip
# Initialize an empty set to store the unique prefixes
subfolders_set = set()

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_data'):
    for filename in files:
        # Extract the prefix from the file name
        prefix = re.match(r'(.*)__', filename)
        if prefix:
            # Add the prefix to the set
            subfolders_set.add(prefix.group(1))

# Convert the set to a list
subfolders_list = list(subfolders_set)

print(subfolders_list)
['CWE675_Duplicate_Operations_on_Resource', 'CWE758_Undefined_Behavior', 'CWE123_Write_What_Where_Condition', 'CWE590_Free_Memory_Not_on_Heap', 'CWE690_NULL_Deref_From_Return', 'CWE126_Buffer_Overread', 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE773_Missing_Reference_to_Active_File_Descriptor_or_Handle', 'CWE563_Unused_Variable', 'CWE606_Unchecked_Loop_Condition', 'CWE127_Buffer_Underread', 'CWE400_Resource_Exhaustion', 'CWE114_Process_Control', 'CWE762_Mismatched_Memory_Management_Routines', 'CWE416_Use_After_Free', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE390_Error_Without_Action', 'CWE591_Sensitive_Data_Storage_in_Improperly_Locked_Memory', 'CWE396_Catch_Generic_Exception', 'CWE789_Uncontrolled_Mem_Alloc', 'CWE500_Public_Static_Field_Not_Final', 'CWE23_Relative_Path_Traversal', 'CWE476_NULL_Pointer_Dereference', 'CWE134_Uncontrolled_Format_String', 'CWE672_Operation_on_Resource_After_Expiration_or_Release', 'CWE562_Return_of_Stack_Variable_Address', 'CWE256_Plaintext_Storage_of_Password', 'CWE319_Cleartext_Tx_Sensitive_Info', 'CWE197_Numeric_Truncation_Error', 'CWE194_Unexpected_Sign_Extension', 'CWE397_Throw_Generic_Exception', 'CWE415_Double_Free', 'CWE90_LDAP_Injection', 'CWE457_Use_of_Uninitialized_Variable', 'CWE843_Type_Confusion', 'CWE369_Divide_by_Zero', 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE176_Improper_Handling_of_Unicode_Encoding', 'CWE404_Improper_Resource_Shutdown', 'CWE427_Uncontrolled_Search_Path_Element', 'CWE190_Integer_Overflow', 'CWE321_Hard_Coded_Cryptographic_Key', 'CWE122_Heap_Based_Buffer_Overflow', 'CWE468_Incorrect_Pointer_Scaling', 'CWE588_Attempt_to_Access_Child_of_Non_Structure_Pointer', 'CWE426_Untrusted_Search_Path', 'CWE401_Memory_Leak', 'CWE78_OS_Command_Injection', 'CWE124_Buffer_Underwrite', 'CWE191_Integer_Underflow', 'CWE775_Missing_Release_of_File_Descriptor_or_Handle', 'CWE259_Hard_Coded_Password', 'CWE665_Improper_Initialization', 'CWE676_Use_of_Potentially_Dangerous_Function', 'CWE15_External_Control_of_System_or_Configuration_Setting', 'CWE440_Expected_Behavior_Violation', 'CWE464_Addition_of_Data_Structure_Sentinel', 'CWE617_Reachable_Assertion', 'CWE36_Absolute_Path_Traversal', 'CWE761_Free_Pointer_Not_at_Start_of_Buffer']
num_items = len(subfolders_list)
print(num_items)
## 42 ## 60
60
# Initialize an empty list to store the prefixes
prefixes = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_data'):
    for filename in files:
        # Extract the prefix from the file name
        prefix = re.match(r'(.*)__', filename)
        if prefix:
            # Add the prefix to the list
            prefixes.append(prefix.group(1))

# Count the number of files for each prefix
prefix_counts = collections.Counter(prefixes)

# Convert the counter to a dataframe
cpp_folders_count_df = pd.DataFrame.from_dict(prefix_counts, orient='index').reset_index()
cpp_folders_count_df.columns = ['CWE-ID', 'Files-Count']
print(cpp_folders_count_df)
CWE-ID Files-Count
0 CWE23_Relative_Path_Traversal 3900
1 CWE121_Stack_Based_Buffer_Overflow 1965
2 CWE426_Untrusted_Search_Path 88
3 CWE401_Memory_Leak 1476
4 CWE762_Mismatched_Memory_Management_Routines 6092
5 CWE690_NULL_Deref_From_Return 440
6 CWE415_Double_Free 1308
7 CWE122_Heap_Based_Buffer_Overflow 5974
8 CWE126_Buffer_Overread 912
9 CWE36_Absolute_Path_Traversal 3900
10 CWE127_Buffer_Underread 1416
11 CWE400_Resource_Exhaustion 390
12 CWE134_Uncontrolled_Format_String 1560
13 CWE124_Buffer_Underwrite 1416
14 CWE590_Free_Memory_Not_on_Heap 3321
15 CWE789_Uncontrolled_Mem_Alloc 1080
16 CWE78_OS_Command_Injection 2200
17 CWE190_Integer_Overflow 1404
18 CWE476_NULL_Pointer_Dereference 158
19 CWE195_Signed_to_Unsigned_Conversion_Error 528
20 CWE114_Process_Control 264
21 CWE617_Reachable_Assertion 132
22 CWE680_Integer_Overflow_to_Buffer_Overflow 600
23 CWE416_Use_After_Free 370
24 CWE606_Unchecked_Loop_Condition 260
25 CWE90_LDAP_Injection 220
26 CWE675_Duplicate_Operations_on_Resource 104
27 CWE457_Use_of_Uninitialized_Variable 463
28 CWE191_Integer_Underflow 858
29 CWE404_Improper_Resource_Shutdown 176
30 CWE563_Unused_Variable 388
31 CWE676_Use_of_Potentially_Dangerous_Function 18
32 CWE591_Sensitive_Data_Storage_in_Improperly_Lo... 44
33 CWE369_Divide_by_Zero 468
34 CWE427_Uncontrolled_Search_Path_Element 220
35 CWE761_Free_Pointer_Not_at_Start_of_Buffer 264
36 CWE194_Unexpected_Sign_Extension 528
37 CWE390_Error_Without_Action 18
38 CWE197_Numeric_Truncation_Error 396
39 CWE588_Attempt_to_Access_Child_of_Non_Structur... 76
40 CWE773_Missing_Reference_to_Active_File_Descri... 66
41 CWE396_Catch_Generic_Exception 54
42 CWE775_Missing_Release_of_File_Descriptor_or_H... 66
43 CWE665_Improper_Initialization 90
44 CWE321_Hard_Coded_Cryptographic_Key 44
45 CWE15_External_Control_of_System_or_Configurat... 22
46 CWE758_Undefined_Behavior 216
47 CWE319_Cleartext_Tx_Sensitive_Info 104
48 CWE672_Operation_on_Resource_After_Expiration_... 81
49 CWE843_Type_Confusion 26
50 CWE123_Write_What_Where_Condition 66
51 CWE256_Plaintext_Storage_of_Password 52
52 CWE259_Hard_Coded_Password 44
53 CWE464_Addition_of_Data_Structure_Sentinel 22
54 CWE176_Improper_Handling_of_Unicode_Encoding 26
55 CWE397_Throw_Generic_Exception 20
56 CWE468_Incorrect_Pointer_Scaling 1
57 CWE440_Expected_Behavior_Violation 1
58 CWE500_Public_Static_Field_Not_Final 2
59 CWE562_Return_of_Stack_Variable_Address 1
# cpp_cwe_top_25_df = cpp_folders_count_df.nlargest(25, 'Files-Count')
cpp_cwe_top_10_df = cpp_folders_count_df.nlargest(10, 'Files-Count')
print(cpp_cwe_top_10_df)
CWE-ID Files-Count
4 CWE762_Mismatched_Memory_Management_Routines 6092
7 CWE122_Heap_Based_Buffer_Overflow 5974
0 CWE23_Relative_Path_Traversal 3900
9 CWE36_Absolute_Path_Traversal 3900
14 CWE590_Free_Memory_Not_on_Heap 3321
16 CWE78_OS_Command_Injection 2200
1 CWE121_Stack_Based_Buffer_Overflow 1965
12 CWE134_Uncontrolled_Format_String 1560
3 CWE401_Memory_Leak 1476
10 CWE127_Buffer_Underread 1416
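A quick bar chart (an added sketch, not from the original run) makes the imbalance across the top-10 classes visible before the downsampling step below:

# Added sketch: visualize the per-class file counts.
cpp_cwe_top_10_df.plot.bar(x='CWE-ID', y='Files-Count', legend=False)
plt.tight_layout()
plt.show()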
# %%time
# Create the root folder
os.makedirs('/content/cpp_folders', exist_ok=True)

# For each CWE-ID in the top 10/25
for index, row in cpp_cwe_top_10_df.iterrows():
    cwe_id = row['CWE-ID']
    files_count = row['Files-Count']

    # If the files count is greater than 10/350
    # if files_count > 350:
    if files_count > 10:
        # Create a subfolder for the CWE-ID
        os.makedirs(f'/content/cpp_folders/{cwe_id}', exist_ok=True)

        # Get the list of files for the CWE-ID
        files = [f for f in os.listdir('/content/cpp_data') if f.startswith(cwe_id + '__')]

        # Copy the initial 10/350 files
        # for file in files[:350]:
        for file in files[:10]:
            shutil.copy(os.path.join('/content/cpp_data', file), f'/content/cpp_folders/{cwe_id}')
## 2s
## CPU times: user 745 ms, sys: 1.5 s, total: 2.25 s
## Wall time: 2.27 s
# %%time
# Initialize an empty list to store the file details
file_details = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_folders'):
    for filename in files:
        # Extract the CWE-ID from the directory path
        cwe_id = os.path.basename(dirpath)
        # Extract the file short name from the file name
        file_short_name = re.match(r'.*__(.*)\.cpp', filename)
        if file_short_name:
            file_short_name = file_short_name.group(1)
            # Append the file details to the list
            file_details.append([cwe_id, file_short_name, filename])

# Convert the list to a dataframe
cpp_cwe_top_10_list_df = pd.DataFrame(file_details, columns=['CWE-ID', 'File-Short-Name', 'File-Full-Name'])
print(cpp_cwe_top_10_list_df.head())
## CPU times: user 43.9 ms, sys: 3.43 ms, total: 47.3 ms
## Wall time: 49.4 ms
CWE-ID File-Short-Name \
0 CWE122_Heap_Based_Buffer_Overflow cpp_CWE805_wchar_t_loop_84_bad
1 CWE122_Heap_Based_Buffer_Overflow cpp_CWE806_char_snprintf_81_goodG2B
2 CWE122_Heap_Based_Buffer_Overflow c_src_char_cpy_72b
3 CWE122_Heap_Based_Buffer_Overflow c_CWE129_fgets_81_bad
4 CWE122_Heap_Based_Buffer_Overflow cpp_CWE805_wchar_t_memmove_68b
File-Full-Name
0 CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...
1 CWE122_Heap_Based_Buffer_Overflow__cpp_CWE806_...
2 CWE122_Heap_Based_Buffer_Overflow__c_src_char_...
3 CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fg...
4 CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...
# cpp_cwe_top_10_list_df = cpp_cwe_top_10_350_files_list_df
cpp_cwe_top_10_list_df.shape ## (100, 3)
(100, 3)
# print(cpp_cwe_top_10_350_files_list_df) ## [100 rows x 3 columns] ## [8750 rows x 3 columns]
# 8750/350 ## 25
# print(cpp_cwe_top_25_350_files_list_df) ## [8750 rows x 3 columns]
cpp_cwe_top_10_list_df ## 100 rows × 3 columns ## [8750 rows x 3 columns]
| | CWE-ID | File-Short-Name | File-Full-Name |
|---|---|---|---|
| 0 | CWE122_Heap_Based_Buffer_Overflow | cpp_CWE805_wchar_t_loop_84_bad | CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_... |
| 1 | CWE122_Heap_Based_Buffer_Overflow | cpp_CWE806_char_snprintf_81_goodG2B | CWE122_Heap_Based_Buffer_Overflow__cpp_CWE806_... |
| 2 | CWE122_Heap_Based_Buffer_Overflow | c_src_char_cpy_72b | CWE122_Heap_Based_Buffer_Overflow__c_src_char_... |
| 3 | CWE122_Heap_Based_Buffer_Overflow | c_CWE129_fgets_81_bad | CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fg... |
| 4 | CWE122_Heap_Based_Buffer_Overflow | cpp_CWE805_wchar_t_memmove_68b | CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_... |
| ... | ... | ... | ... |
| 95 | CWE134_Uncontrolled_Format_String | char_file_w32_vsnprintf_82a | CWE134_Uncontrolled_Format_String__char_file_w... |
| 96 | CWE134_Uncontrolled_Format_String | char_environment_vfprintf_84_goodB2G | CWE134_Uncontrolled_Format_String__char_enviro... |
| 97 | CWE134_Uncontrolled_Format_String | wchar_t_file_printf_84_bad | CWE134_Uncontrolled_Format_String__wchar_t_fil... |
| 98 | CWE134_Uncontrolled_Format_String | wchar_t_environment_w32_vsnprintf_84a | CWE134_Uncontrolled_Format_String__wchar_t_env... |
| 99 | CWE134_Uncontrolled_Format_String | wchar_t_connect_socket_printf_74b | CWE134_Uncontrolled_Format_String__wchar_t_con... |

100 rows × 3 columns
print(cpp_cwe_top_10_list_df['File-Full-Name'][2]) ## CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fgets_62a.cpp
CWE122_Heap_Based_Buffer_Overflow__c_src_char_cpy_72b.cpp
# cpp_cwe_top_25_350_files_list_df['File-Full-Name'] ## [8750 rows x 3 columns]
print(cpp_cwe_top_10_list_df['File-Full-Name'][2]) ## [8750 rows x 3 columns]
CWE122_Heap_Based_Buffer_Overflow__c_src_char_cpy_72b.cpp
cpp_files_top_10_cwe_list_df = cpp_cwe_top_10_list_df.copy()
!cp -r cpp_folders cpp_files_top_10_cwe
!zip cpp_files_top_10_cwe.zip cpp_files_top_10_cwe
adding: cpp_files_top_10_cwe/ (stored 0%)
!mv cpp_files_top_10_cwe.zip temp.zip
!tar -czf cpp_files_top_10_cwe.tar.gz cpp_files_top_10_cwe
!ls -lh
total 767M
-rw-r--r-- 1 root root 9.9M Jan 7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
-rw-r--r-- 1 root root 671M Aug 11 2022 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 bin.tf
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
-rw-r--r-- 1 root root 756K Jan 7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 15K Jan 7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x 2 root root 4.6M Jan 7 10:46 cpp_data
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_files_top_10_cwe
-rw-r--r-- 1 root root 29K Jan 7 10:46 cpp_files_top_10_cwe.tar.gz
-rw-r--r-- 1 root root 9.1M Jan 7 10:44 cpp_files_tree.csv
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_folders
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r-- 1 root root 878K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
drwxr-xr-x 64101 root root 2.1M Jan 7 10:43 data
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 49M Jan 7 10:44 files_tree.csv
-rw-r--r-- 1 root root 52K Jan 7 10:36 model.png
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
-rw-r--r-- 1 root root 192 Jan 7 10:46 temp.zip
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 test
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 train
# !tar -czf cpp_files_top_10_cwe.tar.gz cpp_files_top_10_cwe
# # %%time
# # !tar -czf cpp_files_raw.tar.gz -C /content/cpp_data .
# !tar -czf cpp_all_files_raw.tar.gz -C /content/cpp_data .
# ## CPU times: user 36.3 ms, sys: 3.97 ms, total: 40.3 ms
# ## Wall time: 4.42 s
!ls -lh
total 767M
-rw-r--r-- 1 root root 9.9M Jan 7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
-rw-r--r-- 1 root root 671M Aug 11 2022 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 bin.tf
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
-rw-r--r-- 1 root root 756K Jan 7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 15K Jan 7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x 2 root root 4.6M Jan 7 10:46 cpp_data
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_files_top_10_cwe
-rw-r--r-- 1 root root 29K Jan 7 10:46 cpp_files_top_10_cwe.tar.gz
-rw-r--r-- 1 root root 9.1M Jan 7 10:44 cpp_files_tree.csv
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_folders
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r-- 1 root root 878K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
drwxr-xr-x 64101 root root 2.1M Jan 7 10:43 data
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 49M Jan 7 10:44 files_tree.csv
-rw-r--r-- 1 root root 52K Jan 7 10:36 model.png
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
-rw-r--r-- 1 root root 192 Jan 7 10:46 temp.zip
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 test
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 train
## extract or un-tar (unzip):
# !tar -xzf cpp_files_raw.tar.gz
# Copy the cpp_folders directory to cpp_clean_folders
shutil.copytree('cpp_folders', 'cpp_clean_folders')
# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_folders'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Get the file path
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r') as file:
                file_data = file.read()
            # Remove the comments
            file_data = re.sub(r'/\*.*?\*/', '', file_data, flags=re.DOTALL)
            with open(file_path, 'w') as file:
                file.write(file_data)
!cp -r cpp_clean_folders cpp_clean_files_top_10_cwe
!tar -czf cpp_clean_files_top_10_cwe.tar.gz cpp_clean_files_top_10_cwe
# Copy the cpp_data directory to cpp_clean_data
shutil.copytree('cpp_data', 'cpp_clean_data')

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_data'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Get the file path
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r') as file:
                file_data = file.read()
            # Remove the comments
            file_data = re.sub(r'/\*.*?\*/', '', file_data, flags=re.DOTALL)
            with open(file_path, 'w') as file:
                file.write(file_data)
# # %%time
# # !tar -czf cpp_files_raw.tar.gz -C /content/cpp_data .
# !tar -czf cpp_all_files_clean.tar.gz -C /content/cpp_clean_data .
# ## CPU times: user 36.3 ms, sys: 3.97 ms, total: 40.3 ms
# ## Wall time: 4.42 s
!cp cpp_all_files_clean.tar.gz cpp_all_clean_files_w_top_10_cwe_titles.tar.gz
cp: cannot stat 'cpp_all_files_clean.tar.gz': No such file or directory
!mkdir tar_gz_files
!cp *.gz tar_gz_files
# from google.colab import drive
# drive.mount('/content/drive')
## /content/tar_gz_files/cpp_clean_files_top_10_cwe.tar.gz
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
# Copy the cpp_clean_folders directory to cpp_clean_omitted_cwe_folders
shutil.copytree('cpp_clean_folders', 'cpp_clean_omitted_cwe_folders')

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_omitted_cwe_folders'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Extract the CWE-ID (the title before the '__') from the file name
            cwe_id = re.match(r'(.*?)__', filename)
            if cwe_id:
                cwe_id = cwe_id.group(1)
                # Get the file path
                file_path = os.path.join(dirpath, filename)
                with open(file_path, 'r') as file:
                    file_data = file.read()
                # Remove the CWE-ID prefix plus the first word after '__' from the file data
                file_data = re.sub(cwe_id + '__' + r'\w+?_', '', file_data)
                with open(file_path, 'w') as file:
                    file.write(file_data)
                # Remove the CWE-ID prefix plus the first word after '__' from the file name
                new_filename = re.sub(cwe_id + '__' + r'\w+?_', '', filename)
                os.rename(file_path, os.path.join(dirpath, new_filename))
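As a quick sanity check, the omission pattern applied to a file name from this dataset (the same path quoted further below) behaves as follows:

import re

filename = 'CWE121_Stack_Based_Buffer_Overflow__CWE129_connect_socket_62a.cpp'
cwe_id = re.match(r'(.*?)__', filename).group(1)
# Strips the CWE title prefix plus the first word after '__' (here 'CWE129_')
print(re.sub(cwe_id + '__' + r'\w+?_', '', filename))
## connect_socket_62a.cpp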
!cp -r cpp_clean_omitted_cwe_folders cpp_clean_files_top_10_cwe_omitted
!tar -czf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted
!cp cpp_clean_files_top_10_cwe_omitted.tar.gz tar_gz_files
# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
# from google.colab import files
# files.download("cpp_clean_files_top_10_cwe_omitted.tar.gz")
# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !gdown 1YQHdd457W4NjuTvJYiucKUr8pRwbGulj
# /content/cpp_clean_folders/CWE121_Stack_Based_Buffer_Overflow/CWE121_Stack_Based_Buffer_Overflow__CWE129_connect_socket_62a.cpp
## cpp_files_all_raw.zip
%%time
# # # !cd /content/cpp_data && zip /content/cpp_data.zip *.cpp
# # # !cd /content/cpp_data && zip /content/cpp_files_raw.zip *
# # !zip cpp_files_raw.zip /content/cpp_data/*
# # >>>
# # /bin/bash: line 1: /usr/bin/zip: Argument list too long
# # !cd /content/cpp_data && find . -type f -name "*.cpp" -exec zip /content/cpp_files_raw.zip {} \;
# !cd /content/cpp_data && find . -type f -exec zip /content/cpp_files_all_raw.zip {} \;
# ##
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.25 µs
# # !ls /content/cpp_data/*.cpp | wc -l ## Argument list too long
# !ls /content/cpp_data/ | wc -l ## 411 ## 46399
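The 'Argument list too long' failures above come from the shell expanding tens of thousands of file names; `find -exec` avoids the limit but re-invokes `zip` once per file. A standard-library `zipfile` sketch (assuming the same `/content/cpp_data` layout) sidesteps both problems in a single process:

import os
import zipfile

src_dir = '/content/cpp_data'
with zipfile.ZipFile('/content/cpp_files_all_raw.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    for dirpath, _, files in os.walk(src_dir):
        for fname in files:
            fpath = os.path.join(dirpath, fname)
            # Store paths relative to src_dir, mirroring `cd ... && find .`
            zf.write(fpath, os.path.relpath(fpath, src_dir))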
# 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip cpp_data
# cpp_all_clean_files_w_top_10_cwe_titles.tar.gz cpp_files_all_raw.zip
# cpp_all_files_clean.tar.gz cpp_files_top_10_cwe
# cpp_all_files_raw.tar.gz cpp_files_top_10_cwe.tar.gz
# cpp_clean_data cpp_files_tree.csv
# cpp_clean_files_top_10_cwe cpp_folders
# cpp_clean_files_top_10_cwe_omitted data
# cpp_clean_files_top_10_cwe_omitted.tar.gz files_tree.csv
# cpp_clean_files_top_10_cwe.tar.gz sample_data
# cpp_clean_folders tar_gz_files
# cpp_clean_omitted_cwe_folders temp.zip
# !ls
cpp_list[1]
# cpp_df[:1]
['CWE590_Free_Memory_Not_on_Heap__delete_struct_declare_04.cpp',
 '/content/data/107819-v1.0.0/src/testcases/CWE590_Free_Memory_Not_on_Heap/s03/CWE590_Free_Memory_Not_on_Heap__delete_struct_declare_04.cpp']
# cpp_list[0]
# cpp_df[:1]
cpp_df[1:]
|  | File_Name | File_Address |
| --- | --- | --- |
| 1 | CWE590_Free_Memory_Not_on_Heap__delete_struct_... | /content/data/107819-v1.0.0/src/testcases/CWE5... |
| 2 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 3 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 4 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 5 | CWE773_Missing_Reference_to_Active_File_Descri... | /content/data/116798-v1.0.0/src/testcases/CWE7... |
| ... | ... | ... |
| 46396 | CWE134_Uncontrolled_Format_String__wchar_t_fil... | /content/data/81532-v1.0.0/src/testcases/CWE13... |
| 46397 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46398 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46399 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46400 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |

46400 rows × 2 columns
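For reference, a (File_Name, File_Address) tree like `cpp_list`/`cpp_df` can be assembled along the lines below; this is a sketch consistent with the outputs shown, not necessarily the exact cell used earlier in the notebook:

import os
import pandas as pd

# Walk the extracted SARD tree and collect (name, path) pairs for .cpp files.
cpp_list = []
for dirpath, _, files in os.walk('/content/data'):
    for fname in files:
        if fname.endswith('.cpp'):
            cpp_list.append([fname, os.path.join(dirpath, fname)])

cpp_df = pd.DataFrame(cpp_list, columns=['File_Name', 'File_Address'])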
# !gdown --id 1lvT5f-jOADyy2gjjnirzwyMSgNsf9Bz9
# !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
# ## 2022 Juliet C/C++ 1.3.1 with extra support
# https://samate.nist.gov/SARD/test-suites/116
# https://samate.nist.gov/SARD/test-suites/112
# 1st-C6AI-AIxCC-NLP-Clf-Vulns-Threat-CWE-IDs @ NIST SARD Juliet C++
# https://samate.nist.gov/SARD/test-suites
# from google.colab import drive
# drive.mount('/content/drive')
# !cp 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip /content/drive/MyDrive/1st-SHARED-Data
# !rm -r sample_data
# !mkdir data
# # !gdown --id 1lvT5f-jOADyy2gjjnirzwyMSgNsf9Bz9
# # !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
# from google.colab import drive
# drive.mount('/content/drive')
# # !cp 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip /content/drive/MyDrive/1st-SHARED-Data
# !cp /content/drive/MyDrive/1st-SHARED-Data/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip .
cpp_df
|  | File_Name | File_Address |
| --- | --- | --- |
| 0 | CWE36_Absolute_Path_Traversal__wchar_t_connect... | /content/data/96809-v1.0.0/src/testcases/CWE36... |
| 1 | CWE590_Free_Memory_Not_on_Heap__delete_struct_... | /content/data/107819-v1.0.0/src/testcases/CWE5... |
| 2 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 3 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| 4 | CWE23_Relative_Path_Traversal__char_connect_so... | /content/data/89720-v1.0.0/src/testcases/CWE23... |
| ... | ... | ... |
| 46396 | CWE134_Uncontrolled_Format_String__wchar_t_fil... | /content/data/81532-v1.0.0/src/testcases/CWE13... |
| 46397 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46398 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46399 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |
| 46400 | CWE134_Uncontrolled_Format_String__char_consol... | /content/data/79670-v1.0.0/src/testcases/CWE13... |

46401 rows × 2 columns
!ls -lh cpp_clean_files_top_10_cwe_omitted ## /content/cpp_clean_files_top_10_cwe_omitted
total 44K
drwxr-xr-x 12 root root 4.0K Jan 7 10:47 cpp_clean_omitted_cwe_folders
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE121_Stack_Based_Buffer_Overflow
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE122_Heap_Based_Buffer_Overflow
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE127_Buffer_Underread
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE134_Uncontrolled_Format_String
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE23_Relative_Path_Traversal
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE36_Absolute_Path_Traversal
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE401_Memory_Leak
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE415_Double_Free
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE590_Free_Memory_Not_on_Heap
drwxr-xr-x 2 root root 4.0K Jan 3 17:15 CWE762_Mismatched_Memory_Management_Routines
!ls -lh
total 773M
-rw-r--r-- 1 root root 9.9M Jan 7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
-rw-r--r-- 1 root root 671M Aug 11 2022 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 bin.tf
-rw-r--r-- 1 root root 339K Jan 7 10:33 bow_transformer.pk
-rw-r--r-- 1 root root 756K Jan 7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 2 root root 6.5M Jan 7 10:46 cpp_clean_data
drwxr-xr-x 27 root root 4.0K Jan 2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r-- 1 root root 875K Jan 7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_clean_files_top_10_cwe
drwxr-xr-x 13 root root 4.0K Jan 7 10:47 cpp_clean_files_top_10_cwe_omitted
-rw-r--r-- 1 root root 31K Jan 7 10:47 cpp_clean_files_top_10_cwe_omitted.tar.gz
-rw-r--r-- 1 root root 18K Jan 7 10:46 cpp_clean_files_top_10_cwe.tar.gz
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_clean_folders
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_clean_omitted_cwe_folders
drwxr-xr-x 2 root root 4.6M Jan 7 10:46 cpp_data
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_files_top_10_cwe
-rw-r--r-- 1 root root 29K Jan 7 10:46 cpp_files_top_10_cwe.tar.gz
-rw-r--r-- 1 root root 9.1M Jan 7 10:44 cpp_files_tree.csv
drwxr-xr-x 12 root root 4.0K Jan 7 10:46 cpp_folders
drwxr-xr-x 4 root root 4.0K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r-- 1 root root 878K Jan 7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
drwxr-xr-x 64101 root root 2.1M Jan 7 10:43 data
-rw-r--r-- 1 root root 11M Jan 7 10:32 data_drop_na.csv
-rw-r--r-- 1 root root 49M Jan 7 10:44 files_tree.csv
-rw-r--r-- 1 root root 52K Jan 7 10:36 model.png
-rw-r--r-- 1 root root 4.9M Jan 7 10:35 nb_model.pk
-rw-r--r-- 1 root root 2.9M Jan 7 10:35 sard.zip
drwxr-xr-x 2 root root 4.0K Jan 7 10:47 tar_gz_files
-rw-r--r-- 1 root root 192 Jan 7 10:46 temp.zip
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 test
-rw-r--r-- 1 root root 201K Jan 7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan 7 10:35 train
## import time
## global_start = time.time()
global_end = time.time()

# print("[T4 GPU & High RAM?] Global Time Duration: " + str(global_end - global_start))
print("Global Time Duration: " + str(global_end - global_start))
## [T4 GPU & High RAM] Global Time Duration: 643
## 643s / 60 is approximately 11 minutes
## 8 mins on CPU & High RAM
## Global Time Duration: 940.3888688087463
Global Time Duration: 1051.4102718830109
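## 1051s / 60 is approximately 17.5 minutes on this run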
%%time
## > 1GB !
# from google.colab import files
# files.download("1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz")
##
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 7.87 µs