# 210107005_UoL_DSM140_NLP_Text_Classification_CW_Sub_v240107wk.ipynb
# Commentable @ https://colab.research.google.com/drive/1kUTphSV9lHhbu_HT_tvffIPEtFWpFPIg?usp=sharing

1st UoL DSM140 CW - Report

I. Introduction

1. Introduction to the domain-specific area

The domain-specific area of interest is the application of AI and machine learning techniques for Static Application Security Testing (SAST) and vulnerability detection in critical infrastructure software. This area is particularly relevant to the Artificial Intelligence Cyber Challenge (AIxCC), a competition that encourages the development of AI systems to secure critical code.

In our interconnected world, software underpins everything from financial systems to public utilities. As this code enables modern life and drives productivity, it also creates an expanding attack surface for malicious actors. The AIxCC is a two-year competition that asks the best and brightest in AI and cybersecurity to defend the software on which almost everyone relies. The competition will award a cumulative $30 million in prizes to teams with the best systems, including $7 million in prizes to small businesses to empower entrepreneurial innovation.

The AIxCC is particularly focused on securing open-source software, which comprises most of the code running on critical infrastructure today, including the electricity and telecommunications sectors. The competition is collaborating closely with the open-source community to guide teams in creating AI systems capable of addressing vital cybersecurity issues.

The challenge is to develop innovative systems guided by AI and Machine Learning to semi-automatically find and fix software vulnerabilities [2]. The AIxCC competition will foster innovative research via a gamified environment that challenges the competitors to design Cyber Reasoning Systems (CRSs) that integrate novel AI [4].

In the context of C6AI (Combined C++ Code Cybersecurity & CWE-based Classification AI), the focus is on using text classification methods to analyse and classify C++ code for potential vulnerabilities. This involves converting the raw text of the code into numerical feature vectors that can be processed by machine learning algorithms. Techniques such as text stemming and n-gram tokenization are used in this preprocessing stage.

In summary, the domain-specific area is the intersection of AI, cybersecurity, and software vulnerability detection, with a particular focus on static analysis of C++ code. The goal is to develop AI systems that can effectively identify and address software vulnerabilities, thereby enhancing the security of critical infrastructure.

Ref: [1] https://aicyberchallenge.com/about [2] https://www.sbir.gov/node/2464965 [3] https://www.darpa.mil/news-events/2023-12-14 [4] https://openssf.org/blog/2023/12/19/deconstructing-the-ai-cyber-challenge-aixcc/

2. Description of the selected dataset

The Juliet C/C++ 1.3.1 SARD dataset is a collection of test cases in the C/C++ language, organized under 118 different Common Weakness Enumerations (CWEs). This dataset is part of the Software Assurance Reference Dataset (SARD) provided by the National Institute of Standards and Technology (NIST) as ‘Juliet C/C++ 1.3 with extra support’ @ https://samate.nist.gov/SARD/test-suites/116.

The dataset is designed to test software for potential vulnerabilities and weaknesses. Each test case in the dataset is associated with a specific CWE, which represents a type of software vulnerability. The dataset includes both 'good' and 'bad' examples, with the 'bad' examples demonstrating the vulnerability and the 'good' examples showing a correct or safe way to write the code.

The Juliet C/C++ 1.3.1 SARD dataset is publicly available and not subject to copyright protection. It is made available under the CC0 1.0 Public Domain License.

The dataset is structured so that each CWE has its own directory, and each directory contains multiple text files, each representing a test case. The test cases are labelled with the CWE-ID, which identifies the type of vulnerability that the test case is associated with.
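
As a minimal sketch of how such a layout could be turned into labelled examples, the snippet below walks a directory tree and takes the name of each top-level directory as the label. The function name `load_labelled_cases`, the `.cpp` suffix filter and the tiny temporary tree are assumptions made for illustration; this is not the notebook's actual loader.

```python
import tempfile
from pathlib import Path


def load_labelled_cases(root, suffix=".cpp"):
    """Return (label, source_text) pairs, one per matching file under root.

    The label is the name of the first directory below root, which is
    assumed to be the CWE directory (e.g. 'CWE415_Double_Free').
    """
    root = Path(root)
    cases = []
    for path in sorted(root.rglob(f"*{suffix}")):
        label = path.relative_to(root).parts[0]
        cases.append((label, path.read_text(encoding="utf-8", errors="replace")))
    return cases


# Build a tiny, made-up tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for cwe, name in [("CWE415_Double_Free", "a.cpp"), ("CWE401_Memory_Leak", "b.cpp")]:
        case_dir = Path(tmp) / cwe
        case_dir.mkdir()
        (case_dir / name).write_text("int main() { return 0; }\n", encoding="utf-8")
    cases = load_labelled_cases(tmp)

labels = [label for label, _ in cases]
print(labels)
```

The resulting pairs could then be passed to `pandas.DataFrame` with column names such as 'CWE-ID' and 'Test-Case-Code'.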

The compressed archive of 'Juliet C/C++ 1.3 with extra support' is 671 MB and contains 64,099 test cases. The SARD as a whole holds over 170,000 programs, and the Juliet C/C++ 1.3 test suite is one part of that collection, so the dataset is large. The data are primarily text, since the test cases are C/C++ source code stored in text files.

The dataset is typically used in machine learning experiments, where it is divided into training, validation, and test sets. The SARD dataset has already been divided into training and test sets, but it lacks a validation set. Therefore, it is common practice to create a validation set using an 80:20 split of the training data.
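
The 80:20 split described above can be sketched with scikit-learn's `train_test_split`. The sample texts, the labels and the choice to stratify by label are assumptions for illustration only; the notebook may split its data differently.

```python
from sklearn.model_selection import train_test_split

# Made-up stand-ins for the training sources and their CWE labels.
train_texts = [f"void case_{i}() {{}}" for i in range(10)]
train_labels = ["CWE415_Double_Free", "CWE401_Memory_Leak"] * 5

# Hold out 20% of the training data for validation, keeping class proportions.
fit_texts, val_texts, fit_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.2, stratify=train_labels, random_state=0
)
print(len(fit_texts), len(val_texts))
```

Stratifying keeps every CWE class present in both parts, which matters when some classes have far fewer test cases than others.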

3. Objectives of the project

The objectives of the project are to enhance the out-of-sample generalization capabilities of the currently developed C6AI Cyber Reasoning System (CRS) and to measure its 'Vulnerability Discovery Accuracy'. This is in line with the AIxCC CRS Areas of Excellence, which emphasize the importance of developing systems that can accurately identify vulnerabilities in software, particularly in the context of critical infrastructure.

The project aims to contribute to the AI Cyber Challenge (AIxCC) by developing a CRS that can effectively and efficiently detect vulnerabilities in C++ code. The focus on out-of-sample generalization is crucial because it ensures that the system can perform well on new, unseen data, which is a common scenario in real-world applications. The ability to generalize well is indicative of a system's robustness and its potential to adapt to evolving cybersecurity threats.

The impact of achieving these objectives is significant. By improving the accuracy of vulnerability discovery, the project directly contributes to the security of critical infrastructure software. This has far-reaching implications for national security, economic stability, and public safety, as critical infrastructure systems are essential to the functioning of society.

Moreover, the project's contributions to the AIxCC challenge could lead to advancements in the field of AI and cybersecurity. By participating in the gamified environment of the competition, the project fosters innovation and encourages the development of new techniques and methodologies in AI-driven cybersecurity.

The potential contributions of the results to the AIxCC challenge could include:

  1. Demonstrating the effectiveness of the C6AI CRS in accurately identifying and classifying software vulnerabilities.
  2. Providing insights into the strengths and weaknesses of the current approaches to SAST and vulnerability detection.
  3. Offering a benchmark for future research and development in the domain of AI-powered cybersecurity solutions.
  4. Encouraging the adoption of AI and machine learning techniques in the cybersecurity industry, particularly for the protection of critical infrastructure.

In summary, the project's objectives are to develop a CRS that excels in out-of-sample generalization and vulnerability discovery accuracy, with the potential to make significant contributions to the AIxCC challenge and the broader field of cybersecurity.

4. Evaluation methodology

The evaluation methodology uses four metrics to assess how well the C6AI Cyber Reasoning System (CRS) identifies vulnerabilities in C++ code: the share of all predictions that are correct (accuracy), the share of flagged vulnerabilities that are real (precision), the share of actual vulnerabilities that are found (recall), and a single score that balances precision and recall (F-measure).

  1. Accuracy: This is the most intuitive performance measure: the ratio of correctly predicted observations to the total observations. It reflects the ability of the model to correctly identify both vulnerabilities and non-vulnerabilities. It is calculated as (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives).

  2. Precision: Precision is the ratio of correctly predicted positive observations to the total predicted positives. It is also called Positive Predictive Value. It measures how many of the identified vulnerabilities are actual vulnerabilities. It is calculated as True Positives / (True Positives + False Positives).

  3. Recall (Sensitivity): Recall is the ratio of correctly predicted positive observations to all observations in the actual class. It is also called Sensitivity, Hit Rate, or True Positive Rate. It measures the ability of the model to identify all actual vulnerabilities. It is calculated as True Positives / (True Positives + False Negatives).

  4. F-Measure (F1 Score): The F1 Score is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account. It is more informative than accuracy when the class distribution is uneven. It is calculated as 2 × (Precision × Recall) / (Precision + Recall).
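
The four formulas can be checked with a few lines of arithmetic. The sketch below treats one class as "positive" and uses made-up confusion counts, not results from this project; the helper name `binary_metrics` is hypothetical, and it assumes every denominator is non-zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Illustrative counts only.
accuracy, precision, recall, f1 = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.850 precision=0.800 recall=0.889 f1=0.842
```

For a multi-class problem such as CWE classification, the same quantities are computed per class and then averaged.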

The evaluation will be conducted using a test set that the model has not been trained on to ensure an unbiased assessment of the model's performance. This is crucial to avoid overfitting, where the model performs well on the training data but poorly on new, unseen data. The test data will be representative of the real-world data the model will encounter, ensuring the evaluation reflects the model's true predictive performance.

In addition, the project will employ cross-validation to make the evaluation more robust. In n-fold cross-validation, the data is divided into n non-overlapping subsets. The model is trained on n-1 subsets and tested on the remaining subset. This process is repeated n times, with each subset used once as the test set. The scores from the n trials are then averaged to estimate the overall accuracy of the model.
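
A minimal sketch of n-fold cross-validation with scikit-learn follows, using n = 5. The synthetic count features, the choice of `MultinomialNB` as the estimator and the use of one standard deviation as the reported spread are all assumptions for illustration; the notebook's own folds and spread convention may differ.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Synthetic count features: class 0 mostly uses feature 0, class 1 feature 1.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 20)
X = rng.integers(0, 3, size=(40, 4))
X[y == 0, 0] += 10
X[y == 1, 1] += 10

# Five non-overlapping folds; each fold serves exactly once as the test set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")

# Average the per-fold scores and report their spread.
print(f"Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```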

The evaluation methodology will provide a comprehensive understanding of the model's performance, allowing for the identification of areas of strength and potential improvement. This will ultimately contribute to the development of a more accurate and robust Cyber Reasoning System.

II. Implementation

5. Pre-processing

The pre-processing for the text classification task in the provided Python file converts the raw text data into a format that machine learning algorithms can use. It consists of the following steps.

  1. Text Lowercasing: All the text is converted to lower case. This is done to ensure that the algorithm does not treat the same words in different cases as different words.

  2. Punctuation Removal: All punctuation marks are removed from the text. Punctuation does not add any extra information while training the machine learning model. Moreover, removing punctuation reduces the size of the vocabulary and thus increases the speed of training.

  3. Stop Words Removal: Stop words are the most common words in a language like 'the', 'a', 'on', 'is', 'all'. These words do not carry important meaning and are usually removed from texts. The Python file uses a list of English stop words from the NLTK library.

  4. Stemming: Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The Python file uses the Snowball Stemmer from the NLTK library.

  5. N-gram Tokenization: The text is tokenized into n-grams. N-grams are contiguous sequences of n items from a given sample of text or speech. This helps to capture the context and semantic meanings of phrases.

  6. Vectorization: The tokenized text is then converted into numerical vectors which can be used as input to the machine learning algorithm. The Python file uses the bag-of-words model to convert the text into vectors. The bag-of-words model represents each text as a vector in a high-dimensional space, where each unique term in the vocabulary corresponds to one dimension, and the value in that dimension is the frequency of the term in the text.

The Python file reads .cpp files as text into a pandas dataframe. The vocabulary is built from the unique words in the text after applying the pre-processing steps.
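
Most of these steps can be sketched with a single scikit-learn `CountVectorizer`. This is a simplified illustration, not the notebook's `process_text` function: stemming is left out, the stop-word list is the short example list from step 3 rather than the NLTK list, and the two input sentences are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up inputs; the notebook works on C++ sources instead.
docs = ["The buffer is freed twice.", "The buffer is never freed!"]

vectorizer = CountVectorizer(
    lowercase=True,                              # step 1: lowercasing
    stop_words=["the", "a", "on", "is", "all"],  # step 3: example stop words
    ngram_range=(1, 2),                          # step 5: unigrams and bigrams
)
# The default token pattern keeps only runs of word characters,
# so punctuation is dropped (step 2). fit_transform builds the
# vocabulary and the bag-of-words count matrix (step 6).
counts = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

Each row of the matrix is one document and each column is one vocabulary term, in the sorted order printed above.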

6. Baseline performance

The Naive Bayes classifier was chosen as the baseline for the C6AI Cyber Reasoning System (CRS) project due to its simplicity, efficiency, and proven effectiveness in text classification tasks. This classifier was implemented using the scikit-learn library, as shown in the attached Python file.

The Naive Bayes classifier was selected as the baseline because it is a well-established algorithm in the field of text classification and has been used extensively in previous research, including by our lecturer for Statistical Data Mining Dr. Noureddin Sadawi. It is a probabilistic classifier that makes use of Bayes' theorem with strong independence assumptions between the features. It is particularly suited for high-dimensional datasets, like text data, and is known for its efficiency and scalability.

The MultinomialNB accuracy of 0.74 (+/- 0.03) provides a meaningful benchmark for comparison with the more complex Convolutional Neural Network (CNN) model. The CNN model, implemented using the Keras library, is expected to outperform the Naive Bayes classifier because it can capture local dependencies in the data and learn hierarchical features. The Naive Bayes result therefore serves as the reference point for measuring how much the CNN model improves on a simple approach.

In conclusion, the Naive Bayes classifier was chosen as the baseline due to its simplicity, efficiency, and proven effectiveness in text classification tasks. Its performance provides a meaningful benchmark for comparison with the more complex CNN model.

#                                               precision    recall  f1-score   support

#           CWE121_Stack_Based_Buffer_Overflow       0.91      0.71      0.80       324
#            CWE122_Heap_Based_Buffer_Overflow       0.87      0.66      0.75       316
#                     CWE124_Buffer_Underwrite       0.63      0.85      0.72       331
#                       CWE126_Buffer_Overread       0.86      0.92      0.89       335
#                      CWE127_Buffer_Underread       0.92      0.77      0.83       333
#            CWE134_Uncontrolled_Format_String       0.98      0.90      0.94       350
#                      CWE190_Integer_Overflow       0.94      0.79      0.86       298
#                     CWE191_Integer_Underflow       0.95      0.77      0.85       294
#             CWE194_Unexpected_Sign_Extension       1.00      0.59      0.74       165
#   CWE195_Signed_to_Unsigned_Conversion_Error       0.99      0.55      0.71       158
#              CWE197_Numeric_Truncation_Error       0.71      0.90      0.79       350
#                CWE23_Relative_Path_Traversal       0.91      0.98      0.95       350
#                        CWE369_Divide_by_Zero       0.83      0.91      0.87       350
#                CWE36_Absolute_Path_Traversal       0.82      0.91      0.86       350
#                   CWE400_Resource_Exhaustion       1.00      0.79      0.89       156
#                           CWE401_Memory_Leak       0.81      0.82      0.81       333
#                           CWE415_Double_Free       0.76      0.76      0.76       350
#         CWE457_Use_of_Uninitialized_Variable       0.94      0.97      0.96       297
#                       CWE563_Unused_Variable       0.81      1.00      0.89       350
#               CWE590_Free_Memory_Not_on_Heap       0.88      0.88      0.88       348
#   CWE680_Integer_Overflow_to_Buffer_Overflow       0.98      0.85      0.91       301
#                CWE690_NULL_Deref_From_Return       1.00      0.31      0.47       167
# CWE762_Mismatched_Memory_Management_Routines       0.79      0.85      0.82       349
#                CWE789_Uncontrolled_Mem_Alloc       0.57      0.97      0.72       323
#                   CWE78_OS_Command_Injection       1.00      0.95      0.97       350

#                                     accuracy                           0.84      7628
#                                    macro avg       0.87      0.81      0.83      7628
#                                 weighted avg       0.86      0.84      0.84      7628

7. Classification approach

The C6AI Cyber Reasoning System (CRS) project used a Naive Bayes classifier for text classification, specifically for identifying vulnerabilities in software code. The features used in the classifier were derived from the 'Test-Case-Code' and the labels were the 'CWE-ID'.

The 'Test-Case-Code' features were chosen because they represent the actual code snippets that could potentially contain vulnerabilities. These features were transformed into a bag-of-words representation and then weighted using TF-IDF (Term Frequency-Inverse Document Frequency). This transformation was crucial in converting the raw text into a numerical format that the classifier could process.

The 'CWE-ID' [a standard identifier for software vulnerabilities, allowing the results to be easily interpreted] was used as the target label because it represents the specific type of vulnerability present in the code. The classifier was trained to predict this label based on the features derived from the 'Test-Case-Code'.
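
The feature and label setup described above can be sketched as a scikit-learn `Pipeline` that chains bag-of-words counting, TF-IDF weighting and a multinomial Naive Bayes classifier. The code snippets and shortened labels below are made up for illustration, and the default hyperparameters are an assumption rather than the notebook's settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up snippets standing in for 'Test-Case-Code';
# the labels stand in for 'CWE-ID'.
codes = [
    "free(data); free(data);",
    "data = malloc(10); free(data); free(data);",
    "data = malloc(10); return;",
    "data = malloc(20); data = NULL;",
]
cwe_ids = ["CWE415", "CWE415", "CWE401", "CWE401"]

model = Pipeline([
    ("bow", CountVectorizer()),      # raw text -> token counts
    ("tfidf", TfidfTransformer()),   # counts -> TF-IDF weights
    ("nb", MultinomialNB()),         # weighted features -> CWE label
])
model.fit(codes, cwe_ids)

prediction = model.predict(["free(data); free(data); free(data);"])
print(prediction[0])
```

Wrapping the steps in one pipeline means the same vocabulary and weighting fitted on the training data are reused unchanged at prediction time.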

The Naive Bayes classifier was chosen for its simplicity and efficiency in text classification tasks. It was implemented using the scikit-learn library for the baseline model. For the final model, a Convolutional Neural Network (CNN) was built using the Keras library. CNNs are known for their effectiveness in text classification tasks, as they can capture local dependencies in the text and can manage variable-length inputs.

The Python script used for training and evaluating the classifier was designed to be easily understood and modified, enhancing the project's reproducibility. This means that the approach can be replicated by others using different programming languages, development environments, libraries, and algorithms.

8. Coding style

The provided Python code adheres to several key coding conventions, which are crucial for maintaining high-quality, readable, and maintainable code.

  1. Indentation: The code uses consistent indentation, which is a fundamental aspect of Python syntax and crucial for code readability.

  2. Variable Naming: The code uses meaningful names for variables, which makes the code more understandable and maintainable. For example, porter_stemmer, stop_words, and global_start are all descriptive variable names that give a clear indication of their purpose in the code.

  3. Use of Libraries: The code makes extensive use of libraries, including nltk, tensorflow, keras, numpy, pandas, and sklearn, among others. This is a good practice as it leverages existing, well-tested functionality and can make the code more concise and efficient.

  4. Comments: The code includes numerous comments, which are essential for explaining the purpose of code blocks, the functionality of functions, and the meaning of variables. This is a good practice as it makes the code more understandable for others (and for the original coder at a later date).

  5. Avoiding Magic Numbers: The code defines several constants at the beginning (like epochs, batch_size, seed, etc.), which is a good practice as it avoids the use of unnamed numerical constants ("magic numbers") in the code. This makes the code more readable and easier to modify.

  6. Code Organization: The code is well-organized, with clear sections for importing libraries, setting up variables, defining functions, and executing code. This organization makes the code easier to follow and understand.

In summary, the code in the provided Python file appears to follow good coding practices, including consistent indentation, meaningful variable names, extensive use of libraries, comprehensive comments, avoidance of magic numbers, and clear organization. These practices contribute to making the code high-quality, readable, and maintainable.

III. Outcome Conclusions

9. Evaluation

The evaluation of the C6AI Cyber Reasoning System (CRS) classifier was performed using the Python scripts provided in this notebook. The scripts first train several common-sense baseline models with statistical data mining algorithms [starting with the Naive Bayes classifier], then train a baseline-beating CNN model, and finally make predictions on the entire dataset.

The script uses the following metrics for evaluation:

  1. Accuracy: This metric measures the ratio of correctly predicted observations to the total observations. It is the ability of the model to correctly identify both vulnerabilities and non-vulnerabilities.

  2. Precision: This metric measures the ratio of correctly predicted positive observations to the total predicted positives. It measures how many of the identified vulnerabilities are actual vulnerabilities.

  3. Recall (Sensitivity): This metric measures the ratio of correctly predicted positive observations to all observations in the actual class. It is a measure of the ability of the model to identify all actual vulnerabilities.

  4. F-Measure (F1 Score): This metric is the harmonic mean of Precision and Recall. Therefore, this score takes both false positives and false negatives into account.

The script uses scikit-learn's built-in classification report to return these metrics.
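
As a small illustration of that report, the sketch below calls `sklearn.metrics.classification_report` on made-up true and predicted labels. The label names are shortened and the values are not results from this project.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only.
y_true = ["CWE415", "CWE415", "CWE401", "CWE401", "CWE401"]
y_pred = ["CWE415", "CWE401", "CWE401", "CWE401", "CWE415"]

# Text form, as printed in a notebook cell.
print(classification_report(y_true, y_pred, zero_division=0))

# Dictionary form, convenient for reading individual values in code.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
print(report["accuracy"], report["CWE415"]["precision"])
```

The report lists precision, recall, F1 score and support per class, followed by overall accuracy and the macro and weighted averages.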

The results of the evaluation provide a quantitative measure of the performance of the CRS classifier. By comparing these results with a suitable baseline, we can assess the improvement achieved by our approach. The specific values of these metrics depend on the actual data used for training and testing the classifier [as shown and validated in the rest of the notebook].

# Evaluated the model on the test set
## 87/87 [==============================] - 1s 11ms/step - loss: 0.2339 - accuracy: 0.9259
## Test loss: 0.23388110101222992
## Test accuracy: 0.9259259104728699

10. Summary of the project and its results

The baseline-beating CNN model reached a test accuracy of 0.926, compared with an accuracy of 0.74 (+/- 0.03) for MultinomialNB and 0.87 (+/- 0.02) for RandomForestClassifier. Once further developed [using creative advances out of scope for this elementary NLP assignment], the C6AI Cyber Reasoning System (CRS) project could make significant contributions to the field of text classification, particularly in the context of identifying vulnerabilities in software code. While the project ultimately employed a CNN model [amongst others], its initial choice was a Naive Bayes classifier, a popular choice for text classification tasks due to its simplicity and efficiency. The baseline classifier was trained and evaluated using Python scripts built on scikit-learn, which were designed to be easily understood and modified, enhancing the project's reproducibility.

The project's preprocessing steps, including the transformation of text into a bag-of-words representation and the use of TF-IDF weighting, were crucial in preparing the data for the classifier. These steps converted the raw text into a format that the classifier could process, and they could be readily adapted for other text classification tasks in different domains.

The CRS classifier demonstrated robust performance across several evaluation metrics, including accuracy, precision, recall, and F1 score. These metrics provide a comprehensive assessment of the classifier's performance, considering both its ability to correctly identify vulnerabilities and its ability to avoid false positives and false negatives.

The project's approach is highly transferable to other domain-specific areas that involve text classification. The preprocessing steps and the Naive Bayes classifier can be applied to any text data, provided that the data is labelled for supervised learning. Furthermore, the Python script can be easily modified to accommodate different data sources, classification algorithms, or evaluation metrics.

The project's approach can also be replicated using different programming languages, development environments, libraries, and algorithms. The key steps of the approach, including text preprocessing, classifier training, and performance evaluation, are common tasks in machine learning and natural language processing, and they can be implemented in many programming languages that support these tasks, such as R, Java, or C++. Similarly, different development environments or libraries [such as the new KerasNLP or the JAX/PyTorch-backed Keras 3.0 API] can be used to provide the necessary functionalities for these tasks.

While the Naive Bayes classifier baseline was effective, the more complex CNN model achieved higher accuracy on this dataset. However, such a model also has drawbacks, such as increased computational cost and a higher risk of overfitting. Therefore, the choice of classifier should be guided by the specific requirements and constraints of each task.

1st UoL DSM140 CW - Code

CONTENT_PATH='/content/'
# !cd $CONTENT_PATH && ls
!ls
!rm -r sample_data
sample_data

Env Prepping

Import Libraries

# import codecs,collections,csv,glob,io,itertools,json,logging,nltk,pathlib,\
# pickle,pprint,pytest,re,requests,shutil,string,sys,unicodedata,warnings,zipfile
import codecs,collections,csv,glob,io,itertools,json,logging,nltk,os,pathlib,\
pickle,pprint,pytest,re,requests,shutil,string,sys,time,unicodedata,warnings,zipfile
# import time
## from time import time
global_start = time.time()
## TensorFlow backend only supports string inputs
# import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
from keras import layers
!pip install -q "tensorflow-text" # ==2.13.*"
# !pip install -q "tensorflow-text" # ==2.13.*"
import tensorflow_text as tf_text

# import keras
import tensorflow as tf
import tensorflow.data as tf_data
import tensorflow_datasets as tfds

# from keras import layers
from tensorflow.keras import Model
# from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import utils
from tensorflow.keras.layers import Conv1D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import MaxPooling1D
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.layers import concatenate
from tensorflow.keras.utils import plot_model
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns; sns.set()

import sklearn.feature_extraction.text
import sklearn.metrics
from matplotlib import pyplot as plt
from pandas.core.frame import DataFrame
from numpy.testing import assert_array_equal
from google.colab import files
from tempfile import NamedTemporaryFile
from tqdm.notebook import tqdm
from typing import KeysView

from os import listdir
from os.path import isfile, join
from operator import itemgetter
from optparse import OptionParser
from collections import Counter
from collections import defaultdict
from collections import namedtuple
from collections import OrderedDict
from scipy.cluster.hierarchy import dendrogram
from scipy.sparse import csr_matrix
from scipy.special import logit
from scipy.stats.distributions import uniform
from gensim.models import word2vec
from gensim.models import Word2Vec
from gensim.models.phrases import Phraser
from gensim.models.phrases import Phrases

from keras import initializers, regularizers, constraints, optimizers, layers
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

from nltk import ngrams
# from nltk.util import ngrams
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

from sklearn import datasets
from sklearn import metrics
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.base import RegressorMixin
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import dump_svmlight_file
from sklearn.datasets import fetch_20newsgroups
from sklearn.datasets import load_files
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import ComplementNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array
from sklearn.utils import check_X_y
from sklearn.utils import shuffle
from sklearn.utils.extmath import density
from sklearn.utils.extmath import log_logistic
from sklearn.utils.multiclass import unique_labels
## Importing basic python libraries

%matplotlib inline
from __future__ import print_function

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
True

Setting static and global variables

epochs=15 #30 #10 #2 #10
new_num_labels=25 #4
batch_size = 32
seed = 0 # 42
VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 250

# Global values.
WORDS_SIZE=10000
INPUT_SIZE=500
NUM_CLASSES=new_num_labels #5 # 2 # NUM_CLASSES=2
MODEL_NUM=0
EPOCHS=epochs #15 #10

# Preprocessing params.
PRETRAINING_BATCH_SIZE = 128
FINETUNING_BATCH_SIZE = 32
SEQ_LENGTH = 128
MASK_RATE = 0.25
PREDICTIONS_PER_SEQ = 32

# Model params.
NUM_LAYERS = 3
MODEL_DIM = 256
INTERMEDIATE_DIM = 512
NUM_HEADS = 4
DROPOUT = 0.1
NORM_EPSILON = 1e-5

# Training params.
PRETRAINING_LEARNING_RATE = 5e-4
PRETRAINING_EPOCHS = 8
FINETUNING_LEARNING_RATE = 5e-5
FINETUNING_EPOCHS = 3
# Generate random seed
#myrand=np.random.randint(1, 99999 + 1)
rand=seed # 1234 # 71926
np.random.seed(rand)
tf.random.set_seed(rand)
print("Random seed is:", rand)
Random seed is: 0
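The cell above seeds NumPy and TensorFlow. As a minimal sketch, the same idea can be wrapped in one helper that also seeds Python's built-in `random` module. `set_global_seed` is a hypothetical name that does not appear in the notebook, and the NumPy import is guarded so the snippet still runs where NumPy is missing. TensorFlow seeding is assumed to stay as in the cell above.

```python
import random


def set_global_seed(seed: int) -> None:
    """Seed Python's RNG and, if it is installed, NumPy's legacy global RNG."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; only the stdlib RNG is seeded.


set_global_seed(0)
first = [random.random() for _ in range(3)]
set_global_seed(0)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Re-seeding before each experiment makes the draws repeatable, which is what allows two runs of the same model to be compared.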
print("TensorFlow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
# tf.test.is_gpu_available() is deprecated; query the physical device list instead.
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

## Earlier runs, kept for reference:
## TensorFlow version:  2.13.1
## Eager mode:  True
## GPU is NOT AVAILABLE

## TensorFlow version:  2.15.0
## Eager mode:  True
## T4 GPU is available
## High RAM
TensorFlow version:  2.15.0
Eager mode:  True
GPU is NOT AVAILABLE

Common-sense/Baseline MODELs:

# Baseline/Common-sense MODELs

1st model per own DSM030 (Statistical Data Mining) CW1

Text-Prepping

porter_stemmer=PorterStemmer()
# stemmer = PorterStemmer()
stemmer=SnowballStemmer("english",ignore_stopwords=True)

def n_vect(text_list, n=3):
    """Return the word n-grams of text_list as space-joined strings."""
    return [' '.join(item) for item in ngrams(text_list, n)]
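To make the n-gram step concrete, here is a minimal pure-Python sketch of the sliding window that `ngrams` applies when no padding is requested. `sliding_ngrams` is a hypothetical stand-in written for illustration and is not the notebook's function.

```python
def sliding_ngrams(tokens, n=3):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['test', 'processtext', 'function', 'result', 'follow']
print(sliding_ngrams(tokens, n=3))
# ['test processtext function', 'processtext function result', 'function result follow']
print(sliding_ngrams(tokens, n=6))
# []
```

A sequence shorter than `n` yields no n-grams at all, so very short files contribute nothing when `n` is large.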

def process_text(text, n=1):
    """Lower-case, strip punctuation and stopwords, stem, then return word n-grams."""
    ## Coerce non-string input to a string
    if not isinstance(text, str):
        text = str(text)

    stop_words = set(stopwords.words('english'))

    #1. Convert text to lower case and remove all punctuation
    text = text.lower()
    nopunc = ''.join(char for char in text if char not in string.punctuation)

    #2. Remove all stopwords
    removed_stop_words = [word for word in nopunc.split() if word not in stop_words]

    #3. Apply stemming (Snowball stemmer defined above)
    stemming = [stemmer.stem(word) for word in removed_stop_words]

    #4. Apply n-gram tokenisation
    tokenised = n_vect(stemming, n)

    #5. Drop characters that cannot be encoded as UTF-8
    tokenised = [word.encode("utf-8", "ignore").decode("utf-8") for word in tokenised]

    # [OPTIONAL] Disabled alternatives: Porter stemming, WordNet lemmatisation,
    # and sklearn's CountVectorizer(ngram_range=(n, n+1)).

    #6. Return the tokenised text as a list
    return tokenised
## Testing on a simple string
process_text("Here we're testing the process_text function, results are as follows:", n=3) # , n=1
['test processtext function',
 'processtext function result',
 'function result follow']
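The token lists produced this way still have to become numerical feature vectors before a classifier can use them. The sketch below shows one simple way to do that with plain counts over a fixed vocabulary. `build_vocabulary` and `to_count_vector` are hypothetical helpers standing in for a library vectoriser, and the two toy documents are made up.

```python
from collections import Counter


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(vocab)}


def to_count_vector(tokens, vocabulary):
    """Return one count per vocabulary entry; unknown tokens are ignored."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocabulary]


# Toy bigram lists standing in for process_text(..., n=2) output.
docs = [
    ['char data', 'data null', 'char data'],
    ['int data', 'data null'],
]
vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # {'char data': 0, 'data null': 1, 'int data': 2}
print(vectors)     # [[2, 1, 0], [0, 1, 1]]
```

Each document becomes a row of equal length, which is the shape that count-based models such as Naive Bayes expect.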
"""
## Download the SARD cpp_8750_files data
"""
# !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# %%time
data_url = 'https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='cache_dir',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
## CPU times: user 366 µs, sys: 0 ns, total: 366 µs
## Wall time: 348 µs
Downloading data from https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
895869/895869 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'),
 PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz')]
# Load text files with categories as subfolder names.
# data = datasets.load_files('20news-bydate-test')
data = datasets.load_files('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted')
## cd /tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
# !pwd ## /content
!ls -lh /tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
total 464K
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE121_Stack_Based_Buffer_Overflow
drwxr-xr-x 2 root root 16K Jan  2 08:51 CWE122_Heap_Based_Buffer_Overflow
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE124_Buffer_Underwrite
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE126_Buffer_Overread
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE127_Buffer_Underread
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE134_Uncontrolled_Format_String
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE190_Integer_Overflow
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE191_Integer_Underflow
drwxr-xr-x 2 root root 12K Jan  2 04:10 CWE194_Unexpected_Sign_Extension
drwxr-xr-x 2 root root 12K Jan  2 04:10 CWE195_Signed_to_Unsigned_Conversion_Error
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE197_Numeric_Truncation_Error
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE23_Relative_Path_Traversal
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE369_Divide_by_Zero
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE36_Absolute_Path_Traversal
drwxr-xr-x 2 root root 12K Jan  2 04:10 CWE400_Resource_Exhaustion
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE401_Memory_Leak
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE415_Double_Free
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE457_Use_of_Uninitialized_Variable
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE563_Unused_Variable
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE590_Free_Memory_Not_on_Heap
drwxr-xr-x 2 root root 16K Jan  2 04:10 CWE680_Integer_Overflow_to_Buffer_Overflow
drwxr-xr-x 2 root root 12K Jan  2 04:10 CWE690_NULL_Deref_From_Return
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE762_Mismatched_Memory_Management_Routines
drwxr-xr-x 2 root root 20K Jan  2 04:10 CWE789_Uncontrolled_Mem_Alloc
drwxr-xr-x 2 root root 24K Jan  2 04:10 CWE78_OS_Command_Injection
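The archive name suggests 350 files for each of the 25 classes, yet `load_files` reports 7628 samples below, so a per-folder file count is a useful sanity check. The following is a minimal sketch under that assumption. `files_per_class` is a hypothetical helper, and the temporary directory is a toy stand-in for the extracted dataset.

```python
import pathlib
import tempfile


def files_per_class(root):
    """Return {subfolder name: number of regular files directly inside it}."""
    root = pathlib.Path(root)
    return {
        sub.name: sum(1 for f in sub.iterdir() if f.is_file())
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


# Toy directory tree standing in for the extracted archive.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_files in [('CWE121_Stack_Based_Buffer_Overflow', 3),
                          ('CWE415_Double_Free', 2)]:
        folder = pathlib.Path(tmp) / name
        folder.mkdir()
        for i in range(n_files):
            (folder / f'sample_{i}.cpp').write_text('int main() { return 0; }')
    counts = files_per_class(tmp)

print(counts)
# {'CWE121_Stack_Based_Buffer_Overflow': 3, 'CWE415_Double_Free': 2}
```

Pointing the helper at the real dataset folder would show which CWE classes hold fewer files than the rest.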
data.keys()
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])
# data.target
print(len(data.target)) ## 7628
print(data.target) ## [11  3 18 ...  5  8  8]

data_targets_list = list(np.unique(data.target))
print(len(data_targets_list))  ## 25 ## number of unique targets
print(data_targets_list)  ## list of unique targets
## [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
7628
[11  3 18 ...  5  8  8]
25
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
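Knowing that there are 25 distinct labels does not say how evenly the samples are spread across them. A minimal sketch of a class-distribution check follows. `class_distribution` is a hypothetical helper and the labels are made up. With the real data one would presumably pass `data.target.tolist()`.

```python
from collections import Counter


def class_distribution(labels):
    """Return (label, count) pairs sorted by label."""
    return sorted(Counter(labels).items())


toy_targets = [11, 3, 18, 3, 11, 11]  # made-up labels, not the real data.target
dist = class_distribution(toy_targets)
print(dist)       # [(3, 2), (11, 3), (18, 1)]
print(len(dist))  # 3
print(min(c for _, c in dist), max(c for _, c in dist))  # 1 3
```

A large gap between the smallest and largest count signals imbalance, in which case plain accuracy may be a misleading metric.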
data.target_names[13]
'CWE36_Absolute_Path_Traversal'
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
 'CWE122_Heap_Based_Buffer_Overflow',
 'CWE124_Buffer_Underwrite',
 'CWE126_Buffer_Overread',
 'CWE127_Buffer_Underread',
 'CWE134_Uncontrolled_Format_String',
 'CWE190_Integer_Overflow',
 'CWE191_Integer_Underflow',
 'CWE194_Unexpected_Sign_Extension',
 'CWE195_Signed_to_Unsigned_Conversion_Error',
 'CWE197_Numeric_Truncation_Error',
 'CWE23_Relative_Path_Traversal',
 'CWE369_Divide_by_Zero',
 'CWE36_Absolute_Path_Traversal',
 'CWE400_Resource_Exhaustion',
 'CWE401_Memory_Leak',
 'CWE415_Double_Free',
 'CWE457_Use_of_Uninitialized_Variable',
 'CWE563_Unused_Variable',
 'CWE590_Free_Memory_Not_on_Heap',
 'CWE680_Integer_Overflow_to_Buffer_Overflow',
 'CWE690_NULL_Deref_From_Return',
 'CWE762_Mismatched_Memory_Management_Routines',
 'CWE789_Uncontrolled_Mem_Alloc',
 'CWE78_OS_Command_Injection']
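The names are sorted as strings, so the label order is not numeric CWE order: `CWE23` follows `CWE197` in the list above. A minimal sketch of mapping label indices back to a numeric CWE id and a readable title is given below. `split_cwe_name` is a hypothetical helper and the short `target_names` list is a toy subset.

```python
import re

# Toy subset of the class names shown above.
target_names = ['CWE121_Stack_Based_Buffer_Overflow',
                'CWE23_Relative_Path_Traversal',
                'CWE78_OS_Command_Injection']


def split_cwe_name(folder_name):
    """Split 'CWE121_Stack_...' into (121, 'Stack ...')."""
    match = re.fullmatch(r'CWE(\d+)_(.+)', folder_name)
    if match is None:
        raise ValueError(f'not a CWE folder name: {folder_name!r}')
    return int(match.group(1)), match.group(2).replace('_', ' ')


targets = [2, 0, 1, 0]  # toy label indices into target_names
decoded = [split_cwe_name(target_names[t]) for t in targets]
print(decoded[0])  # (78, 'OS Command Injection')
print(decoded[1])  # (121, 'Stack Based Buffer Overflow')
```

Indexing `target_names` with the label is what ties a prediction back to its CWE category when reports and confusion matrices are read.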

...

data_target_tuples = []
# Look up each sample's class name by its label index. Zipping with
# data['target_names'] directly would stop after 25 items and mispair names.
for text, category in zip(data['data'], data['target']):
    decoded = text.decode("cp1252")
    one_line = " ".join(decoded.splitlines())
    data_target_tuples.append((one_line, category, data['target_names'][category]))
len(data_target_tuples) ## 7628
7628
data_tuples = []
for text, category in zip(data['data'], data['target']):
    decoded = text.decode("cp1252")
    one_line = " ".join(decoded.splitlines())
    data_tuples.append((one_line, category))
len(data_tuples) ## 7628
7628
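The decode-and-flatten step can be isolated into a small function. The sketch below is an illustration, and `to_one_line` is a hypothetical name. Unlike the cell above, it passes `errors="replace"`, which is an assumption added here: a few byte values (for example `0x81`) are undefined in Python's cp1252 codec and would otherwise raise `UnicodeDecodeError`.

```python
def to_one_line(raw: bytes, encoding: str = "cp1252") -> str:
    """Decode raw file bytes and join all lines with single spaces."""
    decoded = raw.decode(encoding, errors="replace")
    return " ".join(decoded.splitlines())


raw = b'#include "std_testcase.h"\r\nvoid bad()\n{\n    int data;\n}\n'
print(to_one_line(raw))
# #include "std_testcase.h" void bad() {     int data; }
```

`splitlines()` handles both `\n` and `\r\n` endings, so files written on Windows and on Unix flatten to the same single-line form. Indentation inside each line is kept, which explains the runs of spaces in the `Text` column.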
df = pd.DataFrame(data_tuples, columns=['Text','Category'])
# data=df
df
      Text                                            Category
0     #include "std_testcase.h" #ifdef _WIN32 #d...   11
1     #include "std_testcase.h" #include <wchar....   3
2     #ifndef OMITGOOD #include "std_testcase.h" ...  18
3     #include "std_testcase.h" #include "environ...  24
4     #include "std_testcase.h" #include <wchar....   4
...   ...                                             ...
7623  #include "std_testcase.h" #include <wchar....   16
7624  #include "std_testcase.h" #ifdef _WIN32 #d...   11
7625  #include "std_testcase.h" #ifndef _WIN32 #...   5
7626  #include "std_testcase.h" #include <vector>...  8
7627  #include "std_testcase.h" #include <map> u...   8

7628 rows × 2 columns

Further Wrangling & EDA

df.shape ## (7628, 2)
df ## 7628 rows × 2 columns
      Text                                            Category
0     #include "std_testcase.h" #ifdef _WIN32 #d...   11
1     #include "std_testcase.h" #include <wchar....   3
2     #ifndef OMITGOOD #include "std_testcase.h" ...  18
3     #include "std_testcase.h" #include "environ...  24
4     #include "std_testcase.h" #include <wchar....   4
...   ...                                             ...
7623  #include "std_testcase.h" #include <wchar....   16
7624  #include "std_testcase.h" #ifdef _WIN32 #d...   11
7625  #include "std_testcase.h" #ifndef _WIN32 #...   5
7626  #include "std_testcase.h" #include <vector>...  8
7627  #include "std_testcase.h" #include <map> u...   8

7628 rows × 2 columns

df.dtypes
Text        object
Category     int64
dtype: object
# df[['Text',"Category"]].describe()
df.describe()
       Category
count  7628.000000
mean   11.998820
std    7.343498
min    0.000000
25%    5.000000
50%    12.000000
75%    18.000000
max    24.000000
# df[['Text',"Category"]].value_counts()
print(df.value_counts())

# Text  Category
# 0     10          1
# 5036  10          1
# 5048  8           1
# 5047  9           1
# 5046  8           1
#                  ..
# 2531  2           1
# 2530  3           1
# 2529  3           1
# 2528  4           1
# 7565  5           1
# Length: 7628, dtype: int64
Text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Category
   #include "std_testcase.h"   #define CHAR_ARRAY_SIZE 8  namespace fgets_33 {  #ifndef OMITBAD  void bad() {     short data;     short &dataRef = data;          data = -1;     {         char inputBuffer[CHAR_ARRAY_SIZE] = "";                  if (fgets(inputBuffer, CHAR_ARRAY_SIZE, stdin) != NULL)         {                          data = (short)atoi(inputBuffer);         }         else         {             printLine("fgets() failed.");         }     }     {         short data = dataRef;         {                          char charData = (char)data;             printHexCharLine(charData);         }     } }  #endif   #ifndef OMITGOOD   static void goodG2B() {     short data;     short &dataRef = data;          data = -1;          data = CHAR_MAX-5;     {         short data = dataRef;         {                          char charData = (char)data;             printHexCharLine(charData);         }     } }  void good() {     goodG2B(); }  #endif   }    #ifdef INCLUDEMAIN  using namespace fgets_33;   int main(int argc, char * argv[]) {          srand( (unsigned)time(NULL) ); #ifndef OMITGOOD     printLine("Calling good()...");     good();     printLine("Finished good()"); #endif  #ifndef OMITBAD     printLine("Calling bad()...");     bad();     printLine("Finished bad()"); #endif      return 0; }  #endif                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                10          1
   #include "std_testcase.h" #include <map>  using namespace std;  namespace rand_to_char_74 {  #ifndef OMITBAD   void badSink(map<int, int> dataMap);  void bad() {     int data;     map<int, int> dataMap;          data = -1;          data = RAND32();          dataMap[0] = data;     dataMap[1] = data;     dataMap[2] = data;     badSink(dataMap); }  #endif   #ifndef OMITGOOD     void goodG2BSink(map<int, int> dataMap);  static void goodG2B() {     int data;     map<int, int> dataMap;          data = -1;          data = CHAR_MAX-5;          dataMap[0] = data;     dataMap[1] = data;     dataMap[2] = data;     goodG2BSink(dataMap); }  void good() {     goodG2B(); }  #endif   }     #ifdef INCLUDEMAIN  using namespace rand_to_char_74;   int main(int argc, char * argv[]) {          srand( (unsigned)time(NULL) ); #ifndef OMITGOOD     printLine("Calling good()...");     good();     printLine("Finished good()"); #endif  #ifndef OMITBAD     printLine("Calling bad()...");     bad();     printLine("Finished bad()"); #endif      return 0; }  #endif                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                10          1
   #include "std_testcase.h" #include <map>  using namespace std;  namespace socket_strncpy_74 {  #ifndef OMITBAD  void badSink(map<int, short> dataMap) {          short data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          strncpy(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   #ifndef OMITGOOD   void goodG2BSink(map<int, short> dataMap) {     short data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          strncpy(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                8           1
   #include "std_testcase.h" #include <map>  using namespace std;  namespace socket_strncpy_74 {  #ifndef OMITBAD  void badSink(map<int, int> dataMap) {          int data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          strncpy(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   #ifndef OMITGOOD   void goodG2BSink(map<int, int> dataMap) {     int data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          strncpy(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                9           1
   #include "std_testcase.h" #include <map>  using namespace std;  namespace socket_memmove_74 {  #ifndef OMITBAD  void badSink(map<int, short> dataMap) {          short data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          memmove(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   #ifndef OMITGOOD   void goodG2BSink(map<int, short> dataMap) {     short data = dataMap[2];     {         char source[100];         char dest[100] = "";         memset(source, 'A', 100-1);         source[100-1] = '\0';         if (data < 100)         {                          memmove(dest, source, data);             dest[data] = '\0';          }         printLine(dest);     } }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                8           1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ..
   #include "std_testcase.h"  #include <wchar.h>  namespace wchar_t_memmove_53 {    #ifndef OMITBAD   void badSink_c(wchar_t * data);  void badSink_b(wchar_t * data) {     badSink_c(data); }  #endif   #ifndef OMITGOOD   void goodG2BSink_c(wchar_t * data);  void goodG2BSink_b(wchar_t * data) {     goodG2BSink_c(data); }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                2           1
   #include "std_testcase.h"  #include <wchar.h>  namespace wchar_t_memmove_52 {  #ifndef OMITBAD   void badSink_b(wchar_t * data);  void bad() {     wchar_t * data;     data = NULL;          data = new wchar_t[50];     wmemset(data, L'A', 50-1);      data[50-1] = L'\0';      badSink_b(data); }  #endif   #ifndef OMITGOOD   void goodG2BSink_b(wchar_t * data);   static void goodG2B() {     wchar_t * data;     data = NULL;          data = new wchar_t[100];     wmemset(data, L'A', 100-1);      data[100-1] = L'\0';      goodG2BSink_b(data); }  void good() {     goodG2B(); }  #endif   }     #ifdef INCLUDEMAIN  using namespace wchar_t_memmove_52;   int main(int argc, char * argv[]) {          srand( (unsigned)time(NULL) ); #ifndef OMITGOOD     printLine("Calling good()...");     good();     printLine("Finished good()"); #endif  #ifndef OMITBAD     printLine("Calling bad()...");     bad();     printLine("Finished bad()"); #endif      return 0; }  #endif                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3           1
   #include "std_testcase.h"  #include <wchar.h>  namespace wchar_t_memmove_52 {    #ifndef OMITBAD  void badSink_c(wchar_t * data) {     {         wchar_t dest[100];         wmemset(dest, L'C', 100-1);         dest[100-1] = L'\0';                   memmove(dest, data, wcslen(dest)*sizeof(wchar_t));         dest[100-1] = L'\0';         printWLine(dest);         delete [] data;     } }  #endif   #ifndef OMITGOOD   void goodG2BSink_c(wchar_t * data) {     {         wchar_t dest[100];         wmemset(dest, L'C', 100-1);         dest[100-1] = L'\0';                   memmove(dest, data, wcslen(dest)*sizeof(wchar_t));         dest[100-1] = L'\0';         printWLine(dest);         delete [] data;     } }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                3           1
   #include "std_testcase.h"  #include <wchar.h>  namespace wchar_t_memmove_52 {    #ifndef OMITBAD  void badSink_c(wchar_t * data) {     {         wchar_t dest[100];         wmemset(dest, L'C', 100-1);          dest[100-1] = L'\0';                   memmove(dest, data, 100*sizeof(wchar_t));                  dest[100-1] = L'\0';         printWLine(dest);              } }  #endif   #ifndef OMITGOOD   void goodG2BSink_c(wchar_t * data) {     {         wchar_t dest[100];         wmemset(dest, L'C', 100-1);          dest[100-1] = L'\0';                   memmove(dest, data, 100*sizeof(wchar_t));                  dest[100-1] = L'\0';         printWLine(dest);              } }  #endif   }                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                4           1
  #include <vector> #include "std_testcase.h"  #ifndef _WIN32 #include <wchar.h> #endif  using namespace std;  namespace t_console_w32_vsnprintf_72 {  #ifndef OMITBAD   void badSink(vector<wchar_t *> dataVector);  void bad() {     wchar_t * data;     vector<wchar_t *> dataVector;     wchar_t dataBuffer[100] = L"";     data = dataBuffer;     {                  size_t dataLen = wcslen(data);                  if (100-dataLen > 1)         {                          if (fgetws(data+dataLen, (int)(100-dataLen), stdin) != NULL)             {                                  dataLen = wcslen(data);                 if (dataLen > 0 && data[dataLen-1] == L'\n')                 {                     data[dataLen-1] = L'\0';                 }             }             else             {                 printLine("fgetws() failed");                                  data[dataLen] = L'\0';             }         }     }          dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     badSink(dataVector); }  #endif   #ifndef OMITGOOD   void goodG2BSink(vector<wchar_t *> dataVector);  static void goodG2B() {     wchar_t * data;     vector<wchar_t *> dataVector;     wchar_t dataBuffer[100] = L"";     data = dataBuffer;          wcscpy(data, L"fixedstringtest");          dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     goodG2BSink(dataVector); }   void goodB2GSink(vector<wchar_t *> dataVector);  static void goodB2G() {     wchar_t * data;     vector<wchar_t *> dataVector;     wchar_t dataBuffer[100] = L"";     data = dataBuffer;     {                  size_t dataLen = wcslen(data);                  if (100-dataLen > 1)         {                          if (fgetws(data+dataLen, (int)(100-dataLen), stdin) != NULL)             {                                  dataLen = wcslen(data);             
    if (dataLen > 0 && data[dataLen-1] == L'\n')                 {                     data[dataLen-1] = L'\0';                 }             }             else             {                 printLine("fgetws() failed");                                  data[dataLen] = L'\0';             }         }     }     dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     dataVector.insert(dataVector.end(), 1, data);     goodB2GSink(dataVector); }  void good() {     goodG2B();     goodB2G(); }  #endif   }     #ifdef INCLUDEMAIN  using namespace t_console_w32_vsnprintf_72;   int main(int argc, char * argv[]) {          srand( (unsigned)time(NULL) ); #ifndef OMITGOOD     printLine("Calling good()...");     good();     printLine("Finished good()"); #endif  #ifndef OMITBAD     printLine("Calling bad()...");     bad();     printLine("Finished bad()"); #endif      return 0; }  #endif  5           1
Length: 7628, dtype: int64
df["Category"].value_counts(normalize=True)
11    0.045884
16    0.045884
12    0.045884
5     0.045884
10    0.045884
13    0.045884
24    0.045884
18    0.045884
22    0.045752
19    0.045621
3     0.043917
4     0.043655
15    0.043655
2     0.043393
0     0.042475
23    0.042344
1     0.041426
20    0.039460
6     0.039067
17    0.038936
7     0.038542
21    0.021893
8     0.021631
9     0.020713
14    0.020451
Name: Category, dtype: float64
df.groupby('Category').describe()
Text
count unique top freq
Category
0 324 324 #include "std_testcase.h" #include <map> u... 1
1 316 316 #include "std_testcase.h" #ifndef _WIN32 #... 1
2 331 331 #ifndef OMITGOOD #include "std_testcase.h" ... 1
3 335 335 #include "std_testcase.h" #include <wchar.... 1
4 333 333 #include "std_testcase.h" #include <wchar.... 1
5 350 350 #ifndef OMITBAD #include "std_testcase.h" #... 1
6 298 298 #include "std_testcase.h" #include "max_mul... 1
7 294 294 #include "std_testcase.h" #include <map> u... 1
8 165 165 #include "std_testcase.h" namespace socket... 1
9 158 158 #include "std_testcase.h" #include <list> ... 1
10 350 350 #include "std_testcase.h" #include <vector>... 1
11 350 350 #include "std_testcase.h" #ifdef _WIN32 #d... 1
12 350 350 #include "std_testcase.h" #include <vector>... 1
13 350 350 #include "std_testcase.h" #ifndef _WIN32 #... 1
14 156 156 #ifndef OMITGOOD #include "std_testcase.h" ... 1
15 333 333 #include "std_testcase.h" #ifndef _WIN32 #... 1
16 350 350 #include "std_testcase.h" #include <wchar.... 1
17 297 297 #include "std_testcase.h" namespace int_ar... 1
18 350 350 #ifndef OMITGOOD #include "std_testcase.h" ... 1
19 348 348 #include "std_testcase.h" #include <wchar.... 1
20 301 301 #ifndef OMITGOOD #include "std_testcase.h" ... 1
21 167 167 #ifndef OMITBAD #include "std_testcase.h" #... 1
22 349 349 #include "std_testcase.h" namespace int_ma... 1
23 323 323 #include "std_testcase.h" #ifndef _WIN32 #... 1
24 350 350 #include "std_testcase.h" #include "environ... 1
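The relative frequencies and the per-category counts above describe the same class balance. The following is a minimal sketch of that relationship, assuming a hypothetical helper `normalized_counts` and a toy label list; only the three counts (7628, 350, 156) are taken from the tables above.

```python
from collections import Counter

def normalized_counts(labels):
    """Return {label: share of rows}, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}

# Toy labels (hypothetical), not the notebook's data
toy_shares = normalized_counts([11, 11, 3, 18, 11, 3])
print(toy_shares)

# Counts copied from the groupby table above
total_rows = 7628
largest_class, smallest_class = 350, 156

majority_share = largest_class / total_rows      # share of the largest categories
minority_share = smallest_class / total_rows     # share of the smallest category
imbalance_ratio = largest_class / smallest_class # how much larger the biggest class is

print(round(majority_share, 6), round(minority_share, 6), round(imbalance_ratio, 2))
```

The majority share is also the accuracy a classifier would reach by always predicting one of the largest categories, which gives a baseline for the model scores later in the notebook.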
df.head()
Text Category
0 #include "std_testcase.h" #ifdef _WIN32 #d... 11
1 #include "std_testcase.h" #include <wchar.... 3
2 #ifndef OMITGOOD #include "std_testcase.h" ... 18
3 #include "std_testcase.h" #include "environ... 24
4 #include "std_testcase.h" #include <wchar.... 4
df.shape
(7628, 2)
df.describe()
Category
count 7628.000000
mean 11.998820
std 7.343498
min 0.000000
25% 5.000000
50% 12.000000
75% 18.000000
max 24.000000
cat_df=df['Category']
cat_df.head()
0    11
1     3
2    18
3    24
4     4
Name: Category, dtype: int64
data_len=len(df[['Text',"Category"]])
data_count=df[['Text',"Category"]].count()
data_count
Text        7628
Category    7628
dtype: int64
data_missing=data_len-data_count
data_missing
Text        0
Category    0
dtype: int64
data_drop_na=df.dropna()
data_drop_na.to_csv('data_drop_na.csv')
data_drop_na.count()
Text        7628
Category    7628
dtype: int64
data_123f=data_drop_na.groupby(["Category"])['Text'].nunique()
data_123f.count()
data_123f
Category
0     324
1     316
2     331
3     335
4     333
5     350
6     298
7     294
8     165
9     158
10    350
11    350
12    350
13    350
14    156
15    333
16    350
17    297
18    350
19    348
20    301
21    167
22    349
23    323
24    350
Name: Text, dtype: int64
g = data_drop_na.groupby('Category')
data_L1_filtered = g.filter(lambda x: len(x) > 9)  # keep only categories with at least 10 samples
data_L1_filtered.count()
Text        7628
Category    7628
dtype: int64
data_filtered = data_L1_filtered

Data Visualization

def process_text_series(series, n=1):
    return series.apply(lambda x: process_text(x, n))

data_filtered['Text'] = process_text_series(data_filtered['Text'], n=1)
data_filtered['Text']

0       [includ, stdtestcaseh, ifdef, win32, defin, ba...
1       [includ, stdtestcaseh, includ, wcharh, namespa...
2       [ifndef, omitgood, includ, stdtestcaseh, inclu...
3       [includ, stdtestcaseh, includ, environmentw32s...
4       [includ, stdtestcaseh, includ, wcharh, namespa...
                              ...                        
7623    [includ, stdtestcaseh, includ, wcharh, namespa...
7624    [includ, stdtestcaseh, ifdef, win32, defin, ba...
7625    [includ, stdtestcaseh, ifndef, win32, includ, ...
7626    [includ, stdtestcaseh, includ, vector, defin, ...
7627    [includ, stdtestcaseh, includ, map, use, names...
Name: Text, Length: 7628, dtype: object
data_filtered ## not df ## nor data
Text Category
0 [includ, stdtestcaseh, ifdef, win32, defin, ba... 11
1 [includ, stdtestcaseh, includ, wcharh, namespa... 3
2 [ifndef, omitgood, includ, stdtestcaseh, inclu... 18
3 [includ, stdtestcaseh, includ, environmentw32s... 24
4 [includ, stdtestcaseh, includ, wcharh, namespa... 4
... ... ...
7623 [includ, stdtestcaseh, includ, wcharh, namespa... 16
7624 [includ, stdtestcaseh, ifdef, win32, defin, ba... 11
7625 [includ, stdtestcaseh, ifndef, win32, includ, ... 5
7626 [includ, stdtestcaseh, includ, vector, defin, ... 8
7627 [includ, stdtestcaseh, includ, map, use, names... 8

7628 rows × 2 columns

df_target = data_filtered['Category']
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
 'CWE122_Heap_Based_Buffer_Overflow',
 'CWE124_Buffer_Underwrite',
 'CWE126_Buffer_Overread',
 'CWE127_Buffer_Underread',
 'CWE134_Uncontrolled_Format_String',
 'CWE190_Integer_Overflow',
 'CWE191_Integer_Underflow',
 'CWE194_Unexpected_Sign_Extension',
 'CWE195_Signed_to_Unsigned_Conversion_Error',
 'CWE197_Numeric_Truncation_Error',
 'CWE23_Relative_Path_Traversal',
 'CWE369_Divide_by_Zero',
 'CWE36_Absolute_Path_Traversal',
 'CWE400_Resource_Exhaustion',
 'CWE401_Memory_Leak',
 'CWE415_Double_Free',
 'CWE457_Use_of_Uninitialized_Variable',
 'CWE563_Unused_Variable',
 'CWE590_Free_Memory_Not_on_Heap',
 'CWE680_Integer_Overflow_to_Buffer_Overflow',
 'CWE690_NULL_Deref_From_Return',
 'CWE762_Mismatched_Memory_Management_Routines',
 'CWE789_Uncontrolled_Mem_Alloc',
 'CWE78_OS_Command_Injection']

Text Pre-processing

# df['Text'].head(5).apply(process_text)
data_filtered['Text'].head(5).apply(process_text)
0    [includ, stdtestcaseh, ifdef, win32, defin, ba...
1    [includ, stdtestcaseh, includ, wcharh, namespa...
2    [ifndef, omitgood, includ, stdtestcaseh, inclu...
3    [includ, stdtestcaseh, includ, environmentw32s...
4    [includ, stdtestcaseh, includ, wcharh, namespa...
Name: Text, dtype: object
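The token lists above show punctuation removed and identifiers fused (for example `std_testcase.h` becomes `stdtestcaseh`). The sketch below illustrates only that punctuation-stripping and n-gram step. `simple_tokens` is a hypothetical stand-in, not the notebook's `process_text`, and it deliberately omits stemming to stay dependency-free.

```python
import string

def simple_tokens(code: str, n: int = 1) -> list:
    """Strip ASCII punctuation, lowercase, split on whitespace, build n-grams."""
    table = str.maketrans("", "", string.punctuation)
    words = code.translate(table).lower().split()
    # Join each run of n consecutive words into a single n-gram token
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = '#include "std_testcase.h"  #include <wchar.h>'
unigrams = simple_tokens(sample)
bigrams = simple_tokens(sample, n=2)
print(unigrams)
print(bigrams)
```

Because punctuation is deleted rather than replaced by spaces, expressions such as `alloca(100*sizeof(int64_t))` collapse into one long token, which is consistent with the vocabulary entries printed in the Vectorization section.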

Vectorization

%%time
# bow_transformer = CountVectorizer(analyzer=process_text).fit(df['Text'])
bow_transformer = CountVectorizer(analyzer=process_text).fit(data_filtered['Text'])

print(len(bow_transformer.vocabulary_)) ## 12766

## CPU times: user 26.5 s, sys: 84.2 ms, total: 26.6 s
## Wall time: 36 s
12766
CPU times: user 14.7 s, sys: 249 ms, total: 14.9 s
Wall time: 15 s
print(bow_transformer.get_feature_names_out()[114])
print(bow_transformer.get_feature_names_out()[783])

## alloca100sizeofint64t
## arraystructtwointsstruct54
alloca100sizeofint64t
arraystructtwointsstruct54
## ID of a term
# bow_transformer.vocabulary_['chihuahua']
bow_transformer.vocabulary_['arraystructtwointsstruct54'] ## 783
783
%%time
# text_bow = bow_transformer.transform(df['Text'])
text_bow = bow_transformer.transform(data_filtered['Text'])

## CPU times: user 21.5 s, sys: 70.8 ms, total: 21.6 s
## Wall time: 21.8 s
CPU times: user 14.6 s, sys: 238 ms, total: 14.8 s
Wall time: 14.9 s
print('Shape of Sparse Matrix: ', text_bow.shape)
print('Amount of Non-Zero occurrences: ', text_bow.nnz)
Shape of Sparse Matrix:  (7628, 12766)
Amount of Non-Zero occurrences:  316364
text_bow.shape[0] * text_bow.shape[1]
97379048
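# Note: this value is the percentage of non-zero cells (the matrix density); about 99.68% of cells are zero
sparsity = (100.0 * text_bow.nnz / (text_bow.shape[0] * text_bow.shape[1]))
print('sparsity: {}'.format(sparsity)) ## sparsity: 0.3248789205661571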
sparsity: 0.3248789205661571
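The figure printed above can be re-derived by hand from the matrix shape and the non-zero count. This is a small worked sketch using only the numbers reported in the preceding outputs; the variable names are illustrative.

```python
# Values copied from the outputs above
n_docs, n_terms = 7628, 12766
nnz = 316364

total_cells = n_docs * n_terms
density_pct = 100.0 * nnz / total_cells   # share of cells that are non-zero
sparsity_pct = 100.0 - density_pct        # share of cells that are zero
avg_terms_per_doc = nnz / n_docs          # distinct vocabulary terms per document

print(total_cells)
print(round(density_pct, 4), round(sparsity_pct, 4))
print(round(avg_terms_per_doc, 1))
```

On average each test case touches only a few dozen of the 12766 vocabulary terms, which is why a sparse matrix representation is appropriate here.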
with open('bow_transformer.pk', 'wb') as bow:
    pickle.dump(bow_transformer, bow)
%%time

tfidf_transformer = TfidfTransformer().fit(text_bow)

# ## CPU times: user 5.72 ms, sys: 941 µs, total: 6.66 ms
# ## Wall time: 13.2 ms
CPU times: user 3.43 ms, sys: 26 µs, total: 3.46 ms
Wall time: 3.47 ms
# %%time
text_tfidf = tfidf_transformer.transform(text_bow)
print(text_tfidf.shape) ## (7628, 12766)
## CPU times: user 25 ms, sys: 3.03 ms, total: 28.1 ms
## Wall time: 27.7 ms
(7628, 12766)
with open('tfidf_transformer.pk', 'wb') as tfidf_file:
    pickle.dump(tfidf_transformer, tfidf_file)
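For readers who want to see what the TF-IDF step computes, here is a dependency-free sketch of the weighting that `TfidfTransformer` applies with its default settings (smoothed IDF followed by L2 row normalisation). The count matrix is hypothetical toy data, and the names `doc_freq`, `idf` and `tfidf_row` are illustrative only.

```python
import math

# Toy term-count matrix (hypothetical): rows are documents, columns are terms
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit Euclidean length."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm else weighted

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

A term that occurs in every document (the first column) gets the minimum IDF of 1, so rarer terms carry relatively more weight within each row.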

Text Classification Models

NaiveBayes model

# %%time
# detect_model = MultinomialNB().fit(text_tfidf, df['Category'])
detect_model = MultinomialNB().fit(text_tfidf, df_target)

## CPU times: user 26.8 ms, sys: 7.02 ms, total: 33.9 ms
## Wall time: 35.4 ms
# %%time
# Predict on the same matrix the model was trained on, so the report below is in-sample
all_predictions = detect_model.predict(text_tfidf)
## CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
## Wall time: 11.9 ms
target_names = data['target_names']

# print(classification_report(df['Category'], all_predictions, target_names=data['target_names']))
print(classification_report(df_target, all_predictions, target_names=target_names))
                                              precision    recall  f1-score   support

          CWE121_Stack_Based_Buffer_Overflow       0.91      0.70      0.79       324
           CWE122_Heap_Based_Buffer_Overflow       0.87      0.66      0.75       316
                    CWE124_Buffer_Underwrite       0.63      0.85      0.72       331
                      CWE126_Buffer_Overread       0.86      0.92      0.89       335
                     CWE127_Buffer_Underread       0.92      0.77      0.83       333
           CWE134_Uncontrolled_Format_String       0.98      0.90      0.94       350
                     CWE190_Integer_Overflow       0.94      0.79      0.86       298
                    CWE191_Integer_Underflow       0.95      0.77      0.85       294
            CWE194_Unexpected_Sign_Extension       1.00      0.58      0.73       165
  CWE195_Signed_to_Unsigned_Conversion_Error       0.97      0.55      0.70       158
             CWE197_Numeric_Truncation_Error       0.70      0.90      0.79       350
               CWE23_Relative_Path_Traversal       0.91      0.98      0.95       350
                       CWE369_Divide_by_Zero       0.85      0.92      0.88       350
               CWE36_Absolute_Path_Traversal       0.83      0.91      0.87       350
                  CWE400_Resource_Exhaustion       1.00      0.79      0.89       156
                          CWE401_Memory_Leak       0.81      0.82      0.81       333
                          CWE415_Double_Free       0.75      0.76      0.76       350
        CWE457_Use_of_Uninitialized_Variable       0.94      0.97      0.96       297
                      CWE563_Unused_Variable       0.81      1.00      0.90       350
              CWE590_Free_Memory_Not_on_Heap       0.88      0.88      0.88       348
  CWE680_Integer_Overflow_to_Buffer_Overflow       0.99      0.85      0.91       301
               CWE690_NULL_Deref_From_Return       1.00      0.31      0.47       167
CWE762_Mismatched_Memory_Management_Routines       0.79      0.85      0.82       349
               CWE789_Uncontrolled_Mem_Alloc       0.57      0.97      0.72       323
                  CWE78_OS_Command_Injection       1.00      0.95      0.97       350

                                    accuracy                           0.84      7628
                                   macro avg       0.87      0.81      0.83      7628
                                weighted avg       0.86      0.84      0.84      7628

#                                               precision    recall  f1-score   support

#           CWE121_Stack_Based_Buffer_Overflow       0.91      0.71      0.80       324
#            CWE122_Heap_Based_Buffer_Overflow       0.87      0.66      0.75       316
#                     CWE124_Buffer_Underwrite       0.63      0.85      0.72       331
#                       CWE126_Buffer_Overread       0.86      0.92      0.89       335
#                      CWE127_Buffer_Underread       0.92      0.77      0.83       333
#            CWE134_Uncontrolled_Format_String       0.98      0.90      0.94       350
#                      CWE190_Integer_Overflow       0.94      0.79      0.86       298
#                     CWE191_Integer_Underflow       0.95      0.77      0.85       294
#             CWE194_Unexpected_Sign_Extension       1.00      0.59      0.74       165
#   CWE195_Signed_to_Unsigned_Conversion_Error       0.99      0.55      0.71       158
#              CWE197_Numeric_Truncation_Error       0.71      0.90      0.79       350
#                CWE23_Relative_Path_Traversal       0.91      0.98      0.95       350
#                        CWE369_Divide_by_Zero       0.83      0.91      0.87       350
#                CWE36_Absolute_Path_Traversal       0.82      0.91      0.86       350
#                   CWE400_Resource_Exhaustion       1.00      0.79      0.89       156
#                           CWE401_Memory_Leak       0.81      0.82      0.81       333
#                           CWE415_Double_Free       0.76      0.76      0.76       350
#         CWE457_Use_of_Uninitialized_Variable       0.94      0.97      0.96       297
#                       CWE563_Unused_Variable       0.81      1.00      0.89       350
#               CWE590_Free_Memory_Not_on_Heap       0.88      0.88      0.88       348
#   CWE680_Integer_Overflow_to_Buffer_Overflow       0.98      0.85      0.91       301
#                CWE690_NULL_Deref_From_Return       1.00      0.31      0.47       167
# CWE762_Mismatched_Memory_Management_Routines       0.79      0.85      0.82       349
#                CWE789_Uncontrolled_Mem_Alloc       0.57      0.97      0.72       323
#                   CWE78_OS_Command_Injection       1.00      0.95      0.97       350

#                                     accuracy                           0.84      7628
#                                    macro avg       0.87      0.81      0.83      7628
#                                 weighted avg       0.86      0.84      0.84      7628
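The f1-score column in the report is the harmonic mean of precision and recall. The sketch below recomputes it for three rows of the current report; `f1_from` is a hypothetical helper, and because the inputs are the rounded values printed above, a recomputed score could in principle differ in the last digit for other rows.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the classification report above
report_rows = {
    "CWE690_NULL_Deref_From_Return": (1.00, 0.31),
    "CWE789_Uncontrolled_Mem_Alloc": (0.57, 0.97),
    "CWE121_Stack_Based_Buffer_Overflow": (0.91, 0.70),
}

f1_values = {name: round(f1_from(p, r), 2) for name, (p, r) in report_rows.items()}
for name, value in f1_values.items():
    print(f"{name}: {value}")
```

The first two rows show opposite failure modes: CWE690 is predicted rarely but always correctly (high precision, low recall), while CWE789 is predicted too often (low precision, high recall).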

Cross Validation

# %%time
clf = MultinomialNB()
# scores = cross_val_score(clf, text_tfidf, df['Category'],  cv=8)
scores = cross_val_score(clf, text_tfidf, df_target,  cv=8)

#scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
## Accuracy: 0.74 (+/- 0.03)

# CPU times: user 156 ms, sys: 8.02 ms, total: 164 ms
# Wall time: 167 ms
Accuracy: 0.74 (+/- 0.03)
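The "(+/- ...)" figure printed above is twice the standard deviation of the per-fold accuracies. The sketch below reproduces that summary with the standard library; the eight fold scores are hypothetical placeholders, not the notebook's actual folds. `statistics.pstdev` is used because NumPy's `std()` defaults to the population form.

```python
import statistics

# Hypothetical per-fold accuracies for cv=8 (illustrative values only)
fold_scores = [0.73, 0.75, 0.74, 0.72, 0.76, 0.74, 0.73, 0.75]

mean_accuracy = statistics.fmean(fold_scores)
spread = 2 * statistics.pstdev(fold_scores)  # two population standard deviations

print("Accuracy: %0.2f (+/- %0.2f)" % (mean_accuracy, spread))
```

A small spread indicates that the score is stable across folds; it says nothing about whether the folds themselves are independent of the fitted vocabulary.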

RandomForest model

# from sklearn.ensemble import RandomForestClassifier

Cross Validation

This may take a few minutes.

%%time
clf = RandomForestClassifier()
scores = cross_val_score(clf, text_tfidf, df_target,  cv=8)
#scores
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

## Accuracy: 0.87 (+/- 0.02) ## Accuracy: 0.87 (+/- 0.01)

## CPU times: user 2min 7s, sys: 287 ms, total: 2min 7s
## Wall time: 2min 10s
Accuracy: 0.87 (+/- 0.02)
CPU times: user 2min 2s, sys: 437 ms, total: 2min 2s
Wall time: 2min 4s

Train/Test Split, MultinomialNB Classifier, Confusion Matrix Plot

Train/test split without cross-validation

start = time.time()
#classifier = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stopwords.words('english') + list(string.punctuation))),('classifier', LinearSVC(C=10))])
## train_test_split
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df['Category'], test_size=0.2, random_state=0) ## 11
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df_target, test_size=0.2, random_state=0) ## 11
# %%time
clf = MultinomialNB()

clf.fit(X_train, y_train)

end = time.time()
# print("Accuracy: " + str(clf.score(X_test, y_test))) #+ ", Time duration: " + str(end - start))
print("Accuracy: " + str(clf.score(X_test, y_test)) + ", Time duration: " + str(end - start))
## Accuracy: 0.7175622542595019, Time duration: 0.029235124588012695

## CPU times: user 25.7 ms, sys: 0 ns, total: 25.7 ms
## Wall time: 31.1 ms
Accuracy: 0.7326343381389253, Time duration: 0.038088321685791016
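In the cells above, the vocabulary and IDF weights are fitted on all rows before the split, so the held-out rows influence the features. One possible alternative, sketched here under assumptions, is to split the raw texts first and wrap the vectorizer, TF-IDF step and classifier in a scikit-learn `Pipeline` so that everything is fitted on the training portion only. The toy corpus, labels and step names are hypothetical, and the default `CountVectorizer` tokenizer stands in for the notebook's `process_text`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 0 = buffer copy snippets, label 1 = free/delete snippets
texts = [
    "char dest[50]; memcpy(dest, data, 100);",
    "wchar_t dest[100]; memmove(dest, data, wcslen(dest));",
    "char buf[10]; strcpy(buf, source);",
    "int buffer[10]; buffer[index] = 1;",
    "char dest[50]; strncpy(dest, data, strlen(data));",
    "wchar_t dest[50]; wcscpy(dest, data);",
    "data = malloc(100); free(data); free(data);",
    "data = new char[100]; delete [] data; delete [] data;",
    "free(data); printLine(data); free(data);",
    "data = new int; delete data; delete data;",
    "data = calloc(100, 1); free(data); free(data);",
    "data = new wchar_t[100]; delete [] data; delete [] data;",
]
labels = [0] * 6 + [1] * 6

# Split the raw texts first so the test rows never reach the vectorizer during fitting
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("nb", MultinomialNB()),
])
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print("Held-out accuracy:", accuracy)
```

The same `pipeline` object can be passed to `cross_val_score` with the raw texts, in which case the vocabulary is refitted inside every fold.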
## Confusion Matrix
y_pred = clf.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)
# Plot confusion_matrix
fig, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(conf_mat, annot=True, cmap = "Set3", fmt ="d",
xticklabels=data['target_names'], yticklabels=data['target_names'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
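To make the heatmap easier to read, the following sketch builds a small confusion matrix by hand with the same orientation as the plot above (rows are actual classes, columns are predicted classes) and derives per-class recall and precision from it. The label lists are hypothetical toy data.

```python
from collections import Counter

# Hypothetical true and predicted labels for a three-class problem
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2]

classes = sorted(set(y_true) | set(y_pred))
pair_counts = Counter(zip(y_true, y_pred))

# conf_matrix[i][j] = number of samples of actual class i predicted as class j
conf_matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

# Recall reads along a row; precision reads down a column
recall = [conf_matrix[i][i] / sum(conf_matrix[i]) for i in range(len(classes))]
precision = [
    conf_matrix[i][i] / sum(row[i] for row in conf_matrix)
    for i in range(len(classes))
]

for row in conf_matrix:
    print(row)
print("recall:", [round(r, 2) for r in recall])
print("precision:", [round(p, 2) for p in precision])
```

Off-diagonal cells in a row show which classes a given CWE is mistaken for, which is the main thing to look for in the 25-class heatmap.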

Model Dump

# %%time
## train a NB classifier on the entire data

# nb_model = MultinomialNB().fit(text_tfidf, df['Category'])
nb_model = MultinomialNB().fit(text_tfidf, df_target)

with open('nb_model.pk', 'wb') as nb:
    pickle.dump(nb_model, nb)

## CPU times: user 26.5 ms, sys: 4.99 ms, total: 31.5 ms
## Wall time: 33.1 ms

1st In-Sample Test-Case Foresights

data_filtered.columns ## Index(['Text', 'Category'], dtype='object')
Index(['Text', 'Category'], dtype='object')
# data_filtered['Category']
df_target
0       11
1        3
2       18
3       24
4        4
        ..
7623    16
7624    11
7625     5
7626     8
7627     8
Name: Category, Length: 7628, dtype: int64
data_filtered
Text Category
0 [includ, stdtestcaseh, ifdef, win32, defin, ba... 11
1 [includ, stdtestcaseh, includ, wcharh, namespa... 3
2 [ifndef, omitgood, includ, stdtestcaseh, inclu... 18
3 [includ, stdtestcaseh, includ, environmentw32s... 24
4 [includ, stdtestcaseh, includ, wcharh, namespa... 4
... ... ...
7623 [includ, stdtestcaseh, includ, wcharh, namespa... 16
7624 [includ, stdtestcaseh, ifdef, win32, defin, ba... 11
7625 [includ, stdtestcaseh, ifndef, win32, includ, ... 5
7626 [includ, stdtestcaseh, includ, vector, defin, ... 8
7627 [includ, stdtestcaseh, includ, map, use, names... 8

7628 rows × 2 columns

# %%time
# Specify the indices
# indices = [0,1,2]
indices = [3,4,5,6]
# indices = [7,8,9]

# Load the transformers and the model
with open("bow_transformer.pk", "rb") as f:
    bow_transf = pickle.load(f)
with open("tfidf_transformer.pk", "rb") as f:
    tfidf_transf = pickle.load(f)
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# Create a mapping from integer class labels to CWE names
# (label i corresponds to data['target_names'][i])
label_to_cwe = dict(enumerate(data['target_names']))

# Loop through the specified indices
for in_sample_top_10_cwe_index in indices:
    row = data_filtered.iloc[in_sample_top_10_cwe_index]
    test_text = row['Text']  # column 0 holds the tokens; column 1 is the integer label
    test_bow = bow_transf.transform([test_text])
    test_tfidf = tfidf_transf.transform(test_bow)
    foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
    print('-----------------------------------------')
    print(f"The predicted CWE-ID is: {foresights_id}")
    # Use the mapping to get the CWE name
    if foresights_id in label_to_cwe:
        predicted_cwe_id = label_to_cwe[foresights_id]
        print(f"The predicted CWE-ID is: {predicted_cwe_id}")
        # Check if the predicted label matches the actual label
        actual_label = row['Category']
        is_correct = (actual_label == foresights_id)
        print(f"Is the prediction correct? {is_correct}")
    else:
        print(f"The predicted label {foresights_id} is not in the training data.")

## CPU times: user 52.5 ms, sys: 4.17 ms, total: 56.7 ms
## Wall time: 103 ms
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 1
The predicted CWE-ID is: 3
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17
Is the prediction correct? False
# -----------------------------------------
# The predicted CWE-ID is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 4
# The predicted CWE-ID is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 5
# The predicted CWE-ID is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 0
# The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
data_filtered.iloc[6,1]
4
data_filtered.iloc[6,0]
# print(data_filtered.iloc[6,0])
['ifndef',
 'omitgood',
 'includ',
 'stdtestcaseh',
 'includ',
 'charloop83h',
 'namespac',
 'charloop83',
 'charloop83goodg2bcharloop83goodg2bchar',
 'datacopi',
 'data',
 'datacopi',
 'char',
 'databuff',
 'char',
 'malloc100sizeofchar',
 'memsetdatabuff',
 '1001',
 'databuffer1001',
 '0',
 'data',
 'databuff',
 'charloop83goodg2bcharloop83goodg2b',
 'sizet',
 'char',
 'dest100',
 'memsetdest',
 'c',
 '1001',
 'dest1001',
 '0',
 '0',
 '100',
 'desti',
 'datai',
 'dest1001',
 '0',
 'printlinedest',
 'endif']
## print out the actual predicted CWE-ID instead of array([3])
test_text = data_filtered.iloc[6,1]

# this is how you reload and use the BoW transformer
with open("bow_transformer.pk", "rb") as f:
    bow_transf = pickle.load(f)
test_bow = bow_transf.transform([test_text])

# this is how you reload and use the TF-IDF transformer
# remember it is applied to the result of bow_transformer
with open("tfidf_transformer.pk", "rb") as f:
    tfidf_transf = pickle.load(f)
test_tfidf = tfidf_transf.transform(test_bow)

# here we reload the saved NaiveBayes model and use it to predict the class of our test text
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# model.predict(test_tfidf) ## array([3])  ## array([13])
foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
print(f"The predicted CWE-ID is: {foresights_id}")

# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(df_target.unique())}

# Use the mapping to get the CWE-ID
if foresights_id in label_to_cwe:
    predicted_cwe_id = label_to_cwe[foresights_id]
    print(f"The predicted CWE-ID is: {predicted_cwe_id}")
else:
    print(f"The predicted label {foresights_id} is not in the training data.")
The predicted CWE-ID is: 5
The predicted CWE-ID is: 17

In-Sample & Out-of-Sample Foresights

!wget https://raw.githubusercontent.com/c6ai/temp/main/sard.zip
--2024-01-07 10:35:35--  https://raw.githubusercontent.com/c6ai/temp/main/sard.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2947907 (2.8M) [application/zip]
Saving to: ‘sard.zip’

sard.zip            100%[===================>]   2.81M  --.-KB/s    in 0.02s   

2024-01-07 10:35:36 (170 MB/s) - ‘sard.zip’ saved [2947907/2947907]

out_of_sample_df = pd.read_csv('sard.zip', compression='zip')
print(out_of_sample_df.head())
                                                code CWE-Type DataType
0  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE114     SARD
1  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE114     SARD
2  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE114     SARD
3  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE114     SARD
4  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE114     SARD
out_of_sample_df ## 52802 rows × 3 columns
code CWE-Type DataType
0 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
1 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
2 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
3 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
4 \n \n \n #include "IncludeMarker"\n \n #includ... CWE114 SARD
... ... ... ...
52797 \n \n \n #include "IncludeMarker"\n \n #define... CWE688 SARD
52798 \n \n \n #include "IncludeMarker"\n \n #define... CWE688 SARD
52799 \n \n \n #include "IncludeMarker"\n \n #define... CWE688 SARD
52800 \n \n \n #include "IncludeMarker"\n \n #define... CWE688 SARD
52801 \n \n \n #include "IncludeMarker"\n \n #define... CWE688 SARD

52802 rows × 3 columns

out_of_sample_df.columns ## Index(['code', 'CWE-Type', 'DataType'], dtype='object')
Index(['code', 'CWE-Type', 'DataType'], dtype='object')
cwe_list = list(out_of_sample_df['CWE-Type'].unique())
print(len(cwe_list)) ## 109
print(cwe_list)
109
['CWE114', 'CWE121', 'CWE122', 'CWE123', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE15', 'CWE176', 'CWE188', 'CWE190', 'CWE191', 'CWE194', 'CWE195', 'CWE196', 'CWE197', 'CWE222', 'CWE223', 'CWE226', 'CWE242', 'CWE244', 'CWE247', 'CWE252', 'CWE253', 'CWE256', 'CWE259', 'CWE272', 'CWE273', 'CWE284', 'CWE319', 'CWE321', 'CWE325', 'CWE327', 'CWE328', 'CWE338', 'CWE364', 'CWE366', 'CWE367', 'CWE369', 'CWE377', 'CWE390', 'CWE391', 'CWE398', 'CWE400', 'CWE401', 'CWE404', 'CWE415', 'CWE416', 'CWE426', 'CWE427', 'CWE457', 'CWE459', 'CWE464', 'CWE467', 'CWE468', 'CWE469', 'CWE475', 'CWE476', 'CWE478', 'CWE479', 'CWE480', 'CWE481', 'CWE482', 'CWE483', 'CWE484', 'CWE506', 'CWE510', 'CWE511', 'CWE526', 'CWE534', 'CWE535', 'CWE546', 'CWE561', 'CWE562', 'CWE563', 'CWE570', 'CWE571', 'CWE587', 'CWE588', 'CWE590', 'CWE591', 'CWE605', 'CWE606', 'CWE615', 'CWE617', 'CWE620', 'CWE665', 'CWE666', 'CWE667', 'CWE674', 'CWE675', 'CWE680', 'CWE681', 'CWE685', 'CWE690', 'CWE758', 'CWE761', 'CWE773', 'CWE775', 'CWE780', 'CWE785', 'CWE789', 'CWE78', 'CWE832', 'CWE835', 'CWE843', 'CWE90', 'CWE688']
len(data.target_names)
25
data.target_names
['CWE121_Stack_Based_Buffer_Overflow',
 'CWE122_Heap_Based_Buffer_Overflow',
 'CWE124_Buffer_Underwrite',
 'CWE126_Buffer_Overread',
 'CWE127_Buffer_Underread',
 'CWE134_Uncontrolled_Format_String',
 'CWE190_Integer_Overflow',
 'CWE191_Integer_Underflow',
 'CWE194_Unexpected_Sign_Extension',
 'CWE195_Signed_to_Unsigned_Conversion_Error',
 'CWE197_Numeric_Truncation_Error',
 'CWE23_Relative_Path_Traversal',
 'CWE369_Divide_by_Zero',
 'CWE36_Absolute_Path_Traversal',
 'CWE400_Resource_Exhaustion',
 'CWE401_Memory_Leak',
 'CWE415_Double_Free',
 'CWE457_Use_of_Uninitialized_Variable',
 'CWE563_Unused_Variable',
 'CWE590_Free_Memory_Not_on_Heap',
 'CWE680_Integer_Overflow_to_Buffer_Overflow',
 'CWE690_NULL_Deref_From_Return',
 'CWE762_Mismatched_Memory_Management_Routines',
 'CWE789_Uncontrolled_Mem_Alloc',
 'CWE78_OS_Command_Injection']
# data.target_names holds the 25 training class names, e.g. 'CWE121_Stack_Based_Buffer_Overflow'
target_names = data.target_names

# Bare CWE identifiers of the training classes, e.g. 'CWE121'.
# Comparing the full identifier avoids prefix collisions such as 'CWE78' vs 'CWE789'.
target_cwe_ids = {target.split('_')[0] for target in target_names}

# Keep the out-of-sample CWE types that also occur among the 25 training classes (22 of them)
top_22_cwe_list = [cwe_type for cwe_type in cwe_list if cwe_type in target_cwe_ids]

print(len(top_22_cwe_list)) ## 22
print(top_22_cwe_list)
## ['CWE121', 'CWE122', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE190', 'CWE191','CWE194', 'CWE195', 'CWE197',
## 'CWE369', 'CWE400', 'CWE401', 'CWE415', 'CWE457', 'CWE563', 'CWE590', 'CWE680', 'CWE690', 'CWE789', 'CWE78']
22
['CWE121', 'CWE122', 'CWE124', 'CWE126', 'CWE127', 'CWE134', 'CWE190', 'CWE191', 'CWE194', 'CWE195', 'CWE197', 'CWE369', 'CWE400', 'CWE401', 'CWE415', 'CWE457', 'CWE563', 'CWE590', 'CWE680', 'CWE690', 'CWE789', 'CWE78']
## Build top_22_cwe_testcase_samples_df: the first 'code' sample for each 'CWE-Type' in top_22_cwe_list

# Assuming out_of_sample_df is the DataFrame and top_22_cwe_list is the list
top_22_cwe_testcase_samples_df = pd.DataFrame()

for cwe in top_22_cwe_list:
    sample_df = out_of_sample_df[out_of_sample_df['CWE-Type'] == cwe].head(1)
    top_22_cwe_testcase_samples_df = pd.concat([top_22_cwe_testcase_samples_df, sample_df])
print(len(top_22_cwe_testcase_samples_df)) ## 22
print(top_22_cwe_testcase_samples_df)
22
                                                    code CWE-Type DataType
648    \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE121     SARD
6351   \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE122     SARD
10044  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE124     SARD
11868  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE126     SARD
13200  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE127     SARD
15024  \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE134     SARD
18408  \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE190     SARD
23268  \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE191     SARD
26994  \n \n \n #include "IncludeMarker"\n \n #ifdef ...   CWE194     SARD
28301  \n \n \n #include "IncludeMarker"\n \n #ifdef ...   CWE195     SARD
29627  \n \n \n #include "IncludeMarker"\n \n #ifdef ...   CWE197     SARD
33468  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE369     SARD
34891  \n \n \n #include "IncludeMarker"\n \n #ifdef ...   CWE400     SARD
35701  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE401     SARD
37325  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE415     SARD
38555  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE457     SARD
40395  \n \n #include "IncludeMarker"\n \n #ifndef OM...   CWE563     SARD
40860  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE590     SARD
43314  \n \n \n #include "IncludeMarker"\n \n #ifdef ...   CWE680     SARD
43713  \n \n \n #include "IncludeMarker"\n \n #includ...   CWE690     SARD
46126  \n \n \n #include "IncludeMarker"\n \n #ifndef...   CWE789     SARD
46666  \n \n \n #include "IncludeMarker"\n \n #includ...    CWE78     SARD
top_22_cwe_testcase_samples_df.iloc[0, 1] ## CWE121
# print(out_of_sample_df.iloc[0, 0])
cwe_testcase_sample = top_22_cwe_testcase_samples_df.iloc[0, 0]
# print(cwe_testcase_sample)

'cpp_clean_files_top_10_cwe_omitted'

!wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_clean_files_top_10_cwe_omitted.tar.gz

# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !gdown 1YQHdd457W4NjuTvJYiucKUr8pRwbGulj
--2024-01-07 10:35:38--  https://raw.githubusercontent.com/c6ai/temp/main/cpp_clean_files_top_10_cwe_omitted.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14922 (15K) [application/octet-stream]
Saving to: ‘cpp_clean_files_top_10_cwe_omitted.tar.gz’


          cpp_clean   0%[                    ]       0  --.-KB/s               
cpp_clean_files_top 100%[===================>]  14.57K  --.-KB/s    in 0s      

2024-01-07 10:35:38 (93.2 MB/s) - ‘cpp_clean_files_top_10_cwe_omitted.tar.gz’ saved [14922/14922]

# !cp -r cpp_clean_omitted_cwe_folders cpp_clean_files_top_10_cwe_omitted
# !tar -czf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted

!tar -xzf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted

Top 10 - after prepping .cpp

# Initialize an empty list to store the rows
# (named `rows` so the `data` dataset object used above is not overwritten)
rows = []

# Specify the root folder
root_folder = 'cpp_clean_files_top_10_cwe_omitted'

# Traverse the directory
for subdir, dirs, files in os.walk(root_folder):
    for file in files:
        # Check if the file is a .cpp file
        if file.endswith('.cpp'):
            # Get the CWE-ID from the subfolder name
            cwe_id = os.path.basename(subdir)

            # Open the file and read its content
            with open(os.path.join(subdir, file), 'r') as f:
                file_content = f.read()

            # Append the row to the list
            rows.append([cwe_id, file_content])

            # We only need the first .cpp file from each subfolder
            break

# Create a DataFrame
in_sample_top_10_cwe_df = pd.DataFrame(rows, columns=['CWE-ID', 'Test-Case-Code'])

# Print the DataFrame
print(in_sample_top_10_cwe_df)
                                         CWE-ID  \
0             CWE122_Heap_Based_Buffer_Overflow   
1                            CWE401_Memory_Leak   
2                 CWE36_Absolute_Path_Traversal   
3                       CWE127_Buffer_Underread   
4                 CWE23_Relative_Path_Traversal   
5            CWE121_Stack_Based_Buffer_Overflow   
6                            CWE415_Double_Free   
7  CWE762_Mismatched_Memory_Management_Routines   
8                CWE590_Free_Memory_Not_on_Heap   
9             CWE134_Uncontrolled_Format_String   

                                      Test-Case-Code  
0  \n\n\n#include "std_testcase.h"\n\n#include <w...  
1  \n\n#ifndef OMITBAD\n\n#include "std_testcase....  
2  \n\n\n#include "std_testcase.h"\n\n#ifndef _WI...  
3  \n\n\n#include "std_testcase.h"\n\n#include <w...  
4  \n\n\n#include "std_testcase.h"\n\n#ifdef _WIN...  
5  \n\n\n#include "std_testcase.h"\n#include <lis...  
6  \n\n#ifndef OMITBAD\n\n#include "std_testcase....  
7  \n\n\n#include "std_testcase.h"\n#include <lis...  
8  \n\n\n#include "std_testcase.h"\n\n#include <w...  
9  \n\n#ifndef OMITGOOD\n\n#include "std_testcase...  
in_sample_top_10_cwe_df.columns # .iloc[0,0] ## Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
# in_sample_top_10_cwe_df.iloc[0,1]
print(in_sample_top_10_cwe_df.iloc[0,1])



#include "std_testcase.h"

#include <wchar.h>


static int staticReturnsTrue()
{
    return 1;
}

static int staticReturnsFalse()
{
    return 0;
}

namespace CWE805_char_memcpy_08
{

#ifndef OMITBAD

void bad()
{
    char * data;
    data = NULL;
    if(staticReturnsTrue())
    {
        
        data = new char[50];
        data[0] = '\0'; 
    }
    {
        char source[100];
        memset(source, 'C', 100-1); 
        source[100-1] = '\0'; 
        
        memcpy(data, source, 100*sizeof(char));
        data[100-1] = '\0'; 
        printLine(data);
        delete [] data;
    }
}

#endif 

#ifndef OMITGOOD


static void goodG2B1()
{
    char * data;
    data = NULL;
    if(staticReturnsFalse())
    {
        
        printLine("Benign, fixed string");
    }
    else
    {
        
        data = new char[100];
        data[0] = '\0'; 
    }
    {
        char source[100];
        memset(source, 'C', 100-1); 
        source[100-1] = '\0'; 
        
        memcpy(data, source, 100*sizeof(char));
        data[100-1] = '\0'; 
        printLine(data);
        delete [] data;
    }
}


static void goodG2B2()
{
    char * data;
    data = NULL;
    if(staticReturnsTrue())
    {
        
        data = new char[100];
        data[0] = '\0'; 
    }
    {
        char source[100];
        memset(source, 'C', 100-1); 
        source[100-1] = '\0'; 
        
        memcpy(data, source, 100*sizeof(char));
        data[100-1] = '\0'; 
        printLine(data);
        delete [] data;
    }
}

void good()
{
    goodG2B1();
    goodG2B2();
}

#endif 

} 



#ifdef INCLUDEMAIN

using namespace CWE805_char_memcpy_08; 

int main(int argc, char * argv[])
{
    
    srand( (unsigned)time(NULL) );
#ifndef OMITGOOD
    printLine("Calling good()...");
    good();
    printLine("Finished good()");
#endif 
#ifndef OMITBAD
    printLine("Calling bad()...");
    bad();
    printLine("Finished bad()");
#endif 
    return 0;
}

#endif

in_sample_top_10_cwe_df.iloc[0,0] ## CWE122_Heap_Based_Buffer_Overflow
print(in_sample_top_10_cwe_df.shape) # (10, 2)
(10, 2)

..........

in_sample_top_10_cwe_df.columns ## Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
Index(['CWE-ID', 'Test-Case-Code'], dtype='object')
in_sample_top_10_cwe_df['CWE-ID']
0               CWE122_Heap_Based_Buffer_Overflow
1                              CWE401_Memory_Leak
2                   CWE36_Absolute_Path_Traversal
3                         CWE127_Buffer_Underread
4                   CWE23_Relative_Path_Traversal
5              CWE121_Stack_Based_Buffer_Overflow
6                              CWE415_Double_Free
7    CWE762_Mismatched_Memory_Management_Routines
8                  CWE590_Free_Memory_Not_on_Heap
9               CWE134_Uncontrolled_Format_String
Name: CWE-ID, dtype: object
in_sample_top_10_cwe_df
CWE-ID Test-Case-Code
0 CWE122_Heap_Based_Buffer_Overflow \n\n\n#include "std_testcase.h"\n\n#include <w...
1 CWE401_Memory_Leak \n\n#ifndef OMITBAD\n\n#include "std_testcase....
2 CWE36_Absolute_Path_Traversal \n\n\n#include "std_testcase.h"\n\n#ifndef _WI...
3 CWE127_Buffer_Underread \n\n\n#include "std_testcase.h"\n\n#include <w...
4 CWE23_Relative_Path_Traversal \n\n\n#include "std_testcase.h"\n\n#ifdef _WIN...
5 CWE121_Stack_Based_Buffer_Overflow \n\n\n#include "std_testcase.h"\n#include <lis...
6 CWE415_Double_Free \n\n#ifndef OMITBAD\n\n#include "std_testcase....
7 CWE762_Mismatched_Memory_Management_Routines \n\n\n#include "std_testcase.h"\n#include <lis...
8 CWE590_Free_Memory_Not_on_Heap \n\n\n#include "std_testcase.h"\n\n#include <w...
9 CWE134_Uncontrolled_Format_String \n\n#ifndef OMITGOOD\n\n#include "std_testcase...
# %%time

# Specify the indices
# indices = [0,1,2]
indices = [3,4,5,6]
# indices = [7,8,9]

# Load the transformers and the model (context managers close the files after loading)
with open("bow_transformer.pk", "rb") as f:
    bow_transf = pickle.load(f)
with open("tfidf_transformer.pk", "rb") as f:
    tfidf_transf = pickle.load(f)
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}

# Loop through the specified indices
for in_sample_top_10_cwe_index in indices:
    test_text = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index,1]
    test_bow = bow_transf.transform([test_text])
    test_tfidf = tfidf_transf.transform(test_bow)
    foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
    print('-----------------------------------------')
    print(f"The predicted CWE-ID is: {foresights_id}")
    # Use the mapping to get the CWE-ID
    if foresights_id in label_to_cwe:
        predicted_cwe_id = label_to_cwe[foresights_id]
        print(f"The predicted CWE-ID is: {predicted_cwe_id}")
        # Check if the model's prediction matches the actual CWE-ID
        actual_cwe_id = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index]['CWE-ID']
        is_correct = (actual_cwe_id == predicted_cwe_id)
        print(f"Is the prediction correct? {is_correct}")
    else:
        print(f"The predicted label {foresights_id} is not in the training data.")

## CPU times: user 52.5 ms, sys: 4.17 ms, total: 56.7 ms
## Wall time: 103 ms
-----------------------------------------
The predicted CWE-ID is: 4
The predicted CWE-ID is: CWE23_Relative_Path_Traversal
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 11
The predicted label 11 is not in the training data.
-----------------------------------------
The predicted CWE-ID is: 0
The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
Is the prediction correct? False
-----------------------------------------
The predicted CWE-ID is: 16
The predicted label 16 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID is: 4
# The predicted CWE-ID is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 5
# The predicted CWE-ID is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID is: 0
# The predicted CWE-ID is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
# # Load the transformers and the model
# bow_transf = pickle.load(open("bow_transformer.pk", "rb"))
# tfidf_transf = pickle.load(open("tfidf_transformer.pk", "rb"))
# with open('nb_model.pk', 'rb') as nb:
#     model = pickle.load(nb)

# # Create a mapping from class labels to CWE-IDs
# label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}

# # Loop through all 10 samples
# for in_sample_top_10_cwe_index in range(10):
#     test_text = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index,1]
#     test_bow = bow_transf.transform([test_text])
#     test_tfidf = tfidf_transf.transform(test_bow)
#     foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
#     print('-----------------------------------------')
#     print(f"The predicted CWE-ID index is: {foresights_id}")
#     # Use the mapping to get the CWE-ID
#     if foresights_id in label_to_cwe:
#         predicted_cwe_id = label_to_cwe[foresights_id]
#         print(f"The predicted CWE-ID label is: {predicted_cwe_id}")
#         # Check if the model's prediction matches the actual CWE-ID
#         actual_cwe_id = in_sample_top_10_cwe_df.iloc[in_sample_top_10_cwe_index]['CWE-ID']
#         is_correct = (actual_cwe_id == predicted_cwe_id)
#         print(f"Is the prediction correct? {is_correct}")
#     else:
#         print(f"The predicted label {foresights_id} is not in the training data.")
# -----------------------------------------
# The predicted CWE-ID index is: 3
# The predicted CWE-ID label is: CWE415_Double_Free
# Is the prediction correct? False
# -----------------------------------------
# The predicted CWE-ID index is: 11
# The predicted label 11 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 19
# The predicted label 19 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 4
# The predicted CWE-ID label is: CWE127_Buffer_Underread
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID index is: 5
# The predicted CWE-ID label is: CWE134_Uncontrolled_Format_String
# Is the prediction correct? True
# -----------------------------------------
# The predicted CWE-ID index is: 0
# The predicted CWE-ID label is: CWE122_Heap_Based_Buffer_Overflow
# Is the prediction correct? False
# -----------------------------------------
# The predicted CWE-ID index is: 11
# The predicted label 11 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 22
# The predicted label 22 is not in the training data.
# -----------------------------------------
# The predicted CWE-ID index is: 15
# The predicted label 15 is not in the training data.
in_sample_top_10_cwe_df.iloc[6,1]
in_sample_top_10_cwe_df.iloc[6,0]
## print out the actual predicted CWE-ID instead of array([3])

test_text = in_sample_top_10_cwe_df.iloc[6,1]

# this is how you reload and use the BoW transformer
with open("bow_transformer.pk", "rb") as f:
    bow_transf = pickle.load(f)
test_bow = bow_transf.transform([test_text])

# this is how you reload and use the TF-IDF transformer
# remember it is applied to the result of bow_transformer
with open("tfidf_transformer.pk", "rb") as f:
    tfidf_transf = pickle.load(f)
test_tfidf = tfidf_transf.transform(test_bow)

# here we reload the saved NaiveBayes model and use it to predict the class of our test text
with open('nb_model.pk', 'rb') as nb:
    model = pickle.load(nb)

# model.predict(test_tfidf) ## array([3])  ## array([13])
foresights_id = model.predict(test_tfidf)[0] # Access the first element of the array
print(f"The predicted CWE-ID is: {foresights_id}")

# Create a mapping from class labels to CWE-IDs
label_to_cwe = {i: cwe_id for i, cwe_id in enumerate(in_sample_top_10_cwe_df['CWE-ID'].unique())}

# Use the mapping to get the CWE-ID
if foresights_id in label_to_cwe:
    predicted_cwe_id = label_to_cwe[foresights_id]
    print(f"The predicted CWE-ID is: {predicted_cwe_id}")
else:
    print(f"The predicted label {foresights_id} is not in the training data.")
The predicted CWE-ID is: 16
The predicted label 16 is not in the training data.
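
The integer returned by model.predict is a class label, not a row position, so a lookup table built from the ten sample rows (keys 0 to 9) cannot decode labels such as 11 or 16. Below is a minimal sketch of an alternative lookup. It assumes, without confirmation from the cells above, that the integer labels follow the sorted order of the 25 class names listed earlier under data.target_names. TARGET_NAMES and decode_label are hypothetical names introduced only for illustration.

```python
# Class names copied from the data.target_names listing earlier in this notebook.
TARGET_NAMES = [
    'CWE121_Stack_Based_Buffer_Overflow',
    'CWE122_Heap_Based_Buffer_Overflow',
    'CWE124_Buffer_Underwrite',
    'CWE126_Buffer_Overread',
    'CWE127_Buffer_Underread',
    'CWE134_Uncontrolled_Format_String',
    'CWE190_Integer_Overflow',
    'CWE191_Integer_Underflow',
    'CWE194_Unexpected_Sign_Extension',
    'CWE195_Signed_to_Unsigned_Conversion_Error',
    'CWE197_Numeric_Truncation_Error',
    'CWE23_Relative_Path_Traversal',
    'CWE369_Divide_by_Zero',
    'CWE36_Absolute_Path_Traversal',
    'CWE400_Resource_Exhaustion',
    'CWE401_Memory_Leak',
    'CWE415_Double_Free',
    'CWE457_Use_of_Uninitialized_Variable',
    'CWE563_Unused_Variable',
    'CWE590_Free_Memory_Not_on_Heap',
    'CWE680_Integer_Overflow_to_Buffer_Overflow',
    'CWE690_NULL_Deref_From_Return',
    'CWE762_Mismatched_Memory_Management_Routines',
    'CWE789_Uncontrolled_Mem_Alloc',
    'CWE78_OS_Command_Injection',
]


def decode_label(label_id, class_names=TARGET_NAMES):
    """Map an integer class label to its name; return None if it is out of range."""
    if 0 <= label_id < len(class_names):
        return class_names[label_id]
    return None


# ASSUMPTION: label i corresponds to the i-th class name in sorted order.
for label_id in (4, 11, 0, 16):
    print(label_id, "->", decode_label(label_id))
```

If that assumption holds, the labels 4, 11, 0 and 16 printed above decode to CWE127_Buffer_Underread, CWE23_Relative_Path_Traversal, CWE121_Stack_Based_Buffer_Overflow and CWE415_Double_Free, which are the CWE-IDs of rows 3 to 6 in in_sample_top_10_cwe_df.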

.......

Baseline/Common-sense MODELs ...

# text_tfidf = tfidf_transformer.transform(text_bow)
# print(text_tfidf.shape) ## (7628, 12766)

# # Train non-leaking samples

# # text_tfidf.shape ## (7628, 12766)

from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df['Category'], test_size=0.2, random_state=0) ## 11
X_train, X_test, y_train, y_test = train_test_split(text_tfidf, df_target, test_size=0.2, random_state=0) ## 11
%%time
# text_tfidf[2:4,1000:2050].toarray()
# text_tfidf['Category'].toarray()

# test_bow = bow_transformer.transform([input_text])
test_bow = bow_transformer.transform(df['Text'])
test_data = tfidf_transformer.transform(test_bow)
# model.predict(test_data)

test_data.shape ## (7628, 12766)
## CPU times: user 16.8 s, sys: 36.6 ms, total: 16.8 s
## Wall time: 16.8 s
CPU times: user 15.1 s, sys: 257 ms, total: 15.4 s
Wall time: 15.5 s
(7628, 12766)

.............

# X_text=data_filtered.iloc[:,0] # subset of only [Description] raw text
X_text=text_tfidf ## OK!
X_text
<7628x12766 sparse matrix of type '<class 'numpy.float64'>'
	with 316364 stored elements in Compressed Sparse Row format>
# y_category=data_filtered.iloc[:,1:4] # subset of only [Category, Level_2, Level_3] Category cols
y_category=df_target  # df_cat # data['Category'] ## dataset.target
y_category
0       11
1        3
2       18
3       24
4        4
        ..
7623    16
7624    11
7625     5
7626     8
7627     8
Name: Category, Length: 7628, dtype: int64
X = text_tfidf # data['Text'] ## dataset.data #! ## ValueError: could not convert string to float: 'peopl allar ...'
# y = data['Category']  # df_cat # data['Category'] ## dataset.target
y = df_target  # df_cat # data['Category'] ## dataset.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state =0)
## Train/Test split
# train,test=train_test_split(data_drop_na,test_size=0.25,random_state=0)
# X_tr,X_te,y_tr,y_te=train_test_split(X_text,y_category,test_size=0.2,random_state=0) # test_size=0.25
X_tr,X_te,y_tr,y_te=train_test_split(X_text,y_category,test_size=0.2,random_state=0) # test_size=0.25
## train.shape ## (7977, 4) ## (7986, 16506)
# y_tr.shape ## (7970, 3) ## (7977, 3)
y_te.shape ## (1526,)
(1526,)
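
A common-sense reference point for the split above is the accuracy of always predicting the most frequent training label. The sketch below is a minimal pure-Python version. majority_baseline_accuracy is a hypothetical helper and the toy labels are made up for illustration; neither is part of the notebook or its data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Return (majority_label, accuracy) for always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Hypothetical toy labels, for illustration only (not the notebook's data).
toy_train = [11, 3, 11, 8, 11, 4]
toy_test = [11, 8, 11, 5]
label, acc = majority_baseline_accuracy(toy_train, toy_test)
print(f"majority label={label}, baseline accuracy={acc:.2f}")
```

With the split above it could be called as majority_baseline_accuracy(y_tr.tolist(), y_te.tolist()); any trained model should beat the resulting figure to be considered useful.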

Deep Learning

"""
## Setup
"""
## !pip install -q "tensorflow-text" # ==2.13.*"
# import tensorflow_text as tf_text

# ## TensorFlow backend only supports string inputs
# os.environ["KERAS_BACKEND"] = "tensorflow"
# import keras
# from keras import layers
# import tensorflow as tf

## IMPORTS & Params at the top
"""
## Download the SARD cpp_8750_files data
"""

# !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
!wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz

# data_path = keras.utils.get_file(
#     "sard.zip",
#     "https://raw.githubusercontent.com/c6ai/temp/main/sard.zip",
#     untar=True,
# )

# data_path = keras.utils.get_file(
#     "cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz",
#     "https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz",
#     untar=True,
# )
--2024-01-07 10:35:54--  https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 895869 (875K) [application/octet-stream]
Saving to: ‘cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz’


          cpp_clean   0%[                    ]       0  --.-KB/s               
cpp_cleaner_8750_fi 100%[===================>] 874.87K  --.-KB/s    in 0.006s  

2024-01-07 10:35:54 (140 MB/s) - ‘cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz’ saved [895869/895869]

## extract or un-tar (unzip):
!tar -xzf cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
!ls -lh # /content/
## total 880K
## drwxr-xr-x 27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
## -rw-r--r--  1 root root 875K Jan  2 09:29 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
total 20M
-rw-r--r--  1 root root 339K Jan  7 10:33 bow_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r--  1 root root 875K Jan  7 10:35 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan  3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r--  1 root root  15K Jan  7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
-rw-r--r--  1 root root  11M Jan  7 10:32 data_drop_na.csv
-rw-r--r--  1 root root 4.9M Jan  7 10:35 nb_model.pk
-rw-r--r--  1 root root 2.9M Jan  7 10:35 sard.zip
-rw-r--r--  1 root root 201K Jan  7 10:33 tfidf_transformer.pk
# !ls -lh /root/.keras/datasets/
# ## -rw-r--r-- 1 root root 2.3K Jan  2 08:00 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# data_path ## /root/.keras/datasets/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
data_path = 'cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'
# !ls -lh /root/.keras/datasets/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
"""
## Let's take a look at the data
"""

data_dir = pathlib.Path(data_path)  # data_path already names the extracted folder
dirnames = os.listdir(data_dir)
print("Number of directories:", len(dirnames))
print("Directory names:", dirnames)

fnames = os.listdir(data_dir / "CWE122_Heap_Based_Buffer_Overflow")
print("Number of files in CWE122_Heap_Based_Buffer_Overflow:", len(fnames))
print("Some example filenames:", fnames[:5])

# Number of directories: 25
# Directory names:
## ['CWE369_Divide_by_Zero', 'CWE762_Mismatched_Memory_Management_Routines',
## 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE36_Absolute_Path_Traversal',
## 'CWE590_Free_Memory_Not_on_Heap', 'CWE690_NULL_Deref_From_Return', 'CWE789_Uncontrolled_Mem_Alloc',
## 'CWE197_Numeric_Truncation_Error', 'CWE401_Memory_Leak', 'CWE400_Resource_Exhaustion',
## 'CWE127_Buffer_Underread', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE191_Integer_Underflow',
## 'CWE122_Heap_Based_Buffer_Overflow', 'CWE190_Integer_Overflow', 'CWE563_Unused_Variable',
## 'CWE194_Unexpected_Sign_Extension', 'CWE23_Relative_Path_Traversal', 'CWE457_Use_of_Uninitialized_Variable',
## 'CWE415_Double_Free', 'CWE126_Buffer_Overread', 'CWE134_Uncontrolled_Format_String',
## 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE78_OS_Command_Injection', 'CWE124_Buffer_Underwrite']

# Number of files in CWE122_Heap_Based_Buffer_Overflow: 316
# Some example filenames:
## ['t_memcpy_72a.cpp', 'loop_61b.cpp', 'dest_char_cat_34.cpp', 'ncpy_54b.cpp', 'ncpy_43.cpp']
Number of directories: 25
Directory names: ['CWE124_Buffer_Underwrite', 'CWE122_Heap_Based_Buffer_Overflow', 'CWE194_Unexpected_Sign_Extension', 'CWE401_Memory_Leak', 'CWE126_Buffer_Overread', 'CWE36_Absolute_Path_Traversal', 'CWE457_Use_of_Uninitialized_Variable', 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE197_Numeric_Truncation_Error', 'CWE78_OS_Command_Injection', 'CWE563_Unused_Variable', 'CWE127_Buffer_Underread', 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE23_Relative_Path_Traversal', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE369_Divide_by_Zero', 'CWE190_Integer_Overflow', 'CWE415_Double_Free', 'CWE762_Mismatched_Memory_Management_Routines', 'CWE690_NULL_Deref_From_Return', 'CWE590_Free_Memory_Not_on_Heap', 'CWE191_Integer_Underflow', 'CWE400_Resource_Exhaustion', 'CWE134_Uncontrolled_Format_String', 'CWE789_Uncontrolled_Mem_Alloc']
Number of files in CWE122_Heap_Based_Buffer_Overflow: 316
Some example filenames: ['socket_82_goodG2B.cpp', 'cpy_82a.cpp', 't_memcpy_43.cpp', 't_cpy_63b.cpp', 'snprintf_84_goodG2B.cpp']
"""
Here's an example of what one file contains:
"""

## /content/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted/CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.cpp
# print(open(data_dir / "CWE122_Heap_Based_Buffer_Overflow" / "83_goodG2B.cpp").read())

"""
As you can see, there are header lines that are leaking the file's category, either
explicitly (the first line is literally the category name), or implicitly, e.g. via the
`Organization` field. Let's get rid of the headers:
"""
# Initialize the count
count = 0

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'):
    for filename in files:
        # Count the occurrences of 'CWE' in the file name
        count += len(re.findall(r'CWE', filename))
        # Get the file path
        file_path = os.path.join(dirpath, filename)
        with open(file_path, 'r') as file:
            try:
                file_data = file.read()
                # Count the occurrences of 'CWE' in the file data
                count += len(re.findall(r'CWE', file_data))
            except UnicodeDecodeError:
                # Skip files that can't be decoded
                pass

print(count)
## 1052 ## clean
## 0 ## cleaner
0
# Get a list of all file paths and their corresponding labels
root_folder_path = 'cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'
file_paths = []
labels = []
for dirpath, dirnames, filenames in os.walk(root_folder_path):
    for filename in filenames:
        file_paths.append(os.path.join(dirpath, filename))
        labels.append(os.path.basename(dirpath))  # Use the name of the parent directory as the label

# Split the file paths into training and testing sets
file_paths_train, file_paths_test, labels_train, labels_test = train_test_split(file_paths, labels, test_size=0.2, stratify=labels)

# Function to copy files to a new directory structure
def copy_files(file_paths, labels, dest_folder):
    for file_path, label in zip(file_paths, labels):
        dest_dir = os.path.join(dest_folder, label)
        os.makedirs(dest_dir, exist_ok=True)
        shutil.copy(file_path, dest_dir)

# Copy the files to the 'train' and 'test' directories
copy_files(file_paths_train, labels_train, 'train')
copy_files(file_paths_test, labels_test, 'test')
!find train -type f | cut -d/ -f2 | sort | uniq -c
    259 CWE121_Stack_Based_Buffer_Overflow
    253 CWE122_Heap_Based_Buffer_Overflow
    265 CWE124_Buffer_Underwrite
    268 CWE126_Buffer_Overread
    266 CWE127_Buffer_Underread
    280 CWE134_Uncontrolled_Format_String
    238 CWE190_Integer_Overflow
    235 CWE191_Integer_Underflow
    132 CWE194_Unexpected_Sign_Extension
    127 CWE195_Signed_to_Unsigned_Conversion_Error
    280 CWE197_Numeric_Truncation_Error
    280 CWE23_Relative_Path_Traversal
    280 CWE369_Divide_by_Zero
    280 CWE36_Absolute_Path_Traversal
    125 CWE400_Resource_Exhaustion
    266 CWE401_Memory_Leak
    280 CWE415_Double_Free
    238 CWE457_Use_of_Uninitialized_Variable
    280 CWE563_Unused_Variable
    278 CWE590_Free_Memory_Not_on_Heap
    241 CWE680_Integer_Overflow_to_Buffer_Overflow
    134 CWE690_NULL_Deref_From_Return
    279 CWE762_Mismatched_Memory_Management_Routines
    258 CWE789_Uncontrolled_Mem_Alloc
    280 CWE78_OS_Command_Injection
def count_files_in_subfolders(root_folder):
    for dirpath, dirnames, filenames in os.walk(root_folder):
        print(f"There are {len(filenames)} files in the '{os.path.relpath(dirpath, root_folder)}' subfolder.")

# print("In the 'train' directory:")
# count_files_in_subfolders('train')

print("\nIn the 'test' directory:")
count_files_in_subfolders('test')

In the 'test' directory:
There are 0 files in the '.' subfolder.
There are 66 files in the 'CWE124_Buffer_Underwrite' subfolder.
There are 63 files in the 'CWE122_Heap_Based_Buffer_Overflow' subfolder.
There are 33 files in the 'CWE194_Unexpected_Sign_Extension' subfolder.
There are 67 files in the 'CWE401_Memory_Leak' subfolder.
There are 67 files in the 'CWE126_Buffer_Overread' subfolder.
There are 70 files in the 'CWE36_Absolute_Path_Traversal' subfolder.
There are 59 files in the 'CWE457_Use_of_Uninitialized_Variable' subfolder.
There are 31 files in the 'CWE195_Signed_to_Unsigned_Conversion_Error' subfolder.
There are 70 files in the 'CWE197_Numeric_Truncation_Error' subfolder.
There are 70 files in the 'CWE78_OS_Command_Injection' subfolder.
There are 70 files in the 'CWE563_Unused_Variable' subfolder.
There are 67 files in the 'CWE127_Buffer_Underread' subfolder.
There are 60 files in the 'CWE680_Integer_Overflow_to_Buffer_Overflow' subfolder.
There are 70 files in the 'CWE23_Relative_Path_Traversal' subfolder.
There are 65 files in the 'CWE121_Stack_Based_Buffer_Overflow' subfolder.
There are 70 files in the 'CWE369_Divide_by_Zero' subfolder.
There are 60 files in the 'CWE190_Integer_Overflow' subfolder.
There are 70 files in the 'CWE415_Double_Free' subfolder.
There are 70 files in the 'CWE762_Mismatched_Memory_Management_Routines' subfolder.
There are 33 files in the 'CWE690_NULL_Deref_From_Return' subfolder.
There are 70 files in the 'CWE590_Free_Memory_Not_on_Heap' subfolder.
There are 59 files in the 'CWE191_Integer_Underflow' subfolder.
There are 31 files in the 'CWE400_Resource_Exhaustion' subfolder.
There are 70 files in the 'CWE134_Uncontrolled_Format_String' subfolder.
There are 65 files in the 'CWE789_Uncontrolled_Mem_Alloc' subfolder.
def count_files_in_folder(root_folder):
    total_files = 0
    for dirpath, dirnames, filenames in os.walk(root_folder):
        total_files += len(filenames)
    print(f"There are {total_files} files in the '{root_folder}' directory.")

print("In the 'train' directory:")
count_files_in_folder('train')

print("\nIn the 'test' directory:")
count_files_in_folder('test')

## In the 'train' directory:
### There are 6102 files in the 'train' directory.

## In the 'test' directory:
### There are 2781 files in the 'test' directory.
In the 'train' directory:
There are 6102 files in the 'train' directory.

In the 'test' directory:
There are 1526 files in the 'test' directory.
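A per-class view can make the split easier to check than the two totals alone. The sketch below is not part of the original notebook: `files_per_class`, `held_out_fraction`, and the `CWE_A`/`CWE_B` folder names are hypothetical, and it runs on a small synthetic directory tree rather than on the real `train` and `test` folders.

```python
import os
import tempfile
from collections import Counter


def files_per_class(root_folder):
    """Return a Counter mapping each top-level subfolder name to its file count."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root_folder):
        rel = os.path.relpath(dirpath, root_folder)
        if rel == ".":
            continue  # files directly under the root carry no class label
        label = rel.split(os.sep)[0]
        counts[label] += len(filenames)
    return counts


def held_out_fraction(train_counts, test_counts):
    """Per-class share of files that ended up in the test split."""
    return {
        label: test_counts[label] / (train_counts[label] + test_counts[label])
        for label in train_counts
        if train_counts[label] + test_counts[label] > 0
    }


# Synthetic demo: 8 training files and 2 test files per (hypothetical) class.
with tempfile.TemporaryDirectory() as tmp:
    for split, n_files in (("train", 8), ("test", 2)):
        for label in ("CWE_A", "CWE_B"):
            class_dir = os.path.join(tmp, split, label)
            os.makedirs(class_dir, exist_ok=True)
            for i in range(n_files):
                with open(os.path.join(class_dir, f"{i}.cpp"), "w") as fh:
                    fh.write("// placeholder\n")
    train_counts = files_per_class(os.path.join(tmp, "train"))
    test_counts = files_per_class(os.path.join(tmp, "test"))
    fractions = held_out_fraction(train_counts, test_counts)

print(fractions)
```

Pointing the two `files_per_class` calls at the notebook's own `train` and `test` folders should show whether each class sits close to the requested `test_size=0.2`.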
## cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
# !tar -czf cpp_ready_8750_files_each_350_top_25_cwe_omitted.tar.gz cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
!tar -czf cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz train test

......

Load text

# !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
# keras.utils.text_dataset_from_directory(
#     directory,
#     labels="inferred",
#     label_mode="int",
#     class_names=None,
#     batch_size=32,
#     max_length=None,
#     shuffle=True,
#     seed=None,
#     validation_split=None,
#     subset=None,
#     follow_links=False,
# )

Predict the tag for a SARD TestCase

Download a dataset of programming TestCases from SARD (the NIST Software Assurance Reference Dataset). Each TestCase is labeled with exactly one tag (CWE-ID#121-***, CWE-ID#122-***, etc).

!mv cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
## !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
# !gdown 1j1nY2qlLnA_Iap0_ug8ZAQuDgKX1QIJB

## !gdown 1Q_P8bYpvdSEbp6NnCzfqU3lwQwxUlfE3
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz

# data_url = 'https://storage.googleapis.com/download.tensorflow.org/data/*.tar.gz'
data_url = 'https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz'

dataset_dir = utils.get_file(
    origin=data_url,
    untar=True,
    cache_dir='cache_dir',
    cache_subdir='')

dataset_dir = pathlib.Path(dataset_dir).parent
Downloading data from https://raw.githubusercontent.com/c6ai/temp/main/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
899225/899225 [==============================] - 0s 0us/step
list(dataset_dir.iterdir())
[PosixPath('/tmp/.keras/cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz'),
 PosixPath('/tmp/.keras/test'),
 PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted'),
 PosixPath('/tmp/.keras/cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz'),
 PosixPath('/tmp/.keras/train')]
train_dir = dataset_dir/'train'
list(train_dir.iterdir())
[PosixPath('/tmp/.keras/train/CWE124_Buffer_Underwrite'),
 PosixPath('/tmp/.keras/train/CWE122_Heap_Based_Buffer_Overflow'),
 PosixPath('/tmp/.keras/train/CWE194_Unexpected_Sign_Extension'),
 PosixPath('/tmp/.keras/train/CWE401_Memory_Leak'),
 PosixPath('/tmp/.keras/train/CWE126_Buffer_Overread'),
 PosixPath('/tmp/.keras/train/CWE36_Absolute_Path_Traversal'),
 PosixPath('/tmp/.keras/train/CWE457_Use_of_Uninitialized_Variable'),
 PosixPath('/tmp/.keras/train/CWE195_Signed_to_Unsigned_Conversion_Error'),
 PosixPath('/tmp/.keras/train/CWE197_Numeric_Truncation_Error'),
 PosixPath('/tmp/.keras/train/CWE78_OS_Command_Injection'),
 PosixPath('/tmp/.keras/train/CWE563_Unused_Variable'),
 PosixPath('/tmp/.keras/train/CWE127_Buffer_Underread'),
 PosixPath('/tmp/.keras/train/CWE680_Integer_Overflow_to_Buffer_Overflow'),
 PosixPath('/tmp/.keras/train/CWE23_Relative_Path_Traversal'),
 PosixPath('/tmp/.keras/train/CWE121_Stack_Based_Buffer_Overflow'),
 PosixPath('/tmp/.keras/train/CWE369_Divide_by_Zero'),
 PosixPath('/tmp/.keras/train/CWE190_Integer_Overflow'),
 PosixPath('/tmp/.keras/train/CWE415_Double_Free'),
 PosixPath('/tmp/.keras/train/CWE762_Mismatched_Memory_Management_Routines'),
 PosixPath('/tmp/.keras/train/CWE690_NULL_Deref_From_Return'),
 PosixPath('/tmp/.keras/train/CWE590_Free_Memory_Not_on_Heap'),
 PosixPath('/tmp/.keras/train/CWE191_Integer_Underflow'),
 PosixPath('/tmp/.keras/train/CWE400_Resource_Exhaustion'),
 PosixPath('/tmp/.keras/train/CWE134_Uncontrolled_Format_String'),
 PosixPath('/tmp/.keras/train/CWE789_Uncontrolled_Mem_Alloc')]

The train/CWE-ID#121-***, train/CWE-ID#122-***, ... directories contain many text files, each of which is a SARD TestCase.

## ValueError: No text files found in directory /tmp/.keras/train. Allowed format: .txt
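import glob

def change_file_extension(root_folder, old_ext, new_ext):
    # Recursively rename every file ending in old_ext so that it ends in new_ext.
    pattern = os.path.join(root_folder, '**', '*' + old_ext)
    for filename in glob.iglob(pattern, recursive=True):
        base = os.path.splitext(filename)[0]
        os.rename(filename, base + new_ext)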

change_file_extension('/tmp/.keras/', '.cpp', '.txt')
# !ls /tmp/.keras/train/CWE122_Heap_Based_Buffer_Overflow
## print(open(data_dir / "CWE122_Heap_Based_Buffer_Overflow" / "83_goodG2B.cpp").read())

# sample_file = train_dir/'python/1755.txt'
# sample_file = train_dir/'CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.cpp'
sample_file = train_dir/'CWE122_Heap_Based_Buffer_Overflow/83_goodG2B.txt'

with open(sample_file) as f:
  print(f.read())


#ifndef OMITGOOD

#include "std_testcase.h"
#include "83.h"

namespace 83
{
83_goodG2B::83_goodG2B(int dataCopy)
{
    data = dataCopy;
    
    data = 7;
}

83_goodG2B::~83_goodG2B()
{
    {
        int i;
        int * buffer = new int[10];
        
        for (i = 0; i < 10; i++)
        {
            buffer[i] = 0;
        }
        
        if (data >= 0)
        {
            buffer[data] = 1;
            
            for(i = 0; i < 10; i++)
            {
                printIntLine(buffer[i]);
            }
        }
        else
        {
            printLine("ERROR: Array index is negative.");
        }
        delete[] buffer;
    }
}
}
#endif 

Load the dataset

The tf.keras.utils.text_dataset_from_directory API expects a directory structure as follows:

train/
...CWE-ID#121-***/
......1.txt
......2.txt
...CWE-ID#122-***/
......1.txt
......2.txt
...CWE-ID#1xx-***/
......1.txt
......2.txt
...CWE-ID#1x-***n/
......1.txt
......2.txt
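# batch_size and seed are assumed to be defined earlier in the notebook;
# these self-assignments only restate them for this cell.
batch_size = batch_size
seed = seed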

raw_train_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

## Found 6102 files belonging to 25 classes.
## Using 4882 files for training.
Found 6102 files belonging to 25 classes.
Using 4882 files for training.
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("TestCase: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])
TestCase:  b'\n\n#ifndef OMITGOOD\n\n#include "std_testcase.h"\n#include "file_snprintf_81.h"\n\n#ifdef _WIN32\n#define SNPRINTF _snprintf\n#else\n#define SNPRINTF snprintf\n#endif\n\nnamespace file_snprintf_81\n{\n\nvoid file_snprintf_81_goodB2G::action(char * data) const\n{\n    {\n        char dest[100] = "";\n        \n        SNPRINTF(dest, 100-1, "%s", data);\n        printLine(dest);\n    }\n}\n\n}\n#endif \n'
Label: 5
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#ifdef _WIN32\n#define BASEPATH "c:\\\\temp\\\\"\n#else\n#include <wchar.h>\n#define BASEPATH "/tmp/"\n#endif\n\n#ifdef _WIN32\n#include <winsock2.h>\n#include <windows.h>\n#include <direct.h>\n#pragma comment(lib, "ws2_32") \n#define CLOSE_SOCKET closesocket\n#else \n#include <sys/types.h>\n#include <sys/socket.h>\n#include <netinet/in.h>\n#include <arpa/inet.h>\n#include <unistd.h>\n#define INVALID_SOCKET -1\n#define SOCKET_ERROR -1\n#define CLOSE_SOCKET close\n#define SOCKET int\n#endif\n\n#define TCP_PORT 27015\n#define IP_ADDRESS "127.0.0.1"\n\n\nnamespace connect_socket_w32CreateFile_53\n{\n\n#ifndef OMITBAD\n\n\nvoid badSink_b(char * data);\n\nvoid bad()\n{\n    char * data;\n    char dataBuffer[FILENAME_MAX] = BASEPATH;\n    data = dataBuffer;\n    {\n#ifdef _WIN32\n        WSADATA wsaData;\n        int wsaDataInit = 0;\n#endif\n        int recvResult;\n        struct sockaddr_in service;\n        char *replace;\n        SOCKET connectSocket = INVALID_SOCKET;\n        size_t dataLen = strlen(data);\n        do\n        {\n#ifdef _WIN32\n            if (WSAStartup(MAKEWORD(2,2), &wsaData) != NO_ERROR)\n            {\n                break;\n            }\n            wsaDataInit = 1;\n#endif\n            \n            connectSocket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);\n            if (connectSocket == INVALID_SOCKET)\n            {\n                break;\n            }\n            memset(&service, 0, sizeof(service));\n            service.sin_family = AF_INET;\n            service.sin_addr.s_addr = inet_addr(IP_ADDRESS);\n            service.sin_port = htons(TCP_PORT);\n            if (connect(connectSocket, (struct sockaddr*)&service, sizeof(service)) == SOCKET_ERROR)\n            {\n                break;\n            }\n            \n            \n            recvResult = recv(connectSocket, (char *)(data + dataLen), sizeof(char) * (FILENAME_MAX - dataLen - 1), 0);\n            if (recvResult == 
SOCKET_ERROR || recvResult == 0)\n            {\n                break;\n            }\n            \n            data[dataLen + recvResult / sizeof(char)] = \'\\0\';\n            \n            replace = strchr(data, \'\\r\');\n            if (replace)\n            {\n                *replace = \'\\0\';\n            }\n            replace = strchr(data, \'\\n\');\n            if (replace)\n            {\n                *replace = \'\\0\';\n            }\n        }\n        while (0);\n        if (connectSocket != INVALID_SOCKET)\n        {\n            CLOSE_SOCKET(connectSocket);\n        }\n#ifdef _WIN32\n        if (wsaDataInit)\n        {\n            WSACleanup();\n        }\n#endif\n    }\n    badSink_b(data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink_b(char * data);\n\n\nstatic void goodG2B()\n{\n    char * data;\n    char dataBuffer[FILENAME_MAX] = BASEPATH;\n    data = dataBuffer;\n    \n    strcat(data, "file.txt");\n    goodG2BSink_b(data);\n}\n\nvoid good()\n{\n    goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace connect_socket_w32CreateFile_53; \n\nint main(int argc, char * argv[])\n{\n    \n    srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n    printLine("Calling good()...");\n    good();\n    printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n    printLine("Calling bad()...");\n    bad();\n    printLine("Finished bad()");\n#endif \n    return 0;\n}\n\n#endif\n'
Label: 11
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\nnamespace char_alloca_65\n{\n\n#ifndef OMITBAD\n\n\nvoid badSink(char * data);\n\nvoid bad()\n{\n    char * data;\n    \n    void (*funcPtr) (char *) = badSink;\n    data = NULL; \n    {\n        \n        char * dataBuffer = (char *)ALLOCA(sizeof(char));\n        *dataBuffer = \'A\';\n        data = dataBuffer;\n    }\n    \n    funcPtr(data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink(char * data);\n\nstatic void goodG2B()\n{\n    char * data;\n    void (*funcPtr) (char *) = goodG2BSink;\n    data = NULL; \n    {\n        \n        char * dataBuffer = new char;\n        *dataBuffer = \'A\';\n        data = dataBuffer;\n    }\n    funcPtr(data);\n}\n\nvoid good()\n{\n    goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace char_alloca_65; \n\nint main(int argc, char * argv[])\n{\n    \n    srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n    printLine("Calling good()...");\n    good();\n    printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n    printLine("Calling bad()...");\n    bad();\n    printLine("Finished bad()");\n#endif \n    return 0;\n}\n\n#endif\n'
Label: 19
TestCase:  b'\n\n\n#include "std_testcase.h"\n\nnamespace fscanf_62\n{\n\n#ifndef OMITBAD\n\nvoid badSource(int &data)\n{\n    \n    fscanf(stdin, "%d", &data);\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSource(int &data)\n{\n    \n    data = 20;\n}\n\n#endif \n\n} \n'
Label: 20
TestCase:  b'\n\n#ifndef OMITBAD\n\n#include "std_testcase.h"\n#include "listen_socket_square_81.h"\n\n#include <math.h>\n\nnamespace listen_socket_square_81\n{\n\nvoid listen_socket_square_81_bad::action(int data) const\n{\n    {\n        \n        int result = data * data;\n        printIntLine(result);\n    }\n}\n\n}\n#endif \n'
Label: 6
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\n\nstatic const int STATIC_CONST_FIVE = 5;\n\nnamespace class_placement_new_06\n{\n\n#ifndef OMITBAD\n\nvoid bad()\n{\n    TwoIntsClass * data;\n    data = NULL; \n    if(STATIC_CONST_FIVE==5)\n    {\n        {\n            \n            char buffer[sizeof(TwoIntsClass)];\n            TwoIntsClass * dataBuffer = new(buffer) TwoIntsClass;\n            dataBuffer->intOne = 2;\n            dataBuffer->intTwo = 2;\n            data = dataBuffer;\n        }\n    }\n    printIntLine(data->intOne);\n    \n    delete data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2B1()\n{\n    TwoIntsClass * data;\n    data = NULL; \n    if(STATIC_CONST_FIVE!=5)\n    {\n        \n        printLine("Benign, fixed string");\n    }\n    else\n    {\n        {\n            \n            TwoIntsClass * dataBuffer = new TwoIntsClass;\n            dataBuffer->intOne = 2;\n            dataBuffer->intTwo = 2;\n            data = dataBuffer;\n        }\n    }\n    printIntLine(data->intOne);\n    \n    delete data;\n}\n\n\nstatic void goodG2B2()\n{\n    TwoIntsClass * data;\n    data = NULL; \n    if(STATIC_CONST_FIVE==5)\n    {\n        {\n            \n            TwoIntsClass * dataBuffer = new TwoIntsClass;\n            dataBuffer->intOne = 2;\n            dataBuffer->intTwo = 2;\n            data = dataBuffer;\n        }\n    }\n    printIntLine(data->intOne);\n    \n    delete data;\n}\n\nvoid good()\n{\n    goodG2B1();\n    goodG2B2();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace class_placement_new_06; \n\nint main(int argc, char * argv[])\n{\n    \n    srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n    printLine("Calling good()...");\n    good();\n    printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n    printLine("Calling bad()...");\n    bad();\n    printLine("Finished bad()");\n#endif \n    return 0;\n}\n\n#endif\n'
Label: 19
TestCase:  b'\n\n#ifndef OMITBAD\n\n#include "std_testcase.h"\n#include "array_free_struct_81.h"\n\nnamespace array_free_struct_81\n{\n\nvoid array_free_struct_81_bad::action(twoIntsStruct * data) const\n{\n    \n    free(data);\n}\n\n}\n#endif \n'
Label: 22
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#ifdef _WIN32\n#define BASEPATH L"c:\\\\temp\\\\"\n#else\n#include <wchar.h>\n#define BASEPATH L"/tmp/"\n#endif\n\n#ifdef _WIN32\n#define FILENAME "C:\\\\temp\\\\file.txt"\n#else\n#define FILENAME "/tmp/file.txt"\n#endif\n\n#ifdef _WIN32\n#define FOPEN _wfopen\n#else\n#define FOPEN fopen\n#endif\n\nnamespace t_file_fopen_45\n{\n\nstatic wchar_t * badData;\nstatic wchar_t * goodG2BData;\n\n#ifndef OMITBAD\n\nstatic void badSink()\n{\n    wchar_t * data = badData;\n    {\n        FILE *pFile = NULL;\n        \n        pFile = FOPEN(data, L"wb+");\n        if (pFile != NULL)\n        {\n            fclose(pFile);\n        }\n    }\n}\n\nvoid bad()\n{\n    wchar_t * data;\n    wchar_t dataBuffer[FILENAME_MAX] = BASEPATH;\n    data = dataBuffer;\n    {\n        \n        size_t dataLen = wcslen(data);\n        FILE * pFile;\n        \n        if (FILENAME_MAX-dataLen > 1)\n        {\n            pFile = fopen(FILENAME, "r");\n            if (pFile != NULL)\n            {\n                \n                if (fgetws(data+dataLen, (int)(FILENAME_MAX-dataLen), pFile) == NULL)\n                {\n                    printLine("fgetws() failed");\n                    \n                    data[dataLen] = L\'\\0\';\n                }\n                fclose(pFile);\n            }\n        }\n    }\n    badData = data;\n    badSink();\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2BSink()\n{\n    wchar_t * data = goodG2BData;\n    {\n        FILE *pFile = NULL;\n        \n        pFile = FOPEN(data, L"wb+");\n        if (pFile != NULL)\n        {\n            fclose(pFile);\n        }\n    }\n}\n\nstatic void goodG2B()\n{\n    wchar_t * data;\n    wchar_t dataBuffer[FILENAME_MAX] = BASEPATH;\n    data = dataBuffer;\n    \n    wcscat(data, L"file.txt");\n    goodG2BData = data;\n    goodG2BSink();\n}\n\nvoid good()\n{\n    goodG2B();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace t_file_fopen_45; 
\n\nint main(int argc, char * argv[])\n{\n    \n    srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n    printLine("Calling good()...");\n    good();\n    printLine("Finished good()");\n#endif \n#ifndef OMITBAD\n    printLine("Calling bad()...");\n    bad();\n    printLine("Finished bad()");\n#endif \n    return 0;\n}\n\n#endif\n'
Label: 11
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\nnamespace char_memcpy_22\n{\n\n#ifndef OMITBAD\n\n\nextern int badGlobal;\n\nchar * badSource(char * data)\n{\n    if(badGlobal)\n    {\n        {\n            char * dataBuffer = new char[100];\n            memset(dataBuffer, \'A\', 100-1);\n            dataBuffer[100-1] = \'\\0\';\n            \n            data = dataBuffer - 8;\n        }\n    }\n    return data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nextern int goodG2B1Global;\nextern int goodG2B2Global;\n\n\nchar * goodG2B1Source(char * data)\n{\n    if(goodG2B1Global)\n    {\n        \n        printLine("Benign, fixed string");\n    }\n    else\n    {\n        {\n            char * dataBuffer = new char[100];\n            memset(dataBuffer, \'A\', 100-1);\n            dataBuffer[100-1] = \'\\0\';\n            \n            data = dataBuffer;\n        }\n    }\n    return data;\n}\n\n\nchar * goodG2B2Source(char * data)\n{\n    if(goodG2B2Global)\n    {\n        {\n            char * dataBuffer = new char[100];\n            memset(dataBuffer, \'A\', 100-1);\n            dataBuffer[100-1] = \'\\0\';\n            \n            data = dataBuffer;\n        }\n    }\n    return data;\n}\n\n#endif \n\n} \n'
Label: 4
TestCase:  b'\n\n\n#include "std_testcase.h"\n\n#include <wchar.h>\n\n\nstatic int staticFive = 5;\n\nnamespace array_int64_t_static_07\n{\n\n#ifndef OMITBAD\n\nvoid bad()\n{\n    int64_t * data;\n    data = NULL; \n    if(staticFive==5)\n    {\n        {\n            \n            static int64_t dataBuffer[100];\n            {\n                size_t i;\n                for (i = 0; i < 100; i++)\n                {\n                    dataBuffer[i] = 5LL;\n                }\n            }\n            data = dataBuffer;\n        }\n    }\n    printLongLongLine(data[0]);\n    \n    delete [] data;\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nstatic void goodG2B1()\n{\n    int64_t * data;\n    data = NULL; \n    if(staticFive!=5)\n    {\n        \n        printLine("Benign, fixed string");\n    }\n    else\n    {\n        {\n            \n            int64_t * dataBuffer = new int64_t[100];\n            {\n                size_t i;\n                for (i = 0; i < 100; i++)\n                {\n                    dataBuffer[i] = 5LL;\n                }\n            }\n            data = dataBuffer;\n        }\n    }\n    printLongLongLine(data[0]);\n    \n    delete [] data;\n}\n\n\nstatic void goodG2B2()\n{\n    int64_t * data;\n    data = NULL; \n    if(staticFive==5)\n    {\n        {\n            \n            int64_t * dataBuffer = new int64_t[100];\n            {\n                size_t i;\n                for (i = 0; i < 100; i++)\n                {\n                    dataBuffer[i] = 5LL;\n                }\n            }\n            data = dataBuffer;\n        }\n    }\n    printLongLongLine(data[0]);\n    \n    delete [] data;\n}\n\nvoid good()\n{\n    goodG2B1();\n    goodG2B2();\n}\n\n#endif \n\n} \n\n\n\n#ifdef INCLUDEMAIN\n\nusing namespace array_int64_t_static_07; \n\nint main(int argc, char * argv[])\n{\n    \n    srand( (unsigned)time(NULL) );\n#ifndef OMITGOOD\n    printLine("Calling good()...");\n    good();\n    printLine("Finished good()");\n#endif 
\n#ifndef OMITBAD\n    printLine("Calling bad()...");\n    bad();\n    printLine("Finished bad()");\n#endif \n    return 0;\n}\n\n#endif\n'
Label: 19
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)
Label 0 corresponds to CWE121_Stack_Based_Buffer_Overflow
Label 1 corresponds to CWE122_Heap_Based_Buffer_Overflow
Label 2 corresponds to CWE124_Buffer_Underwrite
Label 3 corresponds to CWE126_Buffer_Overread
Label 4 corresponds to CWE127_Buffer_Underread
Label 5 corresponds to CWE134_Uncontrolled_Format_String
Label 6 corresponds to CWE190_Integer_Overflow
Label 7 corresponds to CWE191_Integer_Underflow
Label 8 corresponds to CWE194_Unexpected_Sign_Extension
Label 9 corresponds to CWE195_Signed_to_Unsigned_Conversion_Error
Label 10 corresponds to CWE197_Numeric_Truncation_Error
Label 11 corresponds to CWE23_Relative_Path_Traversal
Label 12 corresponds to CWE369_Divide_by_Zero
Label 13 corresponds to CWE36_Absolute_Path_Traversal
Label 14 corresponds to CWE400_Resource_Exhaustion
Label 15 corresponds to CWE401_Memory_Leak
Label 16 corresponds to CWE415_Double_Free
Label 17 corresponds to CWE457_Use_of_Uninitialized_Variable
Label 18 corresponds to CWE563_Unused_Variable
Label 19 corresponds to CWE590_Free_Memory_Not_on_Heap
Label 20 corresponds to CWE680_Integer_Overflow_to_Buffer_Overflow
Label 21 corresponds to CWE690_NULL_Deref_From_Return
Label 22 corresponds to CWE762_Mismatched_Memory_Management_Routines
Label 23 corresponds to CWE789_Uncontrolled_Mem_Alloc
Label 24 corresponds to CWE78_OS_Command_Injection
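The mapping above puts CWE23 after CWE197 and CWE369 before CWE36, which is what plain string sorting of the directory names would give. The sketch below reproduces that ordering for five of the class names with Python's `sorted`; it is an illustration of the ordering, not the Keras implementation, and `cwe_number` is a hypothetical helper added only to contrast with a numeric sort.

```python
import re

class_dirs = [
    "CWE400_Resource_Exhaustion",
    "CWE36_Absolute_Path_Traversal",
    "CWE23_Relative_Path_Traversal",
    "CWE369_Divide_by_Zero",
    "CWE197_Numeric_Truncation_Error",
]

# String order compares character by character, so "CWE197" < "CWE23"
# and "CWE369" < "CWE36_" (the digit '9' sorts before the underscore).
string_order = sorted(class_dirs)


def cwe_number(name):
    """Hypothetical helper: extract the numeric CWE id from a folder name."""
    return int(re.match(r"CWE(\d+)_", name).group(1))


# Numeric order, shown only for contrast with the mapping printed above.
numeric_order = sorted(class_dirs, key=cwe_number)

for position, name in enumerate(string_order):
    print(position, name)
```

Because the integer labels follow string order rather than CWE number, any later report that maps predictions back to CWE ids should go through `class_names` instead of assuming numeric order.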
# Create a validation set.
raw_val_ds = utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)
## Found 6102 files belonging to 25 classes.
## Using 1220 files for validation.
Found 6102 files belonging to 25 classes.
Using 1220 files for validation.
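The two subset sizes reported above (4882 for training, 1220 for validation) are consistent with truncating `validation_split * total` to an integer and giving the remainder to training. The arithmetic below checks that against the printed numbers; the truncation rule is an assumption here, not something the notebook states.

```python
total_files = 6102        # "Found 6102 files belonging to 25 classes."
validation_split = 0.2    # value passed to text_dataset_from_directory

# Assumption: the validation size is the truncated product, the rest is training.
num_val = int(validation_split * total_files)
num_train = total_files - num_val

print("training files:  ", num_train)
print("validation files:", num_val)
```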
test_dir = dataset_dir/'test'

# Create a test set.
raw_test_ds = utils.text_dataset_from_directory(
    test_dir,
    batch_size=batch_size)
## Found 2781 files belonging to 25 classes.
Found 2781 files belonging to 25 classes.
raw_train_ds = raw_train_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_val_ds = raw_val_ds.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
raw_test_ds = raw_test_ds.prefetch(buffer_size=tf.data.AUTOTUNE)

Prepare the dataset for training

# VOCAB_SIZE is assumed to be defined earlier in the notebook.
VOCAB_SIZE = VOCAB_SIZE

# binary_vectorize_layer = TextVectorization(
multi_class_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='binary')
# MAX_SEQUENCE_LENGTH is assumed to be defined earlier in the notebook.
MAX_SEQUENCE_LENGTH = MAX_SEQUENCE_LENGTH

int_vectorize_layer = TextVectorization(
    max_tokens=VOCAB_SIZE,
    output_mode='int',
    output_sequence_length=MAX_SEQUENCE_LENGTH)
# %%time
# Make a text-only dataset (without labels), then call `TextVectorization.adapt`.
train_text = raw_train_ds.map(lambda text, labels: text)
multi_class_vectorize_layer.adapt(train_text)
int_vectorize_layer.adapt(train_text)
## CPU times: user 1.2 s, sys: 212 ms, total: 1.42 s
## Wall time: 981 ms
text_batch, label_batch = next(iter(raw_train_ds))
first_TestCase, first_label = text_batch[0], label_batch[0]
print("TestCase:", first_TestCase)
print("Label:", first_label)

## TestCase: tf.Tensor(b'\n\n\n#include "std_testcase.h"\n#include <vector>\n\nusing namespace std;\n\nnamespace t_rand_multiply_72 ...
## Label: tf.Tensor(7, shape=(), dtype=int32)
TestCase: tf.Tensor(b'\n\n\n#include "std_testcase.h"\n\n#ifndef _WIN32\n#include <wchar.h>\n#endif\n\n#define ENV_VARIABLE L"ADD"\n\n#ifdef _WIN32\n#define GETENV _wgetenv\n#else\n#define GETENV getenv\n#endif\n\n#ifdef _WIN32\n#define OPEN _wopen\n#define CLOSE _close\n#else\n#include <unistd.h>\n#define OPEN open\n#define CLOSE close\n#endif\n\nnamespace t_environment_open_54\n{\n\n\n\n#ifndef OMITBAD\n\nvoid badSink_e(wchar_t * data)\n{\n    {\n        int fileDesc;\n        \n        fileDesc = OPEN(data, O_RDWR|O_CREAT, S_IREAD|S_IWRITE);\n        if (fileDesc != -1)\n        {\n            CLOSE(fileDesc);\n        }\n    }\n}\n\n#endif \n\n#ifndef OMITGOOD\n\n\nvoid goodG2BSink_e(wchar_t * data)\n{\n    {\n        int fileDesc;\n        \n        fileDesc = OPEN(data, O_RDWR|O_CREAT, S_IREAD|S_IWRITE);\n        if (fileDesc != -1)\n        {\n            CLOSE(fileDesc);\n        }\n    }\n}\n\n#endif \n\n} \n', shape=(), dtype=string)
Label: tf.Tensor(13, shape=(), dtype=int32)
# print("'binary' vectorized TestCase:",
#       list(multi_class_vectorize_layer(first_TestCase).numpy()))

plt.plot(multi_class_vectorize_layer(first_TestCase).numpy())
plt.xlim(0,1000)

(0.0, 1000.0)

print("'int' vectorized TestCase:",
      int_vectorize_layer(first_TestCase).numpy())
'int' vectorized TestCase: [   5   24    6   26    5   47    3   14  375  534   18   26   14  166
  531   30   14  166  166    3   18   26   14  209  562   14   77   77
   30    5   93   14  209  209   14   77   77    3   12 2758    6   16
    4  807    2   10  178  178  366  365  363   13  178   19  369    3
    6   15    4  797    2   10  178  178  366  365  363   13  178   19
  369    3    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]
# 'int' vectorized TestCase: [   5   24    5  226   29   12   71   12 1667    6   16    4  787   74
#    69    2  218  299    7   69  120    2   83  465    3    6   15    4
#   851   74   69    2  218  299    7   69  120    2   83  465    4  854
#    74   69    2  218  299    7   13    2  931   69  120    2   83  465
#    30  151  315   91  179  642  287  316  644    3    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0    0    0
#     0    0    0    0    0    0    0    0    0    0    0    0]
print("1289 ---> ", int_vectorize_layer.get_vocabulary()[1289])
print("313 ---> ", int_vectorize_layer.get_vocabulary()[313])
print("Vocabulary size: {}".format(len(int_vectorize_layer.get_vocabulary())))

1289 --->  wchartmemcpy81base
313 --->  6
Vocabulary size: 9203

Baseline-beating CNN Model

# epochs=2 #10
# new_num_labels=25
# binary_model = tf.keras.Sequential([
clf_model = tf.keras.Sequential([
    # binary_vectorize_layer,
    multi_class_vectorize_layer,
    # layers.Dense(4)]) #25
    layers.Dense(new_num_labels)])

clf_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])
tf.keras.utils.plot_model(clf_model, show_shapes=True)
# ValueError: This model has not yet been built. Build the model first by calling `build()` or by calling the model on a batch of data.
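
The loss used above, sparse categorical cross-entropy with `from_logits=True`, can be reproduced by hand for a single example. The sketch below is a plain-Python illustration of the formula (negative log of the softmax probability of the true class), not the Keras implementation; the function name is hypothetical.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Return -log(softmax(logits)[label]) using a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


NUM_CLASSES = 25

# A model that scores every class equally: loss equals ln(25).
uniform_loss = sparse_categorical_crossentropy_from_logits(
    [0.0] * NUM_CLASSES, label=7)

# A model that strongly favours the correct class: loss close to zero.
confident_logits = [10.0 if i == 7 else 0.0 for i in range(NUM_CLASSES)]
confident_loss = sparse_categorical_crossentropy_from_logits(
    confident_logits, label=7)

print("uniform guess:", uniform_loss)
print("confident and correct:", confident_loss)
```

The uniform case gives ln 25, roughly 3.22, which is a useful reference point when reading the first-epoch losses in the training logs: a loss near that value means the model is still close to guessing.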

%%time
bin_history = clf_model.fit(
    raw_train_ds, validation_data=raw_val_ds, epochs=epochs)
## CPU times: user 3.55 s, sys: 467 ms, total: 4.01 s
## Wall time: 2.21 s

## CPU times: user 13.7 s, sys: 1.73 s, total: 15.4 s
## Wall time: 16.3 s ## 7.31 s
Epoch 1/15
153/153 [==============================] - 2s 11ms/step - loss: 2.5163 - accuracy: 0.5401 - val_loss: 1.9911 - val_accuracy: 0.7434
Epoch 2/15
153/153 [==============================] - 1s 9ms/step - loss: 1.6466 - accuracy: 0.7911 - val_loss: 1.4668 - val_accuracy: 0.7893
Epoch 3/15
153/153 [==============================] - 1s 9ms/step - loss: 1.2317 - accuracy: 0.8429 - val_loss: 1.1847 - val_accuracy: 0.8139
Epoch 4/15
153/153 [==============================] - 1s 8ms/step - loss: 0.9859 - accuracy: 0.8707 - val_loss: 1.0081 - val_accuracy: 0.8279
Epoch 5/15
153/153 [==============================] - 1s 10ms/step - loss: 0.8208 - accuracy: 0.8959 - val_loss: 0.8869 - val_accuracy: 0.8426
Epoch 6/15
153/153 [==============================] - 2s 11ms/step - loss: 0.7014 - accuracy: 0.9121 - val_loss: 0.7984 - val_accuracy: 0.8492
Epoch 7/15
153/153 [==============================] - 1s 9ms/step - loss: 0.6106 - accuracy: 0.9252 - val_loss: 0.7308 - val_accuracy: 0.8566
Epoch 8/15
153/153 [==============================] - 1s 9ms/step - loss: 0.5390 - accuracy: 0.9353 - val_loss: 0.6774 - val_accuracy: 0.8623
Epoch 9/15
153/153 [==============================] - 1s 8ms/step - loss: 0.4811 - accuracy: 0.9437 - val_loss: 0.6342 - val_accuracy: 0.8680
Epoch 10/15
153/153 [==============================] - 1s 9ms/step - loss: 0.4332 - accuracy: 0.9525 - val_loss: 0.5985 - val_accuracy: 0.8697
Epoch 11/15
153/153 [==============================] - 1s 9ms/step - loss: 0.3929 - accuracy: 0.9588 - val_loss: 0.5685 - val_accuracy: 0.8721
Epoch 12/15
153/153 [==============================] - 2s 13ms/step - loss: 0.3585 - accuracy: 0.9629 - val_loss: 0.5429 - val_accuracy: 0.8713
Epoch 13/15
153/153 [==============================] - 1s 8ms/step - loss: 0.3289 - accuracy: 0.9662 - val_loss: 0.5209 - val_accuracy: 0.8779
Epoch 14/15
153/153 [==============================] - 1s 9ms/step - loss: 0.3030 - accuracy: 0.9685 - val_loss: 0.5017 - val_accuracy: 0.8779
Epoch 15/15
153/153 [==============================] - 1s 8ms/step - loss: 0.2803 - accuracy: 0.9699 - val_loss: 0.4848 - val_accuracy: 0.8811
CPU times: user 25.6 s, sys: 1.04 s, total: 26.7 s
Wall time: 29.7 s
# Epoch 1/10
# 153/153 [==============================] - 1s 5ms/step - loss: 1.2311 - accuracy: 0.8449 - val_loss: 1.1815 - val_accuracy: 0.8270
# Epoch 2/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.9858 - accuracy: 0.8744 - val_loss: 1.0057 - val_accuracy: 0.8377
# Epoch 3/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.8211 - accuracy: 0.8996 - val_loss: 0.8850 - val_accuracy: 0.8434
# Epoch 4/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.7018 - accuracy: 0.9146 - val_loss: 0.7968 - val_accuracy: 0.8516
# Epoch 5/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.6111 - accuracy: 0.9273 - val_loss: 0.7295 - val_accuracy: 0.8566
# Epoch 6/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.5395 - accuracy: 0.9373 - val_loss: 0.6764 - val_accuracy: 0.8631
# Epoch 7/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.4816 - accuracy: 0.9461 - val_loss: 0.6335 - val_accuracy: 0.8656
# Epoch 8/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.4337 - accuracy: 0.9537 - val_loss: 0.5980 - val_accuracy: 0.8689
# Epoch 9/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.3934 - accuracy: 0.9588 - val_loss: 0.5681 - val_accuracy: 0.8738
# Epoch 10/10
# 153/153 [==============================] - 1s 5ms/step - loss: 0.3590 - accuracy: 0.9635 - val_loss: 0.5427 - val_accuracy: 0.8762
# CPU times: user 13.7 s, sys: 1.73 s, total: 15.4 s
# Wall time: 7.31 s
def create_model(vocab_size, num_labels, vectorizer=None):
  my_layers =[]
  if vectorizer is not None:
    my_layers = [vectorizer]

  my_layers.extend([
      layers.Embedding(vocab_size, 64, mask_zero=True),
      layers.Dropout(0.5),
      layers.Conv1D(64, 5, padding="valid", activation="relu", strides=2),
      layers.GlobalMaxPooling1D(),
      layers.Dense(num_labels)
  ])

  model = tf.keras.Sequential(my_layers)
  return model
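
The shapes flowing through `create_model` can be worked out by hand. The sketch below assumes a sequence length of 250 (the printed `int` vector above has 250 entries) and, purely for illustration, a `VOCAB_SIZE` of 10000; the real value is set earlier in the notebook and is not shown in this section.

```python
ASSUMED_VOCAB_SIZE = 10_000   # assumption for illustration only
SEQUENCE_LENGTH = 250         # inferred from the printed 'int' vector
EMBED_DIM = 64
FILTERS = 64
KERNEL_SIZE = 5
STRIDE = 2
NUM_LABELS = 25

# Conv1D with padding="valid": floor((L - kernel) / stride) + 1 positions.
conv_out_len = (SEQUENCE_LENGTH - KERNEL_SIZE) // STRIDE + 1

# Trainable parameters per layer (Dropout and pooling have none).
embedding_params = (ASSUMED_VOCAB_SIZE + 1) * EMBED_DIM
conv_params = KERNEL_SIZE * EMBED_DIM * FILTERS + FILTERS
dense_params = FILTERS * NUM_LABELS + NUM_LABELS

print("after Embedding:          (batch, %d, %d)" % (SEQUENCE_LENGTH, EMBED_DIM))
print("after Conv1D:             (batch, %d, %d)" % (conv_out_len, FILTERS))
print("after GlobalMaxPooling1D: (batch, %d)" % FILTERS)
print("after Dense:              (batch, %d)" % NUM_LABELS)
print("parameters:", embedding_params + conv_params + dense_params)
```

Under these assumptions almost all parameters sit in the embedding table, so the vocabulary size is the main lever on model size.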
# `vocab_size` is `VOCAB_SIZE + 1` because index 0 is additionally reserved for padding.

int_model = create_model(vocab_size=VOCAB_SIZE + 1, num_labels=new_num_labels, vectorizer=int_vectorize_layer)

tf.keras.utils.plot_model(int_model, show_shapes=True)

%%time
int_model.compile(
    loss=losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer='adam',
    metrics=['accuracy'])

int_history = int_model.fit(raw_train_ds, validation_data=raw_val_ds, epochs=epochs)
## CPU times: user 1min 12s, sys: 6.5 s, total: 1min 18s
## Wall time: 36.9 s ## 19 s
Epoch 1/15
153/153 [==============================] - 7s 38ms/step - loss: 3.0561 - accuracy: 0.2003 - val_loss: 2.6361 - val_accuracy: 0.4566
Epoch 2/15
153/153 [==============================] - 5s 29ms/step - loss: 2.0293 - accuracy: 0.5244 - val_loss: 1.3931 - val_accuracy: 0.6369
Epoch 3/15
153/153 [==============================] - 6s 38ms/step - loss: 1.1298 - accuracy: 0.6940 - val_loss: 0.8160 - val_accuracy: 0.7893
Epoch 4/15
153/153 [==============================] - 4s 29ms/step - loss: 0.7124 - accuracy: 0.8097 - val_loss: 0.5603 - val_accuracy: 0.8484
Epoch 5/15
153/153 [==============================] - 5s 31ms/step - loss: 0.4967 - accuracy: 0.8644 - val_loss: 0.4404 - val_accuracy: 0.8746
Epoch 6/15
153/153 [==============================] - 5s 34ms/step - loss: 0.3827 - accuracy: 0.8978 - val_loss: 0.3835 - val_accuracy: 0.8836
Epoch 7/15
153/153 [==============================] - 6s 37ms/step - loss: 0.3063 - accuracy: 0.9166 - val_loss: 0.3424 - val_accuracy: 0.8910
Epoch 8/15
153/153 [==============================] - 4s 29ms/step - loss: 0.2523 - accuracy: 0.9254 - val_loss: 0.3256 - val_accuracy: 0.8951
Epoch 9/15
153/153 [==============================] - 6s 37ms/step - loss: 0.2111 - accuracy: 0.9385 - val_loss: 0.3080 - val_accuracy: 0.9000
Epoch 10/15
153/153 [==============================] - 5s 34ms/step - loss: 0.1796 - accuracy: 0.9461 - val_loss: 0.2971 - val_accuracy: 0.8992
Epoch 11/15
153/153 [==============================] - 5s 32ms/step - loss: 0.1619 - accuracy: 0.9478 - val_loss: 0.2930 - val_accuracy: 0.8975
Epoch 12/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1423 - accuracy: 0.9562 - val_loss: 0.2880 - val_accuracy: 0.9008
Epoch 13/15
153/153 [==============================] - 6s 37ms/step - loss: 0.1264 - accuracy: 0.9623 - val_loss: 0.2938 - val_accuracy: 0.8992
Epoch 14/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1131 - accuracy: 0.9650 - val_loss: 0.2860 - val_accuracy: 0.9008
Epoch 15/15
153/153 [==============================] - 4s 29ms/step - loss: 0.1054 - accuracy: 0.9676 - val_loss: 0.2886 - val_accuracy: 0.9000
CPU times: user 1min 45s, sys: 3.28 s, total: 1min 48s
Wall time: 1min 37s
# # Epoch 1/10
# # 153/153 [==============================] - 3s 13ms/step - loss: 1.1379 - accuracy: 0.6780 - val_loss: 0.9188 - val_accuracy: 0.7361
# # Epoch 2/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.8141 - accuracy: 0.7638 - val_loss: 0.6918 - val_accuracy: 0.8008
# # Epoch 3/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.5948 - accuracy: 0.8265 - val_loss: 0.5431 - val_accuracy: 0.8467
# # Epoch 4/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.4447 - accuracy: 0.8734 - val_loss: 0.4490 - val_accuracy: 0.8680
# # Epoch 5/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.3474 - accuracy: 0.9007 - val_loss: 0.3991 - val_accuracy: 0.8754
# # Epoch 6/10
# # 153/153 [==============================] - 2s 13ms/step - loss: 0.2838 - accuracy: 0.9177 - val_loss: 0.3596 - val_accuracy: 0.8877
# # Epoch 7/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.2363 - accuracy: 0.9283 - val_loss: 0.3366 - val_accuracy: 0.8893
# # Epoch 8/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.2001 - accuracy: 0.9359 - val_loss: 0.3241 - val_accuracy: 0.8926
# # Epoch 9/10
# # 153/153 [==============================] - 2s 12ms/step - loss: 0.1706 - accuracy: 0.9469 - val_loss: 0.3140 - val_accuracy: 0.8934
# # Epoch 10/10
# # 153/153 [==============================] - 2s 11ms/step - loss: 0.1559 - accuracy: 0.9502 - val_loss: 0.3109 - val_accuracy: 0.8959
# # CPU times: user 1min 12s, sys: 6.5 s, total: 1min 18s
# # Wall time: 19 s

# Epoch 1/30
# 153/153 [==============================] - 11s 55ms/step - loss: 3.0357 - accuracy: 0.2030 - val_loss: 2.5830 - val_accuracy: 0.4385
# Epoch 2/30
# 153/153 [==============================] - 6s 39ms/step - loss: 1.9925 - accuracy: 0.5104 - val_loss: 1.3674 - val_accuracy: 0.6861
# Epoch 3/30
# 153/153 [==============================] - 7s 48ms/step - loss: 1.0896 - accuracy: 0.7141 - val_loss: 0.7781 - val_accuracy: 0.7967
# Epoch 4/30
# 153/153 [==============================] - 6s 38ms/step - loss: 0.6872 - accuracy: 0.8081 - val_loss: 0.5537 - val_accuracy: 0.8459
# Epoch 5/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.4933 - accuracy: 0.8587 - val_loss: 0.4498 - val_accuracy: 0.8713
# Epoch 6/30
# 153/153 [==============================] - 6s 41ms/step - loss: 0.3738 - accuracy: 0.8935 - val_loss: 0.3869 - val_accuracy: 0.8877
# Epoch 7/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.2988 - accuracy: 0.9144 - val_loss: 0.3549 - val_accuracy: 0.8902
# Epoch 8/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.2461 - accuracy: 0.9228 - val_loss: 0.3320 - val_accuracy: 0.8959
# Epoch 9/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.2037 - accuracy: 0.9412 - val_loss: 0.3136 - val_accuracy: 0.8975
# Epoch 10/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.1732 - accuracy: 0.9480 - val_loss: 0.3036 - val_accuracy: 0.9016
# Epoch 11/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.1539 - accuracy: 0.9504 - val_loss: 0.3005 - val_accuracy: 0.9000
# Epoch 12/30
# 153/153 [==============================] - 6s 38ms/step - loss: 0.1366 - accuracy: 0.9584 - val_loss: 0.2904 - val_accuracy: 0.9025
# Epoch 13/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.1221 - accuracy: 0.9613 - val_loss: 0.2998 - val_accuracy: 0.9025
# Epoch 14/30
# 153/153 [==============================] - 8s 54ms/step - loss: 0.1096 - accuracy: 0.9625 - val_loss: 0.2899 - val_accuracy: 0.9008
# Epoch 15/30
# 153/153 [==============================] - 7s 45ms/step - loss: 0.0985 - accuracy: 0.9697 - val_loss: 0.2936 - val_accuracy: 0.9025
# Epoch 16/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.0949 - accuracy: 0.9670 - val_loss: 0.2931 - val_accuracy: 0.9008
# Epoch 17/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0841 - accuracy: 0.9713 - val_loss: 0.2982 - val_accuracy: 0.9016
# Epoch 18/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.0769 - accuracy: 0.9744 - val_loss: 0.2974 - val_accuracy: 0.9049
# Epoch 19/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.0714 - accuracy: 0.9773 - val_loss: 0.3041 - val_accuracy: 0.8967
# Epoch 20/30
# 153/153 [==============================] - 7s 43ms/step - loss: 0.0670 - accuracy: 0.9783 - val_loss: 0.3038 - val_accuracy: 0.9016
# Epoch 21/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0636 - accuracy: 0.9766 - val_loss: 0.3071 - val_accuracy: 0.9049
# Epoch 22/30
# 153/153 [==============================] - 7s 48ms/step - loss: 0.0563 - accuracy: 0.9818 - val_loss: 0.3019 - val_accuracy: 0.9066
# Epoch 23/30
# 153/153 [==============================] - 6s 40ms/step - loss: 0.0523 - accuracy: 0.9836 - val_loss: 0.3086 - val_accuracy: 0.9066
# Epoch 24/30
# 153/153 [==============================] - 7s 47ms/step - loss: 0.0530 - accuracy: 0.9814 - val_loss: 0.3137 - val_accuracy: 0.9057
# Epoch 25/30
# 153/153 [==============================] - 7s 44ms/step - loss: 0.0463 - accuracy: 0.9834 - val_loss: 0.3090 - val_accuracy: 0.9041
# Epoch 26/30
# 153/153 [==============================] - 6s 40ms/step - loss: 0.0478 - accuracy: 0.9836 - val_loss: 0.3241 - val_accuracy: 0.9033
# Epoch 27/30
# 153/153 [==============================] - 6s 39ms/step - loss: 0.0436 - accuracy: 0.9836 - val_loss: 0.3196 - val_accuracy: 0.9041
# Epoch 28/30
# 153/153 [==============================] - 6s 42ms/step - loss: 0.0439 - accuracy: 0.9853 - val_loss: 0.3269 - val_accuracy: 0.9000
# Epoch 29/30
# 153/153 [==============================] - 8s 49ms/step - loss: 0.0398 - accuracy: 0.9855 - val_loss: 0.3288 - val_accuracy: 0.9033
# Epoch 30/30
# 153/153 [==============================] - 6s 37ms/step - loss: 0.0373 - accuracy: 0.9861 - val_loss: 0.3301 - val_accuracy: 0.8984
# CPU times: user 4min 28s, sys: 8.96 s, total: 4min 36s
# Wall time: 4min 32s
loss = plt.plot(bin_history.epoch, bin_history.history['loss'], label='bin-loss')
plt.plot(bin_history.epoch, bin_history.history['val_loss'], '--', color=loss[0].get_color(), label='bin-val_loss')

loss = plt.plot(int_history.epoch, int_history.history['loss'], label='int-loss')
plt.plot(int_history.epoch, int_history.history['val_loss'], '--', color=loss[0].get_color(), label='int-val_loss')

plt.legend()
plt.xlabel('Epoch')
plt.ylabel('Cross-entropy loss')
Text(0, 0.5, 'Cross-entropy loss')
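
The curves plotted above can also be read numerically. The sketch below copies the `val_loss` values from the two 15-epoch logs in this section and finds the epoch with the lowest validation loss for each model; the helper name `best_epoch` is hypothetical.

```python
# val_loss per epoch, copied from the 15-epoch training logs above.
bin_val_loss = [1.9911, 1.4668, 1.1847, 1.0081, 0.8869, 0.7984, 0.7308,
                0.6774, 0.6342, 0.5985, 0.5685, 0.5429, 0.5209, 0.5017,
                0.4848]
int_val_loss = [2.6361, 1.3931, 0.8160, 0.5603, 0.4404, 0.3835, 0.3424,
                0.3256, 0.3080, 0.2971, 0.2930, 0.2880, 0.2938, 0.2860,
                0.2886]


def best_epoch(val_losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return best_index + 1, val_losses[best_index]


print("bag-of-words model:", best_epoch(bin_val_loss))
print("CNN model:", best_epoch(int_val_loss))
```

In these logs the linear model's validation loss is still falling at the last epoch, while the CNN's has flattened out over the final few epochs. One possible follow-up, not used in this notebook, would be an early-stopping callback such as `tf.keras.callbacks.EarlyStopping` monitoring `val_loss`.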

# Evaluate the model on the test set
int_test_loss, int_test_acc = int_model.evaluate(raw_test_ds)

## 87/87 [==============================] - 1s 11ms/step - loss: 0.2339 - accuracy: 0.9259
87/87 [==============================] - 1s 14ms/step - loss: 0.2118 - accuracy: 0.9259
# Print the test loss and accuracy
print("Test loss:", int_test_loss)
print("Test accuracy:", int_test_acc)
## Test loss: 0.23388110101222992
## Test accuracy: 0.9259259104728699
Test loss: 0.21181465685367584
Test accuracy: 0.9259259104728699
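
As a quick sanity check, the reported accuracy can be converted back into file counts, using the 2781 test files found earlier. This is plain arithmetic on the numbers printed above.

```python
TEST_FILES = 2781                      # size of the test set reported above
test_accuracy = 0.9259259104728699     # value printed by the evaluation cell

# Accuracy is correct / total, so multiplying back recovers the count.
correct = round(test_accuracy * TEST_FILES)
misclassified = TEST_FILES - correct

print("correctly classified:", correct)
print("misclassified:", misclassified)
```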
# binary_train_ds = raw_train_ds.map(lambda x,y: (binary_vectorize_layer(x), y))

clf_train_ds = raw_train_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))
clf_val_ds = raw_val_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))
clf_test_ds = raw_test_ds.map(lambda x,y: (multi_class_vectorize_layer(x), y))

int_train_ds = raw_train_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_val_ds = raw_val_ds.map(lambda x,y: (int_vectorize_layer(x), y))
int_test_ds = raw_test_ds.map(lambda x,y: (int_vectorize_layer(x), y))

Export the model

clf_model.export('bin.tf')
Saved artifact at 'bin.tf'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None,), dtype=tf.string, name='text_vectorization_input')
Output Type:
  TensorSpec(shape=(None, 25), dtype=tf.float32, name=None)
Captures:
  139832539029648: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139832326372544: TensorSpec(shape=(), dtype=tf.int64, name=None)
  139832326361280: TensorSpec(shape=(), dtype=tf.string, name=None)
  139832326361632: TensorSpec(shape=(), dtype=tf.int64, name=None)
  139832461989792: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139832329937216: TensorSpec(shape=(), dtype=tf.resource, name=None)
loaded = tf.saved_model.load('bin.tf')
clf_model.predict(['How do you sort a list?'])
1/1 [==============================] - 0s 145ms/step
array([[ 0.09657298, -0.49373376,  0.07736832,  0.06607725,  0.05221149,
        -0.46978113, -0.48686236, -0.55163133,  0.15191314,  0.23564513,
        -0.4884494 , -0.6964581 ,  0.24588017, -0.74412024, -0.48752603,
         0.3062069 , -0.8510176 , -1.0887347 , -0.4476996 , -0.05396813,
        -0.66050076, -0.6878822 , -0.61148196, -0.7419138 , -0.70286953]],
      dtype=float32)
loaded.serve(tf.constant(['How do you sort a list?'])).numpy()
array([[ 0.09657298, -0.49373376,  0.07736832,  0.06607725,  0.05221149,
        -0.46978113, -0.48686236, -0.55163133,  0.15191314,  0.23564513,
        -0.4884494 , -0.6964581 ,  0.24588017, -0.74412024, -0.48752603,
         0.3062069 , -0.8510176 , -1.0887347 , -0.4476996 , -0.05396813,
        -0.66050076, -0.6878822 , -0.61148196, -0.7419138 , -0.70286953]],
      dtype=float32)
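
The exported model returns raw logits, one per class. To turn them into a predicted label, apply a softmax and take the arg-max. The sketch below does this in plain Python on the logits printed above; the function name `softmax` is a local helper, not a library call.

```python
import math

# Logits printed above for the input 'How do you sort a list?'.
logits = [0.09657298, -0.49373376, 0.07736832, 0.06607725, 0.05221149,
          -0.46978113, -0.48686236, -0.55163133, 0.15191314, 0.23564513,
          -0.4884494, -0.6964581, 0.24588017, -0.74412024, -0.48752603,
          0.3062069, -0.8510176, -1.0887347, -0.4476996, -0.05396813,
          -0.66050076, -0.6878822, -0.61148196, -0.7419138, -0.70286953]


def softmax(values):
    """Convert logits to probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
predicted_label = max(range(len(probs)), key=probs.__getitem__)

print("predicted label:", predicted_label)
print("probability:", probs[predicted_label])
```

The arg-max lands on label 15, which the label listing maps to CWE401_Memory_Leak, but the probability is low (well under 10%, not far above the 4% of a uniform guess over 25 classes). That is expected here, since the input is an English question rather than C++ code.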
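# Note: this exports `clf_model` (the bag-of-words linear classifier), not the CNN `int_model`, despite the file name.
clf_model.export('cpp_top_25_cwe_cnn_model.tf')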
Saved artifact at 'cpp_top_25_cwe_cnn_model.tf'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None,), dtype=tf.string, name='text_vectorization_input')
Output Type:
  TensorSpec(shape=(None, 25), dtype=tf.float32, name=None)
Captures:
  139832539029648: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139832326372544: TensorSpec(shape=(), dtype=tf.int64, name=None)
  139832326361280: TensorSpec(shape=(), dtype=tf.string, name=None)
  139832326361632: TensorSpec(shape=(), dtype=tf.int64, name=None)
  139832461989792: TensorSpec(shape=(), dtype=tf.resource, name=None)
  139832329937216: TensorSpec(shape=(), dtype=tf.resource, name=None)
## cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
!tar -czf cpp_top_25_cwe_cnn_model.tf.tar.gz cpp_top_25_cwe_cnn_model.tf
!ls -lh
total 21M
drwxr-xr-x  4 root root 4.0K Jan  7 10:38 bin.tf
-rw-r--r--  1 root root 339K Jan  7 10:33 bow_transformer.pk
-rw-r--r--  1 root root 756K Jan  7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r--  1 root root 875K Jan  7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan  3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r--  1 root root  15K Jan  7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x  4 root root 4.0K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r--  1 root root 878K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
-rw-r--r--  1 root root  11M Jan  7 10:32 data_drop_na.csv
-rw-r--r--  1 root root  52K Jan  7 10:36 model.png
-rw-r--r--  1 root root 4.9M Jan  7 10:35 nb_model.pk
-rw-r--r--  1 root root 2.9M Jan  7 10:35 sard.zip
drwxr-xr-x 27 root root 4.0K Jan  7 10:35 test
-rw-r--r--  1 root root 201K Jan  7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan  7 10:35 train
# !wget https://raw.githubusercontent.com/c6ai/temp/main/cpp_top_25_cwe_cnn_model.tf.tar.gz
# !gdown 1j1nY2qlLnA_Iap0_ug8ZAQuDgKX1QIJB

from google.colab import files
# files.download("cpp_ready_8750_files_each_350_top_25_cwe_omitted.tar.gz")
# files.download("cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz")
files.download("cpp_top_25_cwe_cnn_model.tf.tar.gz")
<IPython.core.display.Javascript object>
<IPython.core.display.Javascript object>
# ## extract or un-tar (unzip):
# !tar -xzf cpp_top_25_cwe_cnn_model.tf.tar.gz
# ## 1st ...
# from google.colab import drive
# drive.mount('/content/drive')
# !mkdir tar_gz_files
# !cp cpp_top_25_cwe_cnn_model.tf.tar.gz tar_gz_files
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
print("TensorFlow version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

## TensorFlow version:  2.13.1
## Eager mode:  True
## GPU is NOT AVAILABLE

## TensorFlow version:  2.15.0
## Eager mode:  True
## T4 GPU is available
## High RAM
TensorFlow version:  2.15.0
Eager mode:  True
GPU is NOT AVAILABLE

This may take a few minutes

%%time

!tar -czf 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz /content/*

## CPU times: user 611 ms, sys: 77.2 ms, total: 688 ms
## Wall time: 1min 25s
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
CPU times: user 20.4 ms, sys: 3.04 ms, total: 23.4 ms
Wall time: 1.51 s
!ls -lh
total 31M
-rw-r--r--  1 root root 9.9M Jan  7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
drwxr-xr-x  4 root root 4.0K Jan  7 10:38 bin.tf
-rw-r--r--  1 root root 339K Jan  7 10:33 bow_transformer.pk
-rw-r--r--  1 root root 756K Jan  7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x 27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r--  1 root root 875K Jan  7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x 12 root root 4.0K Jan  3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r--  1 root root  15K Jan  7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x  4 root root 4.0K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r--  1 root root 878K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
-rw-r--r--  1 root root  11M Jan  7 10:32 data_drop_na.csv
-rw-r--r--  1 root root  52K Jan  7 10:36 model.png
-rw-r--r--  1 root root 4.9M Jan  7 10:35 nb_model.pk
-rw-r--r--  1 root root 2.9M Jan  7 10:35 sard.zip
drwxr-xr-x 27 root root 4.0K Jan  7 10:35 test
-rw-r--r--  1 root root 201K Jan  7 10:33 tfidf_transformer.pk
drwxr-xr-x 27 root root 4.0K Jan  7 10:35 train
## import time
## global_start = time.time()

global_end = time.time()

# print("[T4 GPU & High RAM?] Global Time Duration: " + str(global_end - global_start))
print("Global Time Duration: " + str(global_end - global_start))

## [T4 GPU & High RAM] Global Time Duration: 643
## 643s / 60 is approximately 11 minutes
## 8 mins on CPU & High RAM

Global Time Duration: 401.64417576789856

This may take a few minutes

%%time

## > 1GB !
# from google.colab import files
# files.download("1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz")

##
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.58 µs

Inspired by Prof. Sadawi's lecture notes, coursework template ** and related notebooks, while further extending the official TF2 documentation.

** Sadawi, N. (2021). Nsadawi/Advanced-ML-Projects Jupyter Notebook (Original work published 2020) [dc88adb9c256ae34c381a0d3533a586c5906aac8].


APPENDIX- A: Dataset Sourcing from Scratch

In-Sample Test-Cases for each CWE

API WARNING: Huge Size Download

!wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
--2024-01-07 10:38:18--  https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
Resolving samate.nist.gov (samate.nist.gov)... 129.6.13.19, 2610:20:6005:13::19
Connecting to samate.nist.gov (samate.nist.gov)|129.6.13.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 703487355 (671M) [application/zip]
Saving to: ‘2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip’

2022-08-11-juliet-c 100%[===================>] 670.90M  97.6MB/s    in 5.5s    

2024-01-07 10:38:24 (123 MB/s) - ‘2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip’ saved [703487355/703487355]

This may take a few minutes

%%time

!unzip 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip -d data # /content/data

## CPU times: user 26.5 s, sys: 5.38 s, total: 31.8 s
## Wall time: 4min 3s
# Streaming output truncated to the last 5000 lines.
#   inflating: data/97419-v1.0.0/src/testcases/CWE36_Absolute_Path_Traversal/s04/CWE36_Absolute_Path_Traversal__wchar_t_environment_w32CreateFile_10.cpp
#    creating: data/247234-v2.0.0/
#   inflating: data/247234-v2.0.0/manifest.sarif
#   inflating: data/247234-v2.0.0/Dockerfile
#   inflating: data/247234-v2.0.0/Makefile
#  extracting: data/247234-v2.0.0/.dockerignore
#  extracting: data/247234-v2.0.0/.gitignore
#    creating: data/247234-v2.0.0/src/
#    creating: data/247234-v2.0.0/src/testcasesupport/
#   inflating: data/247234-v2.0.0/src/testcasesupport/std_testcase_io.h
#   inflating: data/247234-v2.0.0/src/testcasesupport/io.c
#   inflating: data/247234-v2.0.0/src/testcasesupport/std_testcase.h
#    creating: data/247234-v2.0.0/src/testcases/
#    creating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/
#    creating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/s06/
#   inflating: data/247234-v2.0.0/src/testcases/CWE78_OS_Command_Injection/s06/CWE78_OS_Command_Injection__wchar_t_console_w32_spawnvp_34.c
#    creating: data/111814-v1.0.0/
# ...
# ...
# Open the zip file
with zipfile.ZipFile('2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip', 'r') as zip_ref:
    # Get a list of all file names in the zip file
    all_files = zip_ref.namelist()
    # Calculate 1% of the total number of files (at least 1, so the slice below never becomes all_files[-0:])
    one_percent_files = max(1, int(len(all_files) * 0.01))
    # Get the last 1% of the files
    last_one_percent_files = all_files[-one_percent_files:]
    # Extract only the last 1% of the files
    for file in last_one_percent_files:
        zip_ref.extract(file, '/content/data')

print("Extracted the last 1% of files to /content/data")
Extracted the last 1% of files to /content/data
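
One detail of the "last 1%" slice is worth spelling out: in Python, `items[-0:]` is the same as `items[0:]`, so if the computed count rounds down to zero the slice returns the whole list instead of nothing. The sketch below shows the pitfall on a small hypothetical file list and one possible guard (taking at least one item); the helper name `last_fraction` is illustrative.

```python
def last_fraction(items, fraction):
    """Return the trailing `fraction` of `items`, but at least one item."""
    count = max(1, int(len(items) * fraction))
    return items[-count:]


# Hypothetical small archive listing: 1% of 50 rounds down to 0.
small = ["file_%d.cpp" % i for i in range(50)]
naive_count = int(len(small) * 0.01)
naive = small[-naive_count:]           # -0 slices from the start: all 50 items

print("naive count:", naive_count, "-> items selected:", len(naive))
print("guarded selection:", last_fraction(small, 0.01))
print("200 items:", last_fraction(list(range(200)), 0.01))
```

With the full Juliet archive the count is far above zero, so the pitfall only matters for small inputs, but the guard costs nothing.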
# Initialize an empty list to store the file details
df_files = []

# Traverse the directory tree
# `filenames` avoids shadowing the `files` module imported from google.colab
for dirpath, dirs, filenames in os.walk('/content/data'):
    for filename in filenames:
        # Get the file address
        file_address = os.path.join(dirpath, filename)
        # Append the file name and address to the list
        df_files.append([filename, file_address])

# Convert the list to a dataframe
df_f = pd.DataFrame(df_files, columns=['File_Name', 'File_Address'])

# Save the dataframe to a csv file
df_f.to_csv('files_tree.csv', index=False)
# Initialize an empty list to store the file details
cpp_list = []

# Traverse the directory tree
for dirpath, dirs, filenames in os.walk('/content/data'):
    for filename in filenames:
        # Only process files with the '.cpp' extension
        if filename.endswith('.cpp'):
            # Get the file address
            file_address = os.path.join(dirpath, filename)
            # Append the file name and address to the list
            cpp_list.append([filename, file_address])

# Convert the list to a dataframe
cpp_df = pd.DataFrame(cpp_list, columns=['File_Name', 'File_Address'])

# Save the dataframe to a csv file
cpp_df.to_csv('cpp_files_tree.csv', index=False)
# cpp_list[0]
cpp_df.shape ## (411, 2) ## (46401, 2) ## (618185, 2)
# cpp_df.head()
(46401, 2)
# df_files.shape
cpp_df.head()
                                           File_Name                                       File_Address
0  CWE36_Absolute_Path_Traversal__wchar_t_connect...  /content/data/96809-v1.0.0/src/testcases/CWE36...
1  CWE590_Free_Memory_Not_on_Heap__delete_struct_...  /content/data/107819-v1.0.0/src/testcases/CWE5...
2  CWE23_Relative_Path_Traversal__char_connect_so...  /content/data/89720-v1.0.0/src/testcases/CWE23...
3  CWE23_Relative_Path_Traversal__char_connect_so...  /content/data/89720-v1.0.0/src/testcases/CWE23...
4  CWE23_Relative_Path_Traversal__char_connect_so...  /content/data/89720-v1.0.0/src/testcases/CWE23...
# !mkdir /content/cpp_data
!mkdir -p /content/cpp_data

This may take a few minutes.

%%time

# !mkdir -p /content/cpp_data
!awk -F ',' '{if (NR!=1) print $2}' cpp_files_tree.csv | xargs -I {} cp {} /content/cpp_data/

## CPU times: user 417 ms, sys: 54.4 ms, total: 472 ms
## Wall time: 1min 10s
CPU times: user 929 ms, sys: 94 ms, total: 1.02 s
Wall time: 2min
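
The awk/xargs pipeline above splits the CSV on raw commas and copies everything into one flat folder, so a path containing a comma would be cut short and two files with the same name would overwrite each other without warning. The sketch below is one possible pure-Python alternative, assuming the `File_Address` column written earlier; `copy_flat` and the temporary demo files are hypothetical.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def copy_flat(csv_path, dest_dir, column="File_Address"):
    """Copy every file listed in `column` into one flat folder.

    Returns (copied, collisions); a collision is a file whose name was
    already present in dest_dir and was therefore overwritten.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = collisions = 0
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            source = Path(row[column])
            target = dest / source.name
            if target.exists():
                collisions += 1
            shutil.copy(source, target)
            copied += 1
    return copied, collisions


# Demo on throwaway files: two folders share the name x.cpp
root = Path(tempfile.mkdtemp())
for folder, name in [("a", "x.cpp"), ("b", "x.cpp"), ("b", "y.cpp")]:
    (root / folder).mkdir(exist_ok=True)
    (root / folder / name).write_text("// demo\n")

listing = root / "cpp_files_tree.csv"
with open(listing, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["File_Name", "File_Address"])
    for path in sorted(root.glob("*/*.cpp")):
        writer.writerow([path.name, str(path)])

result = copy_flat(listing, root / "flat")
print(result)  # (3, 1)
```

A non-zero collision count would mean the flat folder holds fewer files than the CSV has rows.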
# !mv /content/cpp_files_all_raw.zip /content/cpp_files_all_raw_partial.zip
# Initialize an empty set to store the unique prefixes
subfolders_set = set()

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_data'):
    for filename in files:
        # Extract the prefix from the file name
        prefix = re.match(r'(.*)__', filename)
        if prefix:
            # Add the prefix to the set
            subfolders_set.add(prefix.group(1))

# Convert the set to a list
subfolders_list = list(subfolders_set)

print(subfolders_list)
['CWE675_Duplicate_Operations_on_Resource', 'CWE758_Undefined_Behavior', 'CWE123_Write_What_Where_Condition', 'CWE590_Free_Memory_Not_on_Heap', 'CWE690_NULL_Deref_From_Return', 'CWE126_Buffer_Overread', 'CWE195_Signed_to_Unsigned_Conversion_Error', 'CWE773_Missing_Reference_to_Active_File_Descriptor_or_Handle', 'CWE563_Unused_Variable', 'CWE606_Unchecked_Loop_Condition', 'CWE127_Buffer_Underread', 'CWE400_Resource_Exhaustion', 'CWE114_Process_Control', 'CWE762_Mismatched_Memory_Management_Routines', 'CWE416_Use_After_Free', 'CWE121_Stack_Based_Buffer_Overflow', 'CWE390_Error_Without_Action', 'CWE591_Sensitive_Data_Storage_in_Improperly_Locked_Memory', 'CWE396_Catch_Generic_Exception', 'CWE789_Uncontrolled_Mem_Alloc', 'CWE500_Public_Static_Field_Not_Final', 'CWE23_Relative_Path_Traversal', 'CWE476_NULL_Pointer_Dereference', 'CWE134_Uncontrolled_Format_String', 'CWE672_Operation_on_Resource_After_Expiration_or_Release', 'CWE562_Return_of_Stack_Variable_Address', 'CWE256_Plaintext_Storage_of_Password', 'CWE319_Cleartext_Tx_Sensitive_Info', 'CWE197_Numeric_Truncation_Error', 'CWE194_Unexpected_Sign_Extension', 'CWE397_Throw_Generic_Exception', 'CWE415_Double_Free', 'CWE90_LDAP_Injection', 'CWE457_Use_of_Uninitialized_Variable', 'CWE843_Type_Confusion', 'CWE369_Divide_by_Zero', 'CWE680_Integer_Overflow_to_Buffer_Overflow', 'CWE176_Improper_Handling_of_Unicode_Encoding', 'CWE404_Improper_Resource_Shutdown', 'CWE427_Uncontrolled_Search_Path_Element', 'CWE190_Integer_Overflow', 'CWE321_Hard_Coded_Cryptographic_Key', 'CWE122_Heap_Based_Buffer_Overflow', 'CWE468_Incorrect_Pointer_Scaling', 'CWE588_Attempt_to_Access_Child_of_Non_Structure_Pointer', 'CWE426_Untrusted_Search_Path', 'CWE401_Memory_Leak', 'CWE78_OS_Command_Injection', 'CWE124_Buffer_Underwrite', 'CWE191_Integer_Underflow', 'CWE775_Missing_Release_of_File_Descriptor_or_Handle', 'CWE259_Hard_Coded_Password', 'CWE665_Improper_Initialization', 'CWE676_Use_of_Potentially_Dangerous_Function', 
'CWE15_External_Control_of_System_or_Configuration_Setting', 'CWE440_Expected_Behavior_Violation', 'CWE464_Addition_of_Data_Structure_Sentinel', 'CWE617_Reachable_Assertion', 'CWE36_Absolute_Path_Traversal', 'CWE761_Free_Pointer_Not_at_Start_of_Buffer']
num_items = len(subfolders_list)
print(num_items)
## 42 ## 60
60
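
The class label comes entirely from the file name: everything before the double underscore. As a sketch, the same rule can be written as a small function that also returns the numeric CWE ID; `parse_cwe` is a hypothetical name, and the two sample file names are taken from the outputs in this notebook.

```python
import re

# "CWE<number>_<description>" followed by a double underscore
CWE_NAME = re.compile(r"^(CWE(\d+)_[A-Za-z0-9_]+?)__")


def parse_cwe(filename):
    """Return (label, numeric_id) for a Juliet-style file name, else None."""
    match = CWE_NAME.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_cwe("CWE122_Heap_Based_Buffer_Overflow__c_src_char_cpy_72b.cpp"))
# ('CWE122_Heap_Based_Buffer_Overflow', 122)
print(parse_cwe("CWE590_Free_Memory_Not_on_Heap__delete_struct_declare_04.cpp"))
# ('CWE590_Free_Memory_Not_on_Heap', 590)
print(parse_cwe("main.cpp"))
# None
```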
# Initialize an empty list to store the prefixes
prefixes = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_data'):
    for filename in files:
        # Extract the prefix from the file name
        prefix = re.match(r'(.*)__', filename)
        if prefix:
            # Add the prefix to the list
            prefixes.append(prefix.group(1))

# Count the number of files for each prefix
prefix_counts = collections.Counter(prefixes)

# Convert the counter to a dataframe
cpp_folders_count_df = pd.DataFrame.from_dict(prefix_counts, orient='index').reset_index()
cpp_folders_count_df.columns = ['CWE-ID', 'Files-Count']

print(cpp_folders_count_df)
                                               CWE-ID  Files-Count
0                       CWE23_Relative_Path_Traversal         3900
1                  CWE121_Stack_Based_Buffer_Overflow         1965
2                        CWE426_Untrusted_Search_Path           88
3                                  CWE401_Memory_Leak         1476
4        CWE762_Mismatched_Memory_Management_Routines         6092
5                       CWE690_NULL_Deref_From_Return          440
6                                  CWE415_Double_Free         1308
7                   CWE122_Heap_Based_Buffer_Overflow         5974
8                              CWE126_Buffer_Overread          912
9                       CWE36_Absolute_Path_Traversal         3900
10                            CWE127_Buffer_Underread         1416
11                         CWE400_Resource_Exhaustion          390
12                  CWE134_Uncontrolled_Format_String         1560
13                           CWE124_Buffer_Underwrite         1416
14                     CWE590_Free_Memory_Not_on_Heap         3321
15                      CWE789_Uncontrolled_Mem_Alloc         1080
16                         CWE78_OS_Command_Injection         2200
17                            CWE190_Integer_Overflow         1404
18                    CWE476_NULL_Pointer_Dereference          158
19         CWE195_Signed_to_Unsigned_Conversion_Error          528
20                             CWE114_Process_Control          264
21                         CWE617_Reachable_Assertion          132
22         CWE680_Integer_Overflow_to_Buffer_Overflow          600
23                              CWE416_Use_After_Free          370
24                    CWE606_Unchecked_Loop_Condition          260
25                               CWE90_LDAP_Injection          220
26            CWE675_Duplicate_Operations_on_Resource          104
27               CWE457_Use_of_Uninitialized_Variable          463
28                           CWE191_Integer_Underflow          858
29                  CWE404_Improper_Resource_Shutdown          176
30                             CWE563_Unused_Variable          388
31       CWE676_Use_of_Potentially_Dangerous_Function           18
32  CWE591_Sensitive_Data_Storage_in_Improperly_Lo...           44
33                              CWE369_Divide_by_Zero          468
34            CWE427_Uncontrolled_Search_Path_Element          220
35         CWE761_Free_Pointer_Not_at_Start_of_Buffer          264
36                   CWE194_Unexpected_Sign_Extension          528
37                        CWE390_Error_Without_Action           18
38                    CWE197_Numeric_Truncation_Error          396
39  CWE588_Attempt_to_Access_Child_of_Non_Structur...           76
40  CWE773_Missing_Reference_to_Active_File_Descri...           66
41                     CWE396_Catch_Generic_Exception           54
42  CWE775_Missing_Release_of_File_Descriptor_or_H...           66
43                     CWE665_Improper_Initialization           90
44                CWE321_Hard_Coded_Cryptographic_Key           44
45  CWE15_External_Control_of_System_or_Configurat...           22
46                          CWE758_Undefined_Behavior          216
47                 CWE319_Cleartext_Tx_Sensitive_Info          104
48  CWE672_Operation_on_Resource_After_Expiration_...           81
49                              CWE843_Type_Confusion           26
50                  CWE123_Write_What_Where_Condition           66
51               CWE256_Plaintext_Storage_of_Password           52
52                         CWE259_Hard_Coded_Password           44
53         CWE464_Addition_of_Data_Structure_Sentinel           22
54       CWE176_Improper_Handling_of_Unicode_Encoding           26
55                     CWE397_Throw_Generic_Exception           20
56                   CWE468_Incorrect_Pointer_Scaling            1
57                 CWE440_Expected_Behavior_Violation            1
58               CWE500_Public_Static_Field_Not_Final            2
59            CWE562_Return_of_Stack_Variable_Address            1
# cpp_cwe_top_25_df = cpp_folders_count_df.nlargest(25, 'Files-Count')
cpp_cwe_top_10_df = cpp_folders_count_df.nlargest(10, 'Files-Count')
print(cpp_cwe_top_10_df)
                                          CWE-ID  Files-Count
4   CWE762_Mismatched_Memory_Management_Routines         6092
7              CWE122_Heap_Based_Buffer_Overflow         5974
0                  CWE23_Relative_Path_Traversal         3900
9                  CWE36_Absolute_Path_Traversal         3900
14                CWE590_Free_Memory_Not_on_Heap         3321
16                    CWE78_OS_Command_Injection         2200
1             CWE121_Stack_Based_Buffer_Overflow         1965
12             CWE134_Uncontrolled_Format_String         1560
3                             CWE401_Memory_Leak         1476
10                       CWE127_Buffer_Underread         1416
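
The counts above are heavily skewed, so it can help to check how much of the data the selected classes cover. A minimal sketch using `collections.Counter` follows; `top_k_share` is a hypothetical helper, and the four counts are a small subset of the table above chosen only for illustration.

```python
from collections import Counter


def top_k_share(counts, k):
    """Return the k largest classes and the fraction of files they cover."""
    top = counts.most_common(k)
    total = sum(counts.values())
    covered = sum(n for _, n in top)
    return top, covered / total


# Subset of the per-class counts printed above (illustration only)
counts = Counter({
    "CWE762_Mismatched_Memory_Management_Routines": 6092,
    "CWE122_Heap_Based_Buffer_Overflow": 5974,
    "CWE23_Relative_Path_Traversal": 3900,
    "CWE468_Incorrect_Pointer_Scaling": 1,
})
top, share = top_k_share(counts, 2)
print([name for name, _ in top])
print(f"{share:.1%} of the files in this subset")
```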
# %%time
# Create the root folder
os.makedirs('/content/cpp_folders', exist_ok=True)

# For each CWE-ID in the top 10/25
for index, row in cpp_cwe_top_10_df.iterrows():
    cwe_id = row['CWE-ID']
    files_count = row['Files-Count']

    # If the files count is greater than 10/350
    # if files_count > 350:
    if files_count > 10:
        # Create a subfolder for the CWE-ID
        os.makedirs(f'/content/cpp_folders/{cwe_id}', exist_ok=True)

        # Get the list of files for the CWE-ID
        files = [f for f in os.listdir('/content/cpp_data') if f.startswith(cwe_id + '__')]

        # Copy the initial 10/350 files
        # for file in files[:350]:
        for file in files[:10]:
            shutil.copy(os.path.join('/content/cpp_data', file), f'/content/cpp_folders/{cwe_id}')
## 2s
## CPU times: user 745 ms, sys: 1.5 s, total: 2.25 s
# Wall time: 2.27 s
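
Taking `files[:10]` depends on the order returned by `os.listdir`, which is arbitrary, so a rerun may pick different files. One way to make the selection repeatable is to sort the names and draw with a seeded generator. The sketch below assumes the `<CWE-ID>__` naming used in this notebook; `sample_per_class` and the toy class names are hypothetical.

```python
import random


def sample_per_class(filenames, class_names, per_class, seed=0):
    """Pick up to `per_class` files per class, reproducibly."""
    rng = random.Random(seed)
    selected = {}
    for name in class_names:
        # Sorting removes the dependence on directory listing order
        members = sorted(f for f in filenames if f.startswith(name + "__"))
        k = min(per_class, len(members))
        selected[name] = rng.sample(members, k)
    return selected


# Toy file names, used only for illustration
files = [f"CWE_A__case_{i:02d}.cpp" for i in range(30)]
files += [f"CWE_B__case_{i:02d}.cpp" for i in range(5)]
classes = ["CWE_A", "CWE_B"]

picked = sample_per_class(files, classes, per_class=10, seed=42)
print({name: len(chosen) for name, chosen in picked.items()})
# {'CWE_A': 10, 'CWE_B': 5}
```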
# %%time
# Initialize an empty list to store the file details
file_details = []

# Traverse the directory tree
for dirpath, dirs, files in os.walk('/content/cpp_folders'):
    for filename in files:
        # Extract the CWE-ID from the directory path
        cwe_id = os.path.basename(dirpath)
        # Extract the file short name from the file name
        file_short_name = re.match(r'.*__(.*)\.cpp', filename)
        if file_short_name:
            file_short_name = file_short_name.group(1)
        # Append the file details to the list
        file_details.append([cwe_id, file_short_name, filename])

# Convert the list to a dataframe
cpp_cwe_top_10_list_df = pd.DataFrame(file_details, columns=['CWE-ID', 'File-Short-Name', 'File-Full-Name'])

print(cpp_cwe_top_10_list_df.head())
## CPU times: user 43.9 ms, sys: 3.43 ms, total: 47.3 ms
## Wall time: 49.4 ms
                              CWE-ID                      File-Short-Name  \
0  CWE122_Heap_Based_Buffer_Overflow       cpp_CWE805_wchar_t_loop_84_bad   
1  CWE122_Heap_Based_Buffer_Overflow  cpp_CWE806_char_snprintf_81_goodG2B   
2  CWE122_Heap_Based_Buffer_Overflow                   c_src_char_cpy_72b   
3  CWE122_Heap_Based_Buffer_Overflow                c_CWE129_fgets_81_bad   
4  CWE122_Heap_Based_Buffer_Overflow       cpp_CWE805_wchar_t_memmove_68b   

                                      File-Full-Name  
0  CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...  
1  CWE122_Heap_Based_Buffer_Overflow__cpp_CWE806_...  
2  CWE122_Heap_Based_Buffer_Overflow__c_src_char_...  
3  CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fg...  
4  CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...  
# cpp_cwe_top_10_list_df = cpp_cwe_top_10_350_files_list_df
cpp_cwe_top_10_list_df.shape ## (100, 3)
(100, 3)
# print(cpp_cwe_top_10_350_files_list_df) ## [100 rows x 3 columns] ## [8750 rows x 3 columns]
# 8750/350 ## 25
# print(cpp_cwe_top_25_350_files_list_df) ## [8750 rows x 3 columns]
cpp_cwe_top_10_list_df ## 100 rows × 3 columns ## [8750 rows x 3 columns]
CWE-ID File-Short-Name File-Full-Name
0 CWE122_Heap_Based_Buffer_Overflow cpp_CWE805_wchar_t_loop_84_bad CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...
1 CWE122_Heap_Based_Buffer_Overflow cpp_CWE806_char_snprintf_81_goodG2B CWE122_Heap_Based_Buffer_Overflow__cpp_CWE806_...
2 CWE122_Heap_Based_Buffer_Overflow c_src_char_cpy_72b CWE122_Heap_Based_Buffer_Overflow__c_src_char_...
3 CWE122_Heap_Based_Buffer_Overflow c_CWE129_fgets_81_bad CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fg...
4 CWE122_Heap_Based_Buffer_Overflow cpp_CWE805_wchar_t_memmove_68b CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_...
... ... ... ...
95 CWE134_Uncontrolled_Format_String char_file_w32_vsnprintf_82a CWE134_Uncontrolled_Format_String__char_file_w...
96 CWE134_Uncontrolled_Format_String char_environment_vfprintf_84_goodB2G CWE134_Uncontrolled_Format_String__char_enviro...
97 CWE134_Uncontrolled_Format_String wchar_t_file_printf_84_bad CWE134_Uncontrolled_Format_String__wchar_t_fil...
98 CWE134_Uncontrolled_Format_String wchar_t_environment_w32_vsnprintf_84a CWE134_Uncontrolled_Format_String__wchar_t_env...
99 CWE134_Uncontrolled_Format_String wchar_t_connect_socket_printf_74b CWE134_Uncontrolled_Format_String__wchar_t_con...

100 rows × 3 columns

print(cpp_cwe_top_10_list_df['File-Full-Name'][2]) ## CWE122_Heap_Based_Buffer_Overflow__c_CWE129_fgets_62a.cpp
CWE122_Heap_Based_Buffer_Overflow__c_src_char_cpy_72b.cpp

...

# cpp_cwe_top_25_350_files_list_df['File-Full-Name'] ## [8750 rows x 3 columns]
print(cpp_cwe_top_10_list_df['File-Full-Name'][2]) ## [8750 rows x 3 columns]
CWE122_Heap_Based_Buffer_Overflow__c_src_char_cpy_72b.cpp
cpp_files_top_10_cwe_list_df = cpp_cwe_top_10_list_df.copy()
!cp -r cpp_folders cpp_files_top_10_cwe
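# Note: without -r, zip stores only the empty directory entry (hence the
# 192-byte archive listed below); use `zip -r` to include the files themselves.
!zip cpp_files_top_10_cwe.zip cpp_files_top_10_cwe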
  adding: cpp_files_top_10_cwe/ (stored 0%)
!mv cpp_files_top_10_cwe.zip temp.zip
!tar -czf cpp_files_top_10_cwe.tar.gz cpp_files_top_10_cwe
!ls -lh
total 767M
-rw-r--r--     1 root root 9.9M Jan  7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
-rw-r--r--     1 root root 671M Aug 11  2022 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
drwxr-xr-x     4 root root 4.0K Jan  7 10:38 bin.tf
-rw-r--r--     1 root root 339K Jan  7 10:33 bow_transformer.pk
-rw-r--r--     1 root root 756K Jan  7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x    27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r--     1 root root 875K Jan  7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x    12 root root 4.0K Jan  3 17:15 cpp_clean_files_top_10_cwe_omitted
-rw-r--r--     1 root root  15K Jan  7 10:35 cpp_clean_files_top_10_cwe_omitted.tar.gz
drwxr-xr-x     2 root root 4.6M Jan  7 10:46 cpp_data
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_files_top_10_cwe
-rw-r--r--     1 root root  29K Jan  7 10:46 cpp_files_top_10_cwe.tar.gz
-rw-r--r--     1 root root 9.1M Jan  7 10:44 cpp_files_tree.csv
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_folders
drwxr-xr-x     4 root root 4.0K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r--     1 root root 878K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
drwxr-xr-x 64101 root root 2.1M Jan  7 10:43 data
-rw-r--r--     1 root root  11M Jan  7 10:32 data_drop_na.csv
-rw-r--r--     1 root root  49M Jan  7 10:44 files_tree.csv
-rw-r--r--     1 root root  52K Jan  7 10:36 model.png
-rw-r--r--     1 root root 4.9M Jan  7 10:35 nb_model.pk
-rw-r--r--     1 root root 2.9M Jan  7 10:35 sard.zip
-rw-r--r--     1 root root  192 Jan  7 10:46 temp.zip
drwxr-xr-x    27 root root 4.0K Jan  7 10:35 test
-rw-r--r--     1 root root 201K Jan  7 10:33 tfidf_transformer.pk
drwxr-xr-x    27 root root 4.0K Jan  7 10:35 train
# !tar -czf cpp_files_top_10_cwe.tar.gz cpp_files_top_10_cwe
# # %%time
# # !tar -czf cpp_files_raw.tar.gz -C /content/cpp_data .
# !tar -czf cpp_all_files_raw.tar.gz -C /content/cpp_data .
# ## CPU times: user 36.3 ms, sys: 3.97 ms, total: 40.3 ms
# ## Wall time: 4.42 s
## extract or un-tar (unzip):
# !tar -xzf cpp_files_raw.tar.gz
# Copy the cpp_folders directory to cpp_clean_folders
shutil.copytree('cpp_folders', 'cpp_clean_folders')

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_folders'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Get the file path
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r') as file:
                file_data = file.read()
            # Remove /* ... */ block comments (// line comments are left in place)
            file_data = re.sub(r'/\*.*?\*/', '', file_data, flags=re.DOTALL)
            with open(file_path, 'w') as file:
                file.write(file_data)
!cp -r cpp_clean_folders cpp_clean_files_top_10_cwe
!tar -czf cpp_clean_files_top_10_cwe.tar.gz cpp_clean_files_top_10_cwe
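
The cleaning step above removes only `/* ... */` comments, and its pattern would also fire on a `/*` that appears inside a string literal. A common regex approach matches string and character literals alongside comments and keeps the literals. The sketch below is one such variant, under the assumption that raw string literals (`R"(...)"`) and backslash line continuations do not need handling; `strip_comments` is a hypothetical name.

```python
import re

# Match comments and string/char literals; literals are matched so that
# comment-like text inside them is left untouched.
TOKEN = re.compile(
    r"""//[^\n]*            # line comment
      | /\*.*?\*/           # block comment
      | "(?:\\.|[^"\\])*"   # string literal
      | '(?:\\.|[^'\\])*'   # character literal
    """,
    re.DOTALL | re.VERBOSE,
)


def strip_comments(source):
    """Remove // and /* */ comments, keeping string literals intact."""
    def replace(match):
        token = match.group(0)
        # Comments start with '/', literals start with a quote
        return "" if token.startswith("/") else token
    return TOKEN.sub(replace, source)


code = 'int a = 1; // x\n/* b */char *s = "http://x";'
print(strip_comments(code))
```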
# Copy the cpp_data directory to cpp_clean_data
shutil.copytree('cpp_data', 'cpp_clean_data')

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_data'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Get the file path
            file_path = os.path.join(dirpath, filename)
            with open(file_path, 'r') as file:
                file_data = file.read()
            # Remove /* ... */ block comments (// line comments are left in place)
            file_data = re.sub(r'/\*.*?\*/', '', file_data, flags=re.DOTALL)
            with open(file_path, 'w') as file:
                file.write(file_data)
# # %%time
# # !tar -czf cpp_files_raw.tar.gz -C /content/cpp_data .
# !tar -czf cpp_all_files_clean.tar.gz -C /content/cpp_clean_data .
# ## CPU times: user 36.3 ms, sys: 3.97 ms, total: 40.3 ms
# ## Wall time: 4.42 s
!cp cpp_all_files_clean.tar.gz cpp_all_clean_files_w_top_10_cwe_titles.tar.gz
cp: cannot stat 'cpp_all_files_clean.tar.gz': No such file or directory
!mkdir tar_gz_files
!cp *.gz tar_gz_files
# from google.colab import drive
# drive.mount('/content/drive')
 ## /content/tar_gz_files/cpp_clean_files_top_10_cwe.tar.gz
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
# Copy the cpp_clean_folders directory to cpp_clean_omitted_cwe_folders
shutil.copytree('cpp_clean_folders', 'cpp_clean_omitted_cwe_folders')

# Traverse the directory tree
for dirpath, dirs, files in os.walk('cpp_clean_omitted_cwe_folders'):
    for filename in files:
        if filename.endswith('.cpp'):
            # Extract the CWE-ID from the file name
            cwe_id = re.match(r'(.*?)__', filename)
            if cwe_id:
                cwe_id = cwe_id.group(1)
                # Get the file path
                file_path = os.path.join(dirpath, filename)
                with open(file_path, 'r') as file:
                    file_data = file.read()
                # Remove the CWE-ID prefix and the first word after it from the file contents
                file_data = re.sub(re.escape(cwe_id) + r'__\w+?_', '', file_data)
                with open(file_path, 'w') as file:
                    file.write(file_data)
                # Remove the same prefix and first word from the file name
                new_filename = re.sub(re.escape(cwe_id) + r'__\w+?_', '', filename)
                os.rename(file_path, os.path.join(dirpath, new_filename))
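# Note: if cpp_clean_files_top_10_cwe_omitted already exists, cp -r nests the
# source folder inside it instead of replacing its contents.
!cp -r cpp_clean_omitted_cwe_folders cpp_clean_files_top_10_cwe_omitted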
!tar -czf cpp_clean_files_top_10_cwe_omitted.tar.gz cpp_clean_files_top_10_cwe_omitted
!cp cpp_clean_files_top_10_cwe_omitted.tar.gz tar_gz_files
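
The step above strips the class label from the file contents and names so that a classifier cannot read the answer directly from identifiers. A minimal sketch of the same substitution as a standalone function follows; `strip_label` is a hypothetical name, and the identifier is built from a file name shown earlier in this notebook.

```python
import re


def strip_label(text, cwe_id):
    """Remove '<cwe_id>__<first word>_' wherever it occurs in text."""
    pattern = re.compile(re.escape(cwe_id) + r"__\w+?_")
    return pattern.sub("", text)


label = "CWE122_Heap_Based_Buffer_Overflow"
identifier = "CWE122_Heap_Based_Buffer_Overflow__cpp_CWE805_wchar_t_loop_84_bad"
print(strip_label(f"void {identifier}()", label))
# void CWE805_wchar_t_loop_84_bad()
```

In this example a secondary CWE number (`CWE805`) survives the substitution, so some label-related hints may remain in the text after this step.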
# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !cp tar_gz_files/* /content/drive/MyDrive/1st-SHARED-Data
# from google.colab import files
# files.download("cpp_clean_files_top_10_cwe_omitted.tar.gz")
# ## cpp_clean_files_top_10_cwe_omitted.tar.gz
# !gdown 1YQHdd457W4NjuTvJYiucKUr8pRwbGulj
# /content/cpp_clean_folders/CWE121_Stack_Based_Buffer_Overflow/CWE121_Stack_Based_Buffer_Overflow__CWE129_connect_socket_62a.cpp
## cpp_files_all_raw.zip

...

%%time
# # # !cd /content/cpp_data && zip /content/cpp_data.zip *.cpp
# # # !cd /content/cpp_data && zip /content/cpp_files_raw.zip *
# # !zip cpp_files_raw.zip /content/cpp_data/*
# # >>>
# # /bin/bash: line 1: /usr/bin/zip: Argument list too long

# # !cd /content/cpp_data && find . -type f -name "*.cpp" -exec zip /content/cpp_files_raw.zip {} \;
# !cd /content/cpp_data && find . -type f -exec zip /content/cpp_files_all_raw.zip {} \;

# ##
CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 5.25 µs
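
The "Argument list too long" error quoted above comes from the shell expanding a glob over tens of thousands of files. Building the archive from Python avoids the expansion entirely. The sketch below uses the standard `zipfile` and `os.scandir`; `zip_directory` and the temporary demo folder are hypothetical, and it archives only the top level of the folder, which matches the flat layout of `cpp_data`.

```python
import os
import tempfile
import zipfile


def zip_directory(source_dir, archive_path, suffix=""):
    """Zip the regular files in source_dir whose names end with suffix.

    Returns the number of files written. archive_path should live
    outside source_dir so the archive does not try to include itself.
    """
    count = 0
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        with os.scandir(source_dir) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(suffix):
                    archive.write(entry.path, arcname=entry.name)
                    count += 1
    return count


# Demo on throwaway files
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "cpp_data")
os.mkdir(source)
for name in ["a.cpp", "b.cpp", "c.cpp", "notes.txt"]:
    with open(os.path.join(source, name), "w") as handle:
        handle.write("// demo\n")

archive_file = os.path.join(workdir, "cpp_files_all_raw.zip")
archived = zip_directory(source, archive_file, suffix=".cpp")
print(archived)  # 3
```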
# # !ls /content/cpp_data/*.cpp | wc -l ## Argument list too long
# !ls /content/cpp_data/ | wc -l ## 411 ## 46399
# 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip  cpp_data
# cpp_all_clean_files_w_top_10_cwe_titles.tar.gz		     cpp_files_all_raw.zip
# cpp_all_files_clean.tar.gz				     cpp_files_top_10_cwe
# cpp_all_files_raw.tar.gz				     cpp_files_top_10_cwe.tar.gz
# cpp_clean_data						     cpp_files_tree.csv
# cpp_clean_files_top_10_cwe				     cpp_folders
# cpp_clean_files_top_10_cwe_omitted			     data
# cpp_clean_files_top_10_cwe_omitted.tar.gz		     files_tree.csv
# cpp_clean_files_top_10_cwe.tar.gz			     sample_data
# cpp_clean_folders					     tar_gz_files
# cpp_clean_omitted_cwe_folders				     temp.zip
# !ls
cpp_list[1]
# cpp_df[:1]
['CWE590_Free_Memory_Not_on_Heap__delete_struct_declare_04.cpp',
 '/content/data/107819-v1.0.0/src/testcases/CWE590_Free_Memory_Not_on_Heap/s03/CWE590_Free_Memory_Not_on_Heap__delete_struct_declare_04.cpp']
# cpp_list[0]
# cpp_df[:1]
cpp_df[1:]
File_Name File_Address
1 CWE590_Free_Memory_Not_on_Heap__delete_struct_... /content/data/107819-v1.0.0/src/testcases/CWE5...
2 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
3 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
4 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
5 CWE773_Missing_Reference_to_Active_File_Descri... /content/data/116798-v1.0.0/src/testcases/CWE7...
... ... ...
46396 CWE134_Uncontrolled_Format_String__wchar_t_fil... /content/data/81532-v1.0.0/src/testcases/CWE13...
46397 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46398 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46399 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46400 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...

46400 rows × 2 columns

# !gdown --id 1lvT5f-jOADyy2gjjnirzwyMSgNsf9Bz9
# !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
# ## 2022 Juliet C/C++ 1.3.1 with extra support
# https://samate.nist.gov/SARD/test-suites/116
# https://samate.nist.gov/SARD/test-suites/112
# 1st-C6AI-AIxCC-NLP-Clf-Vulns-Threat-CWE-IDs @ NIST SARD Juliet C++
# https://samate.nist.gov/SARD/test-suites

# from google.colab import drive
# drive.mount('/content/drive')
# !cp 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip /content/drive/MyDrive/1st-SHARED-Data
# !rm -r sample_data
# !mkdir data
# # !gdown --id 1lvT5f-jOADyy2gjjnirzwyMSgNsf9Bz9
# # !wget https://samate.nist.gov/SARD/downloads/test-suites/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip

# from google.colab import drive
# drive.mount('/content/drive')
# # !cp 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip /content/drive/MyDrive/1st-SHARED-Data
# !cp /content/drive/MyDrive/1st-SHARED-Data/2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip .
cpp_df
File_Name File_Address
0 CWE36_Absolute_Path_Traversal__wchar_t_connect... /content/data/96809-v1.0.0/src/testcases/CWE36...
1 CWE590_Free_Memory_Not_on_Heap__delete_struct_... /content/data/107819-v1.0.0/src/testcases/CWE5...
2 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
3 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
4 CWE23_Relative_Path_Traversal__char_connect_so... /content/data/89720-v1.0.0/src/testcases/CWE23...
... ... ...
46396 CWE134_Uncontrolled_Format_String__wchar_t_fil... /content/data/81532-v1.0.0/src/testcases/CWE13...
46397 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46398 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46399 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...
46400 CWE134_Uncontrolled_Format_String__char_consol... /content/data/79670-v1.0.0/src/testcases/CWE13...

46401 rows × 2 columns

!ls -lh cpp_clean_files_top_10_cwe_omitted ## /content/cpp_clean_files_top_10_cwe_omitted
total 44K
drwxr-xr-x 12 root root 4.0K Jan  7 10:47 cpp_clean_omitted_cwe_folders
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE121_Stack_Based_Buffer_Overflow
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE122_Heap_Based_Buffer_Overflow
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE127_Buffer_Underread
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE134_Uncontrolled_Format_String
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE23_Relative_Path_Traversal
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE36_Absolute_Path_Traversal
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE401_Memory_Leak
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE415_Double_Free
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE590_Free_Memory_Not_on_Heap
drwxr-xr-x  2 root root 4.0K Jan  3 17:15 CWE762_Mismatched_Memory_Management_Routines
!ls -lh
total 773M
-rw-r--r--     1 root root 9.9M Jan  7 10:38 1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz
-rw-r--r--     1 root root 671M Aug 11  2022 2022-08-11-juliet-c-cplusplus-v1-3-1-with-extra-support.zip
drwxr-xr-x     4 root root 4.0K Jan  7 10:38 bin.tf
-rw-r--r--     1 root root 339K Jan  7 10:33 bow_transformer.pk
-rw-r--r--     1 root root 756K Jan  7 10:35 cpp_6102_train_2781_test_top_25_cwe_ready.tar.gz
drwxr-xr-x     2 root root 6.5M Jan  7 10:46 cpp_clean_data
drwxr-xr-x    27 root root 4.0K Jan  2 04:10 cpp_cleaner_8750_files_each_350_top_25_cwe_omitted
-rw-r--r--     1 root root 875K Jan  7 10:35 _cpp_cleaner_8750_files_each_350_top_25_cwe_omitted.tar.gz
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_clean_files_top_10_cwe
drwxr-xr-x    13 root root 4.0K Jan  7 10:47 cpp_clean_files_top_10_cwe_omitted
-rw-r--r--     1 root root  31K Jan  7 10:47 cpp_clean_files_top_10_cwe_omitted.tar.gz
-rw-r--r--     1 root root  18K Jan  7 10:46 cpp_clean_files_top_10_cwe.tar.gz
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_clean_folders
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_clean_omitted_cwe_folders
drwxr-xr-x     2 root root 4.6M Jan  7 10:46 cpp_data
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_files_top_10_cwe
-rw-r--r--     1 root root  29K Jan  7 10:46 cpp_files_top_10_cwe.tar.gz
-rw-r--r--     1 root root 9.1M Jan  7 10:44 cpp_files_tree.csv
drwxr-xr-x    12 root root 4.0K Jan  7 10:46 cpp_folders
drwxr-xr-x     4 root root 4.0K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf
-rw-r--r--     1 root root 878K Jan  7 10:38 cpp_top_25_cwe_cnn_model.tf.tar.gz
drwxr-xr-x 64101 root root 2.1M Jan  7 10:43 data
-rw-r--r--     1 root root  11M Jan  7 10:32 data_drop_na.csv
-rw-r--r--     1 root root  49M Jan  7 10:44 files_tree.csv
-rw-r--r--     1 root root  52K Jan  7 10:36 model.png
-rw-r--r--     1 root root 4.9M Jan  7 10:35 nb_model.pk
-rw-r--r--     1 root root 2.9M Jan  7 10:35 sard.zip
drwxr-xr-x     2 root root 4.0K Jan  7 10:47 tar_gz_files
-rw-r--r--     1 root root  192 Jan  7 10:46 temp.zip
drwxr-xr-x    27 root root 4.0K Jan  7 10:35 test
-rw-r--r--     1 root root 201K Jan  7 10:33 tfidf_transformer.pk
drwxr-xr-x    27 root root 4.0K Jan  7 10:35 train
## import time
## global_start = time.time()

global_end = time.time()

# print("[T4 GPU & High RAM?] Global Time Duration: " + str(global_end - global_start))
print("Global Time Duration: " + str(global_end - global_start))

## [T4 GPU & High RAM] Global Time Duration: 643
## 643 s / 60 is approximately 11 minutes
## 8 mins on CPU & High RAM

## Global Time Duration: 940.3888688087463
Global Time Duration: 1051.4102718830109
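
The raw seconds printed above are easier to compare across runs when shown as minutes and seconds. A small sketch follows; `format_duration` is a hypothetical helper, and `time.perf_counter` is used here instead of `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time


def format_duration(seconds):
    """Format a duration in seconds as 'M min S s' (fractions dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes} min {secs} s"


start = time.perf_counter()
# ... work to be timed goes here ...
elapsed = time.perf_counter() - start

print(format_duration(643))                 # 10 min 43 s
print(format_duration(1051.4102718830109))  # 17 min 31 s
```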

This may take a few minutes.

%%time

## > 1GB !
# from google.colab import files
# files.download("1st_nlp_text_clf_cpp_top_cwe_v240104a.tar.gz")

##
CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 7.87 µs

1st UoL DSM140 CW - End
