[DeFP]
Ranking Warnings of Static Analysis Tools Using Representation Learning
Static analysis tools are frequently used to detect potential vulnerabilities in software systems. However, these tools inevitably report a large number of warnings with a high false positive rate, which costs developers significant time and effort to investigate. In this paper, we present DeFP, a novel method for ranking static analysis warnings. Based on the intuition that warnings with similar contexts tend to have similar labels (true positive or false positive), DeFP is built with two BiLSTM models that capture the patterns associated with the contexts of labeled warnings. For a set of new warnings, DeFP then estimates the likelihood of each warning being a true positive (i.e., an actual vulnerability) and ranks the warnings accordingly. Our experimental results on a dataset of 10 real-world projects show that, using DeFP, developers can find more than 90% of the actual vulnerabilities by investigating only 60% of the warnings. Moreover, DeFP improves over the state-of-the-art approach by 30% in both precision and recall. [Preprint]
Motivating Example
A Buffer Overflow warning, labeled [False Positive], reported by Flawfinder at line 1463 (corresponding to line 24 in the paper’s example) [Link]. Although the `strcat` at line 1463 is flagged, `rate_str` always points to one of the short, fixed strings returned by `aoc_rate_type_str` (lines 1186–1215), so the 32-byte `prefix` buffer cannot overflow here.
....|
1186| static const char *aoc_rate_type_str(enum ast_aoc_s_rate_type value)
1187| {
1188| const char *str;
1189|
1190| switch (value) {
1191| default:
1192| case AST_AOC_RATE_TYPE_NA:
1193| str = "NotAvailable";
1194| break;
1195| case AST_AOC_RATE_TYPE_FREE:
1196| str = "Free";
1197| break;
1198| case AST_AOC_RATE_TYPE_FREE_FROM_BEGINNING:
1199| str = "FreeFromBeginning";
1200| break;
1201| case AST_AOC_RATE_TYPE_DURATION:
1202| str = "Duration";
1203| break;
1204| case AST_AOC_RATE_TYPE_FLAT:
1205| str = "Flat";
1206| break;
1207| case AST_AOC_RATE_TYPE_VOLUME:
1208| str = "Volume";
1209| break;
1210| case AST_AOC_RATE_TYPE_SPECIAL_CODE:
1211| str = "SpecialCode";
1212| break;
1213| }
1214| return str;
1215| }
....|
1442|
1443| static void aoc_s_event(const struct ast_aoc_decoded *decoded, struct ast_str **msg)
1444| {
1445| const char *rate_str;
1446| char prefix[32];
1447| int idx;
1448|
1449| ast_str_append(msg, 0, "NumberRates: %d\r\n", decoded->aoc_s_count);
1450| for (idx = 0; idx < decoded->aoc_s_count; ++idx) {
1451| snprintf(prefix, sizeof(prefix), "Rate(%d)", idx);
1452|
1453| ast_str_append(msg, 0, "%s/Chargeable: %s\r\n", prefix,
1454| aoc_charged_item_str(decoded->aoc_s_entries[idx].charged_item));
1455| if (decoded->aoc_s_entries[idx].charged_item == AST_AOC_CHARGED_ITEM_NA) {
1456| continue;
1457| }
1458| rate_str = aoc_rate_type_str(decoded->aoc_s_entries[idx].rate_type);
1459| ast_str_append(msg, 0, "%s/Type: %s\r\n", prefix, rate_str);
1460| switch (decoded->aoc_s_entries[idx].rate_type) {
1461| case AST_AOC_RATE_TYPE_DURATION:
1462| strcat(prefix, "/");
1463| strcat(prefix, rate_str);
1464| ast_str_append(msg, 0, "%s/Currency: %s\r\n", prefix,
1465| decoded->aoc_s_entries[idx].rate.duration.currency_name);
1466| aoc_amount_str(msg, prefix,
1467| decoded->aoc_s_entries[idx].rate.duration.amount,
1468| decoded->aoc_s_entries[idx].rate.duration.multiplier);
1469| ast_str_append(msg, 0, "%s/ChargingType: %s\r\n", prefix,
1470| decoded->aoc_s_entries[idx].rate.duration.charging_type ?
1471| "StepFunction" : "ContinuousCharging");
1472| aoc_time_str(msg, prefix, "Time",
1473| decoded->aoc_s_entries[idx].rate.duration.time,
1474| decoded->aoc_s_entries[idx].rate.duration.time_scale);
1475| if (decoded->aoc_s_entries[idx].rate.duration.granularity_time) {
1476| aoc_time_str(msg, prefix, "Granularity",
1477| decoded->aoc_s_entries[idx].rate.duration.granularity_time,
1478| decoded->aoc_s_entries[idx].rate.duration.granularity_time_scale);
1479| }
1480| break;
1481| case AST_AOC_RATE_TYPE_FLAT:
1482| strcat(prefix, "/");
1483| strcat(prefix, rate_str);
1484| ast_str_append(msg, 0, "%s/Currency: %s\r\n", prefix,
1485| decoded->aoc_s_entries[idx].rate.flat.currency_name);
1486| aoc_amount_str(msg, prefix,
1487| decoded->aoc_s_entries[idx].rate.flat.amount,
1488| decoded->aoc_s_entries[idx].rate.flat.multiplier);
1489| break;
1490| case AST_AOC_RATE_TYPE_VOLUME:
1491| strcat(prefix, "/");
1492| strcat(prefix, rate_str);
1493| ast_str_append(msg, 0, "%s/Currency: %s\r\n", prefix,
1494| decoded->aoc_s_entries[idx].rate.volume.currency_name);
1495| aoc_amount_str(msg, prefix,
1496| decoded->aoc_s_entries[idx].rate.volume.amount,
1497| decoded->aoc_s_entries[idx].rate.volume.multiplier);
1498| ast_str_append(msg, 0, "%s/Unit: %s\r\n", prefix,
1499| aoc_volume_unit_str(decoded->aoc_s_entries[idx].rate.volume.volume_unit));
1500| break;
1501| case AST_AOC_RATE_TYPE_SPECIAL_CODE:
1502| ast_str_append(msg, 0, "%s/%s: %d\r\n", prefix, rate_str,
1503| decoded->aoc_s_entries[idx].rate.special_code);
1504| break;
1505| default:
1506| break;
1507| }
1508| }
1509| }
....|
DeFP’s Representation Model Architecture
The image above illustrates our SA warning ranking approach. Specifically, from the source code of the analyzed program and its set of warnings, we extract each warning’s reported statement and the associated program slice. For each warning, the reported statement and the corresponding program slice are converted into vectors and fed to the BiLSTM models, which predict the warning’s likelihood of being a TP. All warnings of the program are then ranked according to their predicted scores.
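As a rough illustration of the final ranking step, here is a minimal sketch (not DeFP’s actual API; the `Warning` fields and the `score` callback, which stands in for the trained BiLSTM models, are our assumptions):

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Warning:
    location: str            # e.g., "file.c:1463" (hypothetical field)
    statement: str           # the reported statement
    slice_tokens: List[str]  # tokens of the associated program slice

def rank_warnings(
    warnings: Sequence[Warning],
    score: Callable[[Warning], float],  # stand-in for the trained models
) -> List[Tuple[float, Warning]]:
    """Score every warning (predicted likelihood of being a TP) and sort
    in descending order, so the warnings most likely to be actual
    vulnerabilities are investigated first."""
    return sorted(((score(w), w) for w in warnings),
                  key=lambda pair: pair[0], reverse=True)
```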
Identifier Abstraction Component
DeFP abstracts all identifiers before feeding them to the models: variables, function names, and constants in the extracted program slices are replaced with common symbolic names. See the source file for the exact identifier abstraction rules.
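The sketch below only conveys the idea, mapping each distinct identifier, in order of first occurrence, to an indexed symbolic name (the VAR/FUNC/CONST naming scheme and the pre-tagged token input are our assumptions, not necessarily DeFP’s rules):

```python
from collections import defaultdict
from typing import List, Tuple

def abstract_identifiers(tokens: List[Tuple[str, str]]) -> List[str]:
    """tokens: (text, kind) pairs, kind in {"var", "func", "const", "other"}.
    Variables, function names, and constants are replaced with symbolic
    names; all other tokens are kept as-is."""
    prefixes = {"var": "VAR", "func": "FUNC", "const": "CONST"}
    counters = defaultdict(int)   # per-kind index counters
    mapping, out = {}, []
    for text, kind in tokens:
        if kind not in prefixes:
            out.append(text)
            continue
        if (kind, text) not in mapping:
            counters[kind] += 1
            mapping[(kind, text)] = f"{prefixes[kind]}{counters[kind]}"
        out.append(mapping[(kind, text)])
    return out

# e.g., [("strcat", "func"), ("prefix", "var"), ("rate_str", "var")]
# yields ["FUNC1", "VAR1", "VAR2"]
```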
Dataset
To train and evaluate an ML model for ranking SA warnings, we need a set of warnings labeled as TPs or FPs. Currently, most approaches are trained and evaluated on synthetic datasets such as Juliet [1] and SARD [2]. However, these datasets contain only simple examples artificially created from known vulnerable patterns, so the patterns that ML models capture from them may not reflect real-world scenarios [3]. To evaluate our solution and the others on real-world data, we construct a dataset containing 6,620 warnings in 10 open-source projects [4], [5].
DOWNLOAD LINK
Note: read the subject systems’ source files with the proper encoding; otherwise, warning locations may be misplaced.
| No. | Project | BO #W | BO #TP | BO #FP | NPD #W | NPD #TP | NPD #FP |
|---|---|---|---|---|---|---|---|
| 1 | Asterisk | 2049 | 63 | 1986 | 133 | 0 | 133 |
| 2 | FFmpeg | 1139 | 387 | 752 | 105 | 37 | 68 |
| 3 | Qemu | 882 | 396 | 486 | 72 | 39 | 33 |
| 4 | OpenSSL | 595 | 53 | 542 | 9 | 2 | 7 |
| 5 | Xen | 388 | 15 | 373 | 23 | 6 | 17 |
| 6 | VLC | 288 | 20 | 268 | 16 | 2 | 14 |
| 7 | Httpd | 250 | 45 | 205 | 17 | 0 | 17 |
| 8 | Pidgin | 250 | 13 | 237 | 242 | 0 | 242 |
| 9 | LibPNG | 83 | 9 | 74 | 2 | 0 | 2 |
| 10 | LibTIFF | 74 | 9 | 65 | 3 | 3 | 0 |
|  | Total | 5998 | 1010 | 4988 | 622 | 89 | 533 |

#W, #TP, and #FP are the numbers of total warnings, true positives, and false positives; BO and NPD denote Buffer Overflow and Null Pointer Dereference.
Experimental Results
RQ1. How accurate is DeFP in ranking SA warnings, and how does it compare to the state-of-the-art CNN-based approach by Lee et al. [6]?
Precision and recall of the TP warnings found in the top-k% ranked warnings:

| WN | Project | Method | Top-5% Prec. | Top-5% Rec. | Top-10% Prec. | Top-10% Rec. | Top-20% Prec. | Top-20% Rec. | Top-50% Prec. | Top-50% Rec. | Top-60% Prec. | Top-60% Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BO | Qemu | DeFP | 82.22% | 9.34% | 67.78% | 15.40% | 65.14% | 28.78% | 52.27% | 58.08% | 50.38% | 67.43% |
| BO | Qemu | CNN | 71.11% | 8.09% | 53.33% | 12.13% | 46.86% | 20.72% | 44.32% | 49.25% | 43.02% | 57.57% |
| BO | FFmpeg | DeFP | 67.27% | 9.56% | 61.74% | 18.34% | 52.43% | 31.00% | 38.95% | 57.37% | 37.72% | 66.66% |
| BO | FFmpeg | CNN | 30.91% | 4.40% | 31.30% | 9.30% | 33.24% | 19.64% | 32.46% | 47.80% | 33.04% | 58.39% |
| BO | Asterisk | DeFP | 34.00% | 53.97% | 18.54% | 60.26% | 10.73% | 70.00% | 5.18% | 84.10% | 4.56% | 88.97% |
| BO | Asterisk | CNN | 11.00% | 17.56% | 8.78% | 28.59% | 7.56% | 49.36% | 4.49% | 72.95% | 3.82% | 74.49% |
| BO | COMBINED | DeFP | 66.00% | 19.60% | 56.00% | 33.27% | 43.92% | 52.18% | 27.50% | 81.68% | 24.82% | 88.42% |
| BO | COMBINED | CNN | 43.00% | 12.77% | 39.67% | 23.56% | 34.25% | 40.69% | 25.40% | 75.45% | 23.46% | 83.56% |
| NPD | COMBINED | DeFP | 80.00% | 26.93% | 65.00% | 43.66% | 47.20% | 66.14% | 25.81% | 89.74% | 22.58% | 94.25% |
| NPD | COMBINED | CNN | 63.33% | 21.37% | 43.33% | 29.15% | 38.40% | 53.99% | 21.29% | 74.25% | 19.62% | 82.09% |

WN denotes the warning type: BO (Buffer Overflow) or NPD (Null Pointer Dereference).
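For reference, the precision and recall values in these tables follow the standard top-k% definitions; below is a minimal sketch of computing them from a ranked list of ground-truth labels (our formulation, not code from the paper):

```python
import math
from typing import List, Tuple

def precision_recall_at_k(ranked_labels: List[int],
                          k_percent: float) -> Tuple[float, float]:
    """ranked_labels: 1 (TP) or 0 (FP) for each warning, ordered by
    predicted score (best first). Returns (precision, recall) among the
    top k% of warnings."""
    n = math.ceil(len(ranked_labels) * k_percent / 100)
    found = sum(ranked_labels[:n])      # TPs in the top-k% slice
    total_tp = sum(ranked_labels)       # all TPs in the ranked list
    precision = found / n if n else 0.0
    recall = found / total_tp if total_tp else 0.0
    return precision, recall

# e.g., precision_recall_at_k([1, 1, 0, 1, 0, 0, 0, 0, 0, 0], 20)
# -> (1.0, 0.666...): the top 2 warnings are both TPs, covering 2 of 3 TPs.
```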
RQ2. How does the extracted warning context affect DeFP’s performance?
Precision and recall of the TP warnings found in the top-k% ranked warnings:

| WN | Project | Method | Top-5% Prec. | Top-5% Rec. | Top-10% Prec. | Top-10% Rec. | Top-20% Prec. | Top-20% Rec. | Top-50% Prec. | Top-50% Rec. | Top-60% Prec. | Top-60% Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BO | COMBINED | RAW | 47.00% | 13.96% | 39.83% | 23.66% | 32.92% | 39.11% | 24.30% | 72.18% | 22.71% | 80.89% |
| BO | COMBINED | CD | 58.00% | 17.23% | 40.50% | 24.06% | 25.25% | 30.00% | 19.83% | 58.91% | 20.96% | 74.65% |
| BO | COMBINED | DD | 48.00% | 14.26% | 42.33% | 25.15% | 34.92% | 41.49% | 25.03% | 74.36% | 23.10% | 82.28% |
| BO | COMBINED | CD & DD | 66.00% | 19.60% | 56.00% | 33.27% | 43.92% | 52.18% | 27.50% | 81.68% | 24.82% | 88.42% |
| NPD | COMBINED | RAW | 40.00% | 13.40% | 48.33% | 32.42% | 36.80% | 51.57% | 23.55% | 81.90% | 20.98% | 87.65% |
| NPD | COMBINED | CD | 43.33% | 14.71% | 36.67% | 24.84% | 35.20% | 49.41% | 24.52% | 85.36% | 20.70% | 86.47% |
| NPD | COMBINED | DD | 70.00% | 23.40% | 51.67% | 34.58% | 40.80% | 57.19% | 24.84% | 86.47% | 22.04% | 92.09% |
| NPD | COMBINED | CD & DD | 80.00% | 26.93% | 65.00% | 43.66% | 47.20% | 66.14% | 25.81% | 89.74% | 22.58% | 94.25% |
RAW, CD, DD, and CD & DD denote the warning contexts captured by the raw source code, program slices on control dependencies, program slices on data dependencies, and program slices on both control and data dependencies, respectively.
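To make the four context variants concrete, here is a hedged sketch of extracting a warning’s context as a backward slice over a program dependence graph (PDG); the PDG itself is assumed to be precomputed by an external analysis, with an edge u → v meaning that v depends on u, and the `kind` edge attribute is our assumption:

```python
import networkx as nx

def slice_context(pdg: nx.DiGraph, reported_stmt, kinds=("control", "data")):
    """Backward slice: the reported statement plus every statement it
    (transitively) depends on via edges of the requested kinds."""
    kept = [(u, v) for u, v, d in pdg.edges(data=True)
            if d.get("kind") in kinds]
    sub = pdg.edge_subgraph(kept)
    if reported_stmt not in sub:        # no dependencies of these kinds
        return {reported_stmt}
    return nx.ancestors(sub, reported_stmt) | {reported_stmt}

# kinds=("control",) gives the CD context, kinds=("data",) the DD context,
# and the default gives both (CD & DD); RAW would skip slicing entirely.
```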
RQ3. How does highlighting the reported statement (RP) impact the performance of DeFP?
Precision and recall of the TP warnings found in the top-k% ranked warnings:

| WN | Project | Method | Top-5% Prec. | Top-5% Rec. | Top-10% Prec. | Top-10% Rec. | Top-20% Prec. | Top-20% Rec. | Top-50% Prec. | Top-50% Rec. | Top-60% Prec. | Top-60% Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BO | COMBINED | W/O RP | 59.00% | 17.52% | 54.50% | 32.38% | 41.00% | 48.71% | 26.83% | 79.70% | 24.43% | 87.03% |
| BO | COMBINED | With RP | 66.00% | 19.60% | 56.00% | 33.27% | 43.92% | 52.18% | 27.50% | 81.68% | 24.82% | 88.42% |
| NPD | COMBINED | W/O RP | 70.00% | 23.66% | 55.00% | 37.12% | 42.40% | 59.61% | 23.55% | 82.03% | 19.90% | 83.20% |
| NPD | COMBINED | With RP | 80.00% | 26.93% | 65.00% | 43.66% | 47.20% | 66.14% | 25.81% | 89.74% | 22.58% | 94.25% |
RQ4. How does the identifier abstraction (IA) component impact the performance of DeFP?
Precision and recall of the TP warnings found in the top-k% ranked warnings:

| WN | Project | Method | Top-5% Prec. | Top-5% Rec. | Top-10% Prec. | Top-10% Rec. | Top-20% Prec. | Top-20% Rec. | Top-50% Prec. | Top-50% Rec. | Top-60% Prec. | Top-60% Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BO | COMBINED | W/O IA | 62.33% | 18.51% | 49.00% | 29.11% | 39.83% | 47.33% | 27.07% | 80.40% | 24.54% | 87.43% |
| BO | COMBINED | With IA | 66.00% | 19.60% | 56.00% | 33.27% | 43.92% | 52.18% | 27.50% | 81.68% | 24.82% | 88.42% |
| NPD | COMBINED | W/O IA | 56.67% | 19.15% | 48.33% | 32.75% | 41.60% | 58.43% | 24.84% | 86.54% | 22.85% | 95.56% |
| NPD | COMBINED | With IA | 80.00% | 26.93% | 65.00% | 43.66% | 47.20% | 66.14% | 25.81% | 89.74% | 22.58% | 94.25% |
References
[1] V. Okun, A. Delaitre, P. E. Black et al., “Report on the Static Analysis Tool Exposition (SATE) IV,” NIST Special Publication, vol. 500, p. 297, 2013.
[2] National Institute of Standards and Technology, “Software Assurance Reference Dataset.” [Online]. Available: https://samate.nist.gov/SRD/index.php
[3] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet?” IEEE Transactions on Software Engineering, 2021.
[4] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” arXiv preprint arXiv:1909.03496, 2019.
[5] G. Lin, W. Xiao, J. Zhang, and Y. Xiang, “Deep learning-based vulnerable function detection: A benchmark,” in International Conference on Information and Communications Security. Springer, 2019, pp. 219–232.
[6] S. Lee, S. Hong, J. Yi, T. Kim, C.-J. Kim, and S. Yoo, “Classifying false positive static checker alarms in continuous integration using convolutional neural networks,” in 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST). IEEE, 2019, pp. 391–401.
Cite us
@inproceedings{ngo2021ranking,
title={Ranking Warnings of Static Analysis Tools Using Representation Learning},
author={Ngo, Kien-Tuan and Do, Dinh-Truong and Nguyen, Thu-Trang and Vo, Hieu Dinh},
booktitle={2021 28th Asia-Pacific Software Engineering Conference (APSEC)},
pages={327--337},
year={2021},
organization={IEEE}
}