In this study of the tradeoff between utility and privacy, we used S&P 500 data as a proof of concept for association rule mining on privacy-preserved data. We chose the S&P 500 for three reasons. First, unlike synthetically generated data, the S&P 500 index, though public, is real-world data and therefore reflects the political and social events that occurred over the period under consideration. Second, stock price can serve as a quasi-identifier: a seasoned stockbroker could identify an organization simply by looking at its stock price fluctuations over a period, even if the ticker information is removed. Stock price is also the most important variable in any data mining related to the stock market, so this was the one variable we had to perturb in order to test for anonymity and utility. Finally, because the data is numerical, we could apply statistical data perturbation techniques. Association rule mining techniques were first applied to the original data. The data was then perturbed, and trend matching was again performed, using binary similarity algorithms, on the stocks identified from the original data. Three tests, each using a different random data generation, were evaluated for data utility and privacy. The results re-emphasized the tradeoff between data utility and privacy: greater perturbation meant greater privacy but higher utility loss, and vice versa.
Keywords: Association Rule Mining; Tradeoff; Privacy-preserved data; Utility.