The Hazards of Using Big Data

Computer Science

By Daniel Pomerantz

Cohort 2020-2021

Despite (or maybe because of!) having a technical background, my goal for the AI release was to develop materials on the ethics and effects of ever-increasing automation in the world. It is critical that, as individuals, we can be “informed citizens,” and my idea was (and still is) to have a course dedicated to giving students a basic understanding of the pros and cons at stake in various current events. A key learning objective of the course would be to define enough vocabulary that students are able to, for example, critically analyze the contents of a newspaper or magazine article.

That said, most of these topics can also be presented in isolation and within the context of other courses. Below are some of the resources I’ve developed to aid teachers in presenting this material. Some specific existing courses where I imagine these resources could be useful include 420-BWC (Introduction to Computers), 420-BXC (Introduction to Programming), or an ethics course such as 345-BXH (Applied Ethics). They can also be presented in many different Computer Science department courses as lessons on the importance of asking critical questions about how one’s work will be used in practice.

Resources

Big Data:

This is probably the most important module, as it aids in understanding the limitations of big data. We should not just “throw the data into a machine learning algorithm and see what it comes up with.” Doing this is extremely dangerous for several reasons. For starters, it can result in biased and discriminatory conclusions that unfairly target particular groups. These are usually the same groups that are already underrepresented in many places, and machine learning algorithms, when applied recklessly, only serve to exacerbate this.

To understand this, we must consider that while there may be data to support a claim, there are frequently underlying reasons (hidden variables) that lead to the results. This is the idea of “correlation vs causation.” An example of this is given in the slides: it turns out that ice cream sales and shark attacks rise and fall at the same time (due, of course, to the hidden variable “summer”). It would be wrong to conclude that, since they occur at the same time, we should ban ice cream sales to reduce shark attacks. Another famous example of this is the claim that pirates prevented global warming (see graph below).

Back when there were more pirates on Earth, the global temperature was lower. But this is obviously not a question of causation. Hence we should always seek to explain these data phenomena and not just “trust the data.”
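The ice cream and shark attack example can be reproduced with a small simulation. The following is a minimal sketch with invented numbers (the temperature range, the slopes, and the noise levels are all assumptions chosen for illustration, not real data). Temperature plays the role of the hidden variable “summer”: it drives both quantities, and neither one influences the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing the best straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hidden variable: daily temperature (invented range, in degrees Celsius).
temperature = [rng.uniform(0, 30) for _ in range(2000)]

# Both quantities depend on temperature only, never on each other.
ice_cream = [10 * t + rng.gauss(0, 40) for t in temperature]
sharks = [0.2 * t + rng.gauss(0, 1.5) for t in temperature]

# Correlation in the raw data versus correlation after accounting for temperature.
raw = pearson(ice_cream, sharks)
controlled = pearson(residuals(ice_cream, temperature),
                     residuals(sharks, temperature))

print(f"correlation in raw data:            {raw:.2f}")
print(f"correlation after removing weather: {controlled:.2f}")
```

In the raw data the two series are strongly correlated, but once the effect of temperature is subtracted from each, the remaining correlation is close to zero. Someone who only “trusts the data” sees the first number and never computes the second.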

Additionally, there is a risk of “p-hacking,” which results from the concept of “statistically significant.” In general, we say results are “statistically significant” if results at least this strong would occur by chance no more than 5% of the time. For example, in medical studies, participants are divided into two groups: a control group that receives a placebo pill and a group that receives the medicine. If the group receiving the real pill has improved outcomes (compared to the control group) that would occur only 5% of the time if the pill did nothing at all, then the researchers conclude that the results are statistically significant and that the pill is useful.

Essentially, what this can end up meaning, though, is that if you test 1000 different hypotheses, then inevitably some of them will appear to be true by accident. If something has a 5% chance of occurring and you try it 1000 times, you should expect it to occur about 50 times. For more on p-hacking, please see the following link
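The effect of testing many hypotheses can be checked with a short simulation. This is a minimal sketch under stated assumptions: each of the 1000 “studies” tests a pill that does nothing, outcomes are drawn from a standard normal distribution, and significance is decided with a simple two-sided z-test at the 5% level. The group size of 50 and the function name `run_null_study` are invented for illustration.

```python
import random

def run_null_study(rng, group_size=50):
    """Simulate one trial of a pill that does nothing.

    Returns True if the result would be declared statistically significant.
    """
    control = [rng.gauss(0, 1) for _ in range(group_size)]
    treated = [rng.gauss(0, 1) for _ in range(group_size)]  # same distribution: no real effect
    diff = sum(treated) / group_size - sum(control) / group_size
    # Outcomes have a known standard deviation of 1, so the difference in
    # means has standard error sqrt(2 / group_size).
    z = diff / (2 / group_size) ** 0.5
    return abs(z) > 1.96  # two-sided 5% threshold

rng = random.Random(42)  # fixed seed so the run is repeatable
num_studies = 1000

# Count how many useless pills pass the significance test anyway.
false_positives = sum(run_null_study(rng) for _ in range(num_studies))

# Probability that at least one of the studies is a false positive.
chance_of_at_least_one = 1 - 0.95 ** num_studies

print(f"{false_positives} of {num_studies} useless pills looked significant")
print(f"chance of at least one false positive: {chance_of_at_least_one:.6f}")
```

Roughly 5% of the studies come out “significant” even though no pill has any effect. Reporting only those studies, and staying silent about the rest, is the essence of p-hacking.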

Big Brother:

In this set of slides, there is a discussion of the term “Big Brother” as a proxy term for “surveillance” and the notion that everything we do is being stored. What obligation do private companies such as Google have towards maintaining our privacy? What about the internet service providers? Should they be required to purge our web histories after a certain amount of time?

Data tracking has some positive uses: Google developed an application called “Google Flu Trends” (which was later discontinued). The idea was to prevent the spread of a disease by quickly identifying trends such as an increase in search queries that suggest illness (e.g. “is one degree above normal a fever?”). Who should own this data, though? While it’s easy to say private corporations shouldn’t hold it, big government is also a problem. (In fact, the original use of the term Big Brother was for government.)

Facial Recognition:

In this (shorter) set of slides, some of the pros and cons of facial recognition are discussed. There are some advantages to using facial recognition, for example, to catch “criminals.” But these advantages come with huge risks (police state, constant advertisement). An additional question not mentioned in the slides is “should parents be allowed to post pictures of their children to social media?” This is being done without the consent of the children (who cannot provide informed consent). These pictures can then be used by future algorithms (which will no doubt be better than today’s algorithms) to identify the children in many ways.

Text Generation Systems:

This slide deck was the first part of a presentation given at Dawson’s Ped Days in 2020. It provides an overview of the history of text generation systems as they improved from the early generations to today’s systems, which are capable of writing very advanced and detailed texts. This talk also discusses an important notion in AI called the “Turing test,” named after Alan Turing. The idea is that an AI system is considered to pass the test if a user of the system cannot determine whether the system is an AI or a human.

Turing test:

Below is an amusing chat between Eliza, a chatbot imitating a psychologist, designed in the 1960s at MIT, and a human. The chatbot was designed to follow a script.
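To give a sense of what “following a script” means, here is a minimal sketch of a chatbot in the same spirit. It is not the original Eliza program: the three rules, the fallback reply, and the function name `respond` are invented for illustration, and the real system used a far richer script (including swapping pronouns such as “my” and “your”).

```python
import re

# Hypothetical mini-script: (pattern, reply template) pairs, checked in order.
RULES = [
    (re.compile(r"\bI feel (.+)", re.IGNORECASE), "Why do you feel {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father|family)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]

# Used when no rule matches, so the bot always has something to say.
FALLBACK = "Please go on."

def respond(message):
    """Return a scripted reply by echoing part of the user's message."""
    text = message.strip().rstrip(".!?")  # drop trailing punctuation before echoing
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(match.group(1))
    return FALLBACK

for line in ["I feel tired today.", "My mother never listens", "Hello"]:
    print("Human:", line)
    print("Bot:  ", respond(line))
```

The program has no understanding of what is being said. It only matches patterns and reflects the user’s own words back, which is why such chats can feel surprisingly human for a few turns and then fall apart.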

Other Interesting Links

Here are some other interesting links, relevant to artificial intelligence.

We frequently make split-second “decisions” as drivers of cars. If a car stops short in front of us, and we swerve to avoid it but hit someone else in the process, we would normally write this off as a “reflex” (hence the term “accident”). However, in designing self-driving cars, we are forced to program these choices ahead of time. The video discusses some ramifications of this. The scholars featured in this video also conducted a massive survey of people from across the globe to understand their personal choices in these situations. Although the data is interesting, this raises the very important question “do we really want to determine our morals based on majority rule?” Such “reasoning” has been the foundation of many of the world’s worst historical atrocities.

This is an interesting summary of the famous 1996 chess match between IBM’s Deep Blue and Garry Kasparov, the world chess champion at the time. This match was personally meaningful for me and got me interested in artificial intelligence, because I play chess competitively. It’s notable that at the time of the event, the main advantage the computer had was a psychological one: it didn’t get tired! Today’s computers (such as AlphaZero) are so much more powerful that there is simply no contest in a chess match of “man vs machine.”

Initial Presentation: Here is a presentation I gave during one of the meetings of Dawson’s AI-themed community of practice. It is included here for completeness as well as context. During this presentation, I discussed my goals for the release.

Dawson AI