data visualisation
25 May 2016. Fourth part of my notes for the Coursera Big Data specialization.
Visualisation is important as it can:
- reveal patterns that statistics may not
- allow humans to comprehend patterns in larger datasets than they could otherwise
- exercise the highest-bandwidth channel into the human brain
Data Types and Visual Mappings
Data types can be classified as Nominal, Ordinal or Quantitative:
- Nominal: fruits (apples, oranges). Operations: =, ≠
- Ordinal: quality of meats (B, A, AA, AAA). Operations: =, ≠, <, >, <=, >=
- Quantitative: Interval (arbitrary location of zero) such as dates or locations, or Ratio (fixed zero) like physical measurements. Operations: =, ≠, <, >, <=, >=, -, /
Bertin, in Semiology of Graphics (1967), defined 7 visual attributes: Position, Size, Value, Texture, Color, Orientation and Shape (W1 L1 04, slide 31). Bertin also stated that variable types map to certain types of attributes:
| Attribute | N | O | Q |
| --- | --- | --- | --- |
| Position | N | O | Q |
| Size | N | O | Q |
| Value | N | O | (Q) |
| Texture | N | (O) | |
| Color | N | | |
| Orientation | N | | |
| Shape | N | | |
Moreover, the perceptual accuracy of the human brain for these attributes follows this ordering (from most accurate to least accurate): Position, Length, Angle/Slope, Area, Volume, Color/Density.
When mapping data to visual attributes, the challenge is to pick the best encoding (or mapping) from the many possibilities. Consider (see the sketch after this list):
- Importance ordering: encode the most important information in the most perceptually accurate way
- Expressiveness: depict all the data, and only the data
- Consistency: the properties of the image (visual attributes) should match the properties of the data
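A minimal matplotlib sketch of the importance-ordering principle, with made-up data (all variable names and values are illustrative, not from the course): the most important variable (sales) is mapped to position, the secondary nominal variable (region) to color.

```python
import matplotlib.pyplot as plt

# Hypothetical data: monthly sales (quantitative, most important)
# and region (nominal, secondary).
months = [1, 2, 3, 4, 5, 6]
sales_north = [12, 15, 14, 18, 21, 24]
sales_south = [10, 11, 13, 13, 16, 17]

# Importance ordering: sales -> position (most accurate channel),
# region -> color (sufficient for a nominal variable).
plt.plot(months, sales_north, color="tab:blue", marker="o", label="North")
plt.plot(months, sales_south, color="tab:orange", marker="o", label="South")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.legend(title="Region")
plt.show()
```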
Humans tend to be better at estimating length than either area or volume. Typically, length is estimated with an exponent between 0.9 and 1.1, area between 0.6 and 0.9, and volume between 0.5 and 0.8. For example, for a circle of area A, the human eye perceives an area of roughly $S = 0.98 \cdot A^{0.87}$ (J. J. Flannery, 1971, "The relative effectiveness of some graduated point symbols in the presentation of quantitative data").
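Proportional symbol maps often compensate for this under-estimation. Here is a minimal sketch that inverts the perceived-area model above ($S = 0.98 \cdot A^{0.87}$) to choose a circle radius whose *perceived* area is proportional to the data value; the helper name, the `scale` parameter and the example values are mine, not from the course.

```python
import math

def flannery_radius(value, scale=1.0):
    """Radius of a circle whose perceived area is proportional to `value`,
    using the perceived-area model S = 0.98 * A**0.87 from the notes.
    `scale` is an arbitrary plotting factor (an assumption, not from the course)."""
    target_perceived_area = scale * value
    # Invert S = 0.98 * A**0.87 to find the actual area A to draw.
    actual_area = (target_perceived_area / 0.98) ** (1 / 0.87)
    return math.sqrt(actual_area / math.pi)

# Naive sqrt scaling vs. compensated scaling for a few values:
# the compensated radii grow faster, counteracting the under-estimation.
for v in [1, 10, 100]:
    naive = math.sqrt(v / math.pi)
    print(f"value={v:>3}  naive r={naive:.2f}  compensated r={flannery_radius(v):.2f}")
```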
To critique a visualisation, do the following: first, consider the purpose of the visualisation and who the intended audience is. Then ascertain your initial reaction. Then examine the visualisation in detail and answer questions such as: is the design appealing? Is it immediately understandable? Does it require excessive cognitive effort? Does it successfully highlight important information, or does it omit important information? Does it distort the information? Is it memorable?
Ethics and Privacy
Be very careful about the representativeness of the test set if we want to generalise. In the "Barrow alcoholism study", participants were not in control of their data nor of the context in which the results were presented. Even those who did not participate in the study incurred harm. We have to apply an ethical smell test: researchers appear to have placed their own interests ahead of those of the research subjects, the client and society, breaking ethical principles.
There are responsibilities towards funders, clients, employers, research subjects, research team colleagues and society. We have to be careful when protecting sensitive data.
You have to be careful about security and protecting privacy. Techniques include access control, query control, perturbation-based techniques, secure multi-party computation and anonymity. Removing "identifiers" may not be sufficient to ensure anonymity. k-anonymity ensures that, for any query, each record cannot be distinguished from at least k-1 other records (see the sketch after this list). Two recurring features of privacy failures are:
- high-dimensional data used to distinguish individuals
- combination of multiple datasets to re-identify individuals
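A minimal sketch of a k-anonymity check, assuming tabular records and a hand-picked list of quasi-identifiers (the column names and data are illustrative, not from the course):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values
    appears in at least k records."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(combos.values()) >= k

# Illustrative records: removing the name is not enough, because
# the combination (zip, age) may still single someone out.
records = [
    {"name": "A", "zip": "98101", "age": 34, "diagnosis": "flu"},
    {"name": "B", "zip": "98101", "age": 34, "diagnosis": "cold"},
    {"name": "C", "zip": "98102", "age": 51, "diagnosis": "flu"},
]

print(is_k_anonymous(records, ["zip", "age"], k=2))  # False: record C is unique
```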
Differential Privacy
- Differential privacy is a guarantee from the data collector to the individuals in the dataset.
- The chance that the noisy, publicized result will be R is about the same whether or not you include your data: $\frac{\Pr(Q(D_I) = R)}{\Pr(Q(D_{I \setminus i}) = R)} \leq A \quad \forall i \in I, \forall R$, where $D_{I \setminus i}$ is the dataset with individual $i$'s data removed
- Q is the privatized query
- A is a value very close to 1. If A is large, no privacy is provided. If A=1, then the data has no utility
- we define $A = e^\epsilon$ for some small $\epsilon > 0$
Laplacian noise
The sensitivity $\Delta F$ of a query $F$ is the maximum change in its result when a single individual's data is added or removed. We must add noise that is a function of this sensitivity: the higher the sensitivity, the more we have to perturb the result to prevent an adversary from distinguishing between the two possible worlds. The simplest approach is Laplacian noise.
$f(x \mid \mu, b) = \frac{1}{2b} e^{-\frac{|x-\mu|}{b}}$ is the density of $Lap(\mu, b)$, where
- $\mu$ is the position
- $b$ is the scale (the spread)
$D$ is the dataset. The scale depends on $\epsilon$ and $\Delta F$: $b = \frac{\Delta F}{\epsilon}$. The position depends on $F(D)$: $\mu = F(D)$.
So the privatized result is $R = F(D) + Lap\left(0, \frac{\Delta F}{\epsilon}\right)$, i.e. $R \sim Lap\left(F(D), \frac{\Delta F}{\epsilon}\right)$.
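A sketch of this mechanism, assuming a simple counting query (so the sensitivity is $\Delta F = 1$, since adding or removing one individual changes a count by at most 1); the function name and the numeric values are illustrative.

```python
import numpy as np

def laplace_mechanism(true_result, sensitivity, epsilon, rng=None):
    """Privatized result R = F(D) + Lap(0, b) with scale b = sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    b = sensitivity / epsilon
    return true_result + rng.laplace(loc=0.0, scale=b)

# Counting query: "how many people in D have the flu?" -> sensitivity ΔF = 1.
true_count = 42
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1))
```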
Problems in differential privacy
- It’s best for low-sensitivity queries; sums or maxima, for example, are problematic. Consider the question “what is the total income earned by men vs women?”. A single very-high-income individual could change the result by an enormous amount, so you would have to add a lot of noise to cover this worst-case individual.
- adaptive querying leaks more privacy: if I ask a series of differentially private queries, I can disambiguate between the possible worlds. We therefore have to add even more noise to the results, but doing so may destroy the utility of the output (see the sketch below).
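A quick numeric illustration of the second point (the counting query and parameter values are made up): repeating the same $\epsilon$-differentially-private query and averaging the answers washes the Laplace noise out, which is why the privacy budget has to be shared across queries.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count, sensitivity, epsilon = 42, 1.0, 0.1

# Ask the *same* privatized counting query 1000 times
# and average the noisy answers: the noise cancels out.
answers = true_count + rng.laplace(scale=sensitivity / epsilon, size=1000)
print(answers[0])      # one noisy answer, can be far from 42
print(answers.mean())  # the average is very close to 42
```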
Reproducibility
The claim needs to be independently verifiable. Cloud computing can be an appropriate platform for reproducible data science as it packages “code + data + environment”. Otherwise, there are also virtual machines and containers.
Cloud computing
Just as the generators that factories ran in the 19th century became unnecessary in the 20th century, the same is happening nowadays with servers: most companies still run their own and might switch to the cloud (provided that issues such as privacy and security are solved).
There are Infrastructure-as-a-Service, Platform-as-a-Service and Software-as-a-Service offerings. Specific advantages of the cloud for research are:
- burst capacity (1,000 cores for 1 day costs the same as, or less than, 1 core for 1,000 days)
- reproducibility, as investigators use the same tools and data
- sharing and collaboration
- elimination of redundant infrastructure across units
Costs and sharing
Who pays for reproducibility? Answer: you, you or them, or them. The costs of hosting data shouldn’t be neglected.
Observation on Big Data: the only solution is to push the computation to the data, rather than pushing the data to the computation. It takes days to transfer 1 TB over the internet; copying a petabyte is operationally impossible.
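Rough arithmetic behind that claim; the 30 Mbit/s effective throughput is my assumption, not a figure from the course.

```python
def transfer_days(size_bytes, bandwidth_bits_per_s=30e6):
    """Days needed to move `size_bytes` over a link with the given throughput."""
    return size_bytes * 8 / bandwidth_bits_per_s / 86400

print(f"1 TB: {transfer_days(1e12):.1f} days")   # ~3 days
print(f"1 PB: {transfer_days(1e15):.0f} days")   # ~3000 days, over 8 years
```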