Data Scavengers

31 minute read | First published: June 12, 2023
By Marek Tuszynski
When we think about evidence-based investigation and storytelling, we make a certain promise and assumption – that our evidence, data and information will be sound, valid and verifiable. In short, we make a commitment to the trustworthy foundation of our version of events, our interpretation of the facts and the validity of our conclusions. In this context, the material we choose to work with (data, that is) appears to be something solid, comprehensive, authentic, perhaps even neutral and unquestionable. In some areas of data visualisation and evidence–based storytelling, we may come close to this pure notion of data. But, more often, we have to deal with the opposite scenario, in which the data we have access to is unreliable, fragmentary, imprecise, random, unpleasant, hypothetical, risky, dubious... the list goes on.
This text will provide some examples of how to deal with such data sets and how to work through them in order to yield interesting results and stories. One method is to encourage people to look for datasets, to ask for them, to collect them and then to use them. The other approach is to admit that the data we end up working with is questionable, if not basically sh..y. But this is the reality of trying to bring light into informational dark spaces.
To borrow the Dark Forest (黑暗森林) hypothesis from Liu Cixin's novel The Three–Body Problem (earlier described by Stanislaw Lem in his novel A Perfect Vacuum): without going into its depths, the hypothesis posits that the reason we cannot find others living in the known cosmos is perhaps not that there are none, but that it is better to stay quiet in the dark forest and pass by safely unnoticed than to be eaten alive. To put this hypothesis in the context of Tactical Tech's work in questioning the way Big Tech operates and trying to expose its social and political risks: they – Big Tech and its acolytes – maintain a high level of signal ambiguity (not silence, but ambiguity). It is as if you are dealing with an elusive substance (the dark matter, the silence of outer space – in our case we simply call it "grey data") that makes it difficult to draw definitive conclusions, and which hides any significant notion of negative impact. At the same time, these same companies and businesses have a sophisticated apparatus that knows a lot about the users of their services and infrastructure (on which we, the users, often depend). If you manage to lift the carpet a little too successfully, you are either flagged as a rule-breaker or, again following the dark forest principle, you are silenced (fired from your job like some AI experts, your social media monitored, or otherwise ridiculed).
So maybe instead of stretching the Dark Forest Hypothesis too far, we should give it a new name. While not wanting to invoke gaslighting (emotional abuse), we should perhaps call it the Great Silicon Valley Swindle: a kind of sophisticated multi-actor deception in which one group is told to give up their data in exchange for free and dependable services, while another group is asked to give money for influence. A third group is told that you support fundamental freedoms and build tools for society and democracy, even while you are increasing polarisation, radicalisation and so on. Everyone is happy, even while feeling that something is a bit off.
Let's look at some specific kinds of unreliable data in our work.

How To Turn Bullshitting into a Story, or How To Turn “Blah Blah Blah” into “La La La”

In our work at Tactical Tech, we have made numerous attempts to show the hypocrisy of Mark Zuckerberg and Facebook/Meta. Here, we will focus on our work Notes on an Apology Tour from 2019 produced together with La Loma. The object consists of two parts: the first is a classic Rolodex with hundreds of cards in it, and next to it, two notepads that look like two pages of printer paper.
The Rolodex and Zuckerberg’s “cheat sheet”, picture from the exhibition The Glass Room, 2019, Tactical Tech & La Loma (photo: author)
Bear in mind that when you try to collect material on a big tech CEO like Mark Zuckerberg, you encounter a lot of material that has already been vetted by an army of lawyers, PR experts and other advisers (with the possible exception of Elon Musk). So such material is not trustworthy per se – unless what you want to do is analyse such communications. We decided to tackle this by contrasting three sources of information that we could get our hands and eyes on.
First, the Rolodex, which has two sets of cards – the red set and the blue set. The red set contains all of the publicly available admissions, apologies, lessons learned, pledges and promises made by Mark Zuckerberg from 2003, when he launched his first product called Facemash, until the end of 2019 (when we made the object, just after the Cambridge Analytica scandal). The set of blue cards originally consisted of all the questions asked by US Senators and EU Parliamentarians during the Cambridge Analytica hearings, but we then removed the EU part as there were too many cards.
Sample of a few “promise” cards, design La Loma
The two pages next to the Rolodex are a facsimile of a “cheat sheet” that Zuckerberg had with him during his testimony to the US Senate, which he left behind at some point during a break. While the pages were being retrieved by an assistant, an AP photographer was able to take a high-resolution image of them.
Fragment of Zuckerberg’s “cheat sheet”, 2019
What do you get when you put all this information together? One way of approaching this would be to weigh all the public statements Zuckerberg has made against all the questions he has been asked by policy makers, which might lead to some interesting conclusions, such as some of the questions were naive, or that they were asking about things he had already promised to fix or apologised for a long time ago. Or we could try to make some sense of his public voice (statements) versus his inner voice (the internal crib sheet). Would we learn more than that these two voices seem contradictory?
Our assumption was the opposite: we assumed these data sets were incomparable – a classic case of apples and oranges. Comparing them would not reveal anything meaningful. On the other hand, seeing how incomparable these sets are is already a story we wanted to tell, and we wanted to let the audience explore it at their own pace and interest. We were aware that for some, particularly in San Francisco where we were showing the object, it might be an insight into how you should or could run a company that wants to “move fast and break things,” and for others, it might be an illustration of the immense gap between those who break the rules and those who define them, and the fact that a company like Facebook defines far more rules than it breaks. The modus operandi of Big Tech companies is to enter unregulated territories at the intersection of new technologies, society and politics, push as far as they can into the first cracks they see, and then activate their legal and lobbying apparatuses to mitigate potential damage, while continuing to push into other areas that are not yet on anyone's radar, especially those whose radars are clearly out of date.
The two pages immortalised by the AP photographer are a tiny, specific and fragmentary glimpse into the strategy of a multi-billion-dollar company at risk of losing the public’s trust, being advised to come clean with as few consequences for the business itself as possible. In the end, the scandal cost Facebook 643,000 USD in the UK and a 5 billion USD civil penalty in the US. Even including legal and other costs, the whole defence doesn’t compare with their 2019 revenue, which clocked in at seventy billion US dollars: the fines paid equalled roughly 7.1% of annual revenue. And how should we interpret Zuckerberg's devotion to jiu-jitsu, which recently won him some tournament medals? If you are wondering what this has to do with anything, jiu-jitsu (in particular the Brazilian jiu-jitsu he prefers) is one of the few martial arts that specialises in diverting and deflecting the force of an opponent – a masterclass in dodging.

“You might regret what you wish for” data

In 2016, together with artist Joana Moll, Tactical Tech decided to test the following hypothesis: it is possible to buy a lot of personal data legally and openly. At the time, there were many discussions about what kind of data was and wasn’t available, and affordable, along with a lot of questions about how the data broker business works. But there weren't many documented examples. So the idea was to focus specifically on data from dating websites (as it is, by default, very personal) and to use only legal and open forms of data brokering (no dark net data buying). We identified a US broker website (in case you want to replicate our process: thanks to Joana's and our research, this is no longer possible for the datasets we got access to) and went online to buy some records. We decided to buy 1 million profiles for 136 euros from a service called Usdata, paying with our credit card. The rest is history – the whole investigation is documented and turned into an interactive auction.
Snapshot of the Investigation view of the Dating Brokers Project, Joana Moll, taken in May 2023
The point of writing about this project here is to highlight the risks one might encounter when exploring different ways of verifying a seemingly innocent hypothesis. We ended up with 1 million personal profiles, which included pictures (almost 5 million of them), usernames, email addresses, nationality, gender, age and detailed personal information about all the people who had created the profiles, such as their sexual orientation, interests, occupation, detailed physical characteristics and personality traits. The point of our investigation was to make users of web services based on detailed profiles aware of the consequences of sharing data with relatively unknown companies with unclear privacy policies and undisclosed business models. Our aim was not to expose any of the personal data we might get our hands on, and certainly not to expose any users who had already fallen victim to terrible data practices.
So how do you show the results of your research? How do you use the data without exploiting the users (who are already being exploited, we discovered, many times)?
Joana, together with her collaborator Ramin Soleymani, was able to trace two separate networks of exploitation. By extracting the metadata from the images, they were able to reverse-engineer a hidden network of companies owned by other companies that apparently shared their data, without users' consent, across many of their own dating services (please read the actual research mentioned above). By analysing all the websites, they unravelled a vast network of third parties whose cookies and trackers (which enable advertising) were given access to collect live data from users of these dating services. They counted over 300 third-party cookies on sites owned by Match Group and IAC (which owns most of the dating sites you know very well, as well as some you may never have heard of).
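The general technique behind the first trace can be illustrated with a minimal sketch. Everything here is hypothetical – the domain names, the record layout and the idea that shared infrastructure shows up as a common photo-hosting domain – but it shows how grouping records by a metadata field can surface hidden links between nominally separate services:

```python
from urllib.parse import urlparse
from collections import defaultdict

# Hypothetical profile records: the dating site of origin plus the URL
# its photos are served from. Real research extracted richer metadata.
profiles = [
    {"site": "datingsite-a.example", "photo_url": "https://cdn.sharedhost.example/img/1.jpg"},
    {"site": "datingsite-b.example", "photo_url": "https://cdn.sharedhost.example/img/2.jpg"},
    {"site": "datingsite-c.example", "photo_url": "https://media.otherhost.example/3.jpg"},
]

def hosts_shared_across_sites(records):
    """Map each photo-hosting domain to the dating sites using it.
    A host serving images for more than one site hints at shared back-end
    infrastructure, and hence possible cross-site data sharing."""
    host_to_sites = defaultdict(set)
    for rec in records:
        host = urlparse(rec["photo_url"]).netloc
        host_to_sites[host].add(rec["site"])
    return {h: sorted(s) for h, s in host_to_sites.items() if len(s) > 1}

print(hosts_shared_across_sites(profiles))
# → {'cdn.sharedhost.example': ['datingsite-a.example', 'datingsite-b.example']}
```

The same grouping idea applies to any shared identifier – tracking IDs, certificate fingerprints, corporate registration numbers – not just image hosts.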
Joana's key principle in the research, and in the artistic project that came out of this collaboration, was not to expose the people behind the data that we had access to. And yet we felt it was important to show the nature and scale of the data that was being processed and that was available to anyone, even with limited resources. The research part was easier to manage from this perspective – as the data we had was used as a source for expanding the research and collecting company data – there was no risk of exposing the data we had. In her artistic project, The Dating Brokers Auction, Joana decided to show the data we got.
A screen shot of the Auction part of the Dating Brokers Project by Joana Moll, taken in May 2023
The interface you encounter when accessing the project offers a specific auction describing the types of user profiles you might be able to access, updated every three minutes or so. There is no way for the visitor to make their own request or ask for specific data sets, but you can of course move between these randomised data sets extracted from the 1 million profiles. Each auction has a price tag: it is not a price you have to pay, but rather a provocation to get the visitor to consider what the price could or should be for such detailed and personal data sets. If you decide to 'buy' (there is no transaction involved – the project does not collect any data from its visitors), you will see pictures (in blue) of all the people whose data is being auctioned. By clicking on a picture, you can look at a specific individual record. Joana decided to go to the limit in terms of possible exposure of people, while minimising this risk as much as possible. For the provocation to be a provocation, she had to retain an element of risk – either of finding oneself in the dataset, or of finding people one might know. However, the following safeguards make this almost impossible:
  • There is no way to access the entire raw data set.
  • The data is presented in randomised chunks that the visitor cannot interrogate: you cannot search for names, places or other variables, and there is no way to query, filter or explore the data.
  • All you are given is a small random subset of the data every three minutes, and there is no way to go back to previous auctions.
  • The images used are blurred.
  • There is an emergency button to remove a face: despite all the (automated) efforts to blur faces, there is a small chance that some might still be borderline recognisable, and the button allows people interacting with the data to immediately trigger the removal of that face from the dataset.
The snapshot of the images view from a random Auction, Joana Moll, taken in May 2023
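The three-minute randomised lots could be served roughly as follows. This is not the project's actual code – the window length, lot size and seeding scheme are all assumptions – but it shows one way to give every visitor the same small random, non-queryable subset per time window, with no handle to search the data or revisit past lots:

```python
import random
import time

def auction_lot(profiles, window_index, lot_size=12):
    """Deterministic random lot for a given time window: every visitor in
    the same window sees the same subset, and no query interface exists."""
    rng = random.Random(window_index)  # seed by window, not by user
    return rng.sample(profiles, min(lot_size, len(profiles)))

def current_lot(profiles, window_seconds=180, lot_size=12):
    """Lot for the current ~three-minute window; past windows are not kept."""
    return auction_lot(profiles, int(time.time() // window_seconds), lot_size)

profiles = [f"profile-{i}" for i in range(1000)]
lot = auction_lot(profiles, window_index=42)
print(len(lot))  # → 12
```

Seeding by window rather than keeping state means no lot ever needs to be stored: a past auction is unrecoverable by a visitor, but trivially reproducible by the server within its window.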
This uncanny part of the project (the potential of finding yourself or someone you know), which triggers our worst buyer's instincts, is essential to the project. Watching people interact with it highlights something we need to address when working with such datasets – the majority of us who frequent digital spaces, if given the opportunity to explore other people's personal data, would be more likely to do so than to ignore such an opportunity.
We are often asked by shocked audiences why they have never read about this project in any mainstream media – and we have tried. Apart from a few small outlets, we were always turned down. Our guess is that this investigation not only exposed the horrendous data practices of certain companies and their exploitative business models, but also exposed something that turned out to be (and still is) an ethically unsustainable but common way of relying on tracking-based digital advertising. What we exposed specifically in relation to dating sites is no different for any online service that generates its income by hosting such advertising, sponsored content or third-party cookies.
Another investigation certainly raises a lot of moral questions about who collects, who exposes, who is exposed, and whether they should be.
Screen shot of the series of interviews with M.C. McGrath, Exposing the Invisible 2015
I interviewed M.C. McGrath in 2015 for our Exposing The Invisible project, to introduce and explain a tool he had created called ICWATCH. At the time, it consisted of a database of an estimated 27,000 LinkedIn CVs of people working in the intelligence sector. The method of obtaining the data was very simple: M.C. ran a script that looked for any CV posted on LinkedIn that mentioned any of the code words relating to specific intelligence, surveillance and similar secret projects. It turned out that people working in this sector, looking for new opportunities, inevitably needed to use recognisable references and professional slang to prove their experience and expertise. M.C., driven by different principles, decided not to protect the identities of these 27,000 or so people, but to expose them. His decision was based on the assumption that this data is available to everyone, that anyone can run similar scripts, and that it is in itself a security threat to a sector that should be more aware of it and able to do something about it. And besides that, someone should be watching the watchers. I will leave it here for your own judgement, and for you to read the interview if you are interested.
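The kind of script M.C. describes can be sketched in a few lines. The code words below are well-known public examples used purely for illustration – not ICWATCH's actual vocabulary – and the matching is deliberately naive:

```python
import re

# Hypothetical vocabulary of programme code words (illustrative only).
CODE_WORDS = {"XKEYSCORE", "PRISM", "SIGINT"}

def flag_cv(text, vocabulary=CODE_WORDS):
    """Return the code words a CV mentions, matched as whole words,
    case-insensitively."""
    tokens = set(re.findall(r"[A-Za-z]+", text.upper()))
    return sorted(vocabulary & tokens)

print(flag_cv("Senior analyst; experience with SIGINT collection and PRISM tasking."))
# → ['PRISM', 'SIGINT']
```

The hard part in practice is not the matching but curating the vocabulary and handling false positives – ordinary words that double as programme names.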
You can also explore another investigation, into data extracted from popular personal finance apps and services.

Working with data without... actual data

You might have heard enough anecdotal stories that you started to suspect something more systemic is happening around us. That's what happened to us around 2016/17, when we were researching the use of personal data for political influence. We started hearing more and more stories about how collecting and analysing voters’ and potential voters’ personal data in the service of political influence was becoming a pretty lucrative business (this was before the Cambridge Analytica scandal broke).
Of course, there was no registry, repository or place where you could go to get data on the companies who were brokering data for political influence. You could, however, search the internet, study attendees and proceedings of various marketing and political conferences, and dig into the little research and reporting that was available. From these sources, Tactical Tech’s Data and Politics team began compiling a spreadsheet that represented a cross-section of the companies that seemed to be offering their data-driven services to political campaigns.
A few years later, we were able to make it easier for anyone interested to access more detailed information about companies that offer data-driven political services, what methods they use and how they do it on the ground.
View of the main interface of the Influence Industry Explorer, taken in May 2023
What we have now under the Influence Industry Project consists of a database of over 500 companies claiming to provide influence services; a repository of resources consisting of 20 country case studies, work we have done with international partners on the ground during election periods; as well as some explanations and analysis of the methods used by the influence industry; and finally a set of learning modules to help those interested to learn more about some of the nuances of the influence industry and how they can research it in their own context. Again, you are welcome to dive in and explore the project, and don't miss our guidebook on the methods of the influence industry.
How did we get there – what kind of material do we have to work with? The initial research involved mapping companies, accessing their material (websites, reports, talks, presentations), building a vocabulary and moving forward step by step.
Our Data and Politics team spent countless hours looking at whether and to what extent political parties spent money on various political influencing practices and campaigns, ranging from ads on social media to campaigning apps, to A/B testing, robocalls, and many more. We were ahead of others, but after the exposure of the Cambridge Analytica scandal, it became easier to get people interested in this work and to develop it further.
So why do we talk about “working without data” here? Because what we termed the influence industry is inherently opaque. Our primary evidence is based on analysis of self–described services, methodologies and tools – which we have no easy way of verifying, some of which may just be bravado, some of which may be gross overestimation, some of which may be PR, some of which may be the old school 'fake it until you make it' mantra. On the other hand, it is clear that the unprecedented production and accumulation of personal data (including the behaviours, habits, beliefs, and aspirations of people who use the web for basically everything these days) is opening up unprecedented ways of exploring and exploiting it.
The question we often ask is not really whether these data-driven methods work or how effective they are, but rather what sort of potential they have. What if the tools, the automation, the algorithms get significantly better and the people using them get more sophisticated? In a way, we are not looking at what we have now – but rather what future we are building now by allowing current practices to grow widely without much oversight or concern.
Over time, we have been able to expand our dataset from the UK and US using public registers of political party spending; it requires a lot of work to match the generic descriptions of spending with the actual services that might be linked to political influence tools and practices. What we also suspect, looking at the figures, is that these practices are mainly available to the parties and candidates who can pay for them, widening the gap between small and large parties and expanding the political influence of those who can afford it.
There's another thing we don't have data on yet: the techniques we've analysed can not only help political entities gain power, they also certainly generate revenue, as campaigns ask for votes as well as material support (donations), often in exchange for swag. This aspect of using the Influence Industry's toolset to increase one's financial capacity still needs to be explored and analysed.

Ephemeral Data or How to Build Time Capsules

In 2019, we wanted to do some work on technology in exceptional times – in times of crisis. We were interested in finding out what kind of technology gets recognition during times when there is a massive need for solutions and a lot of resources are being put into solving those problems. And since we live in a time where almost any solution involves technology, we wanted to make some observations about AI in crisis in particular. When we were working on the concept and applying for funding (which we got from Onassis Stegi, and which won us an Ars Electronica nomination), we had different types of crisis in mind: environmental, political, social, economic, and so on. Then, just as we were about to start the project, Covid took over everything. So we decided not to miss this opportunity. What came out of the intensive research was a project called Technologies of Hope and Fear.
The landing page of Technologies of Hope and Fear, taken in May 2023
Much as with the Influence Industry project, we had very little to start with; we set out to follow any trace of anyone mentioning a technological solution to Covid. We made it clear that we were not looking for technologies related to vaccine development, and as much as Covid trackers and their role in the public discourse were a focus, we did not want to confine our research to them. The basic questions we were asking were: during a crisis, what kind of technologies get more traction, and what kind of developments become normalised and taken for granted post-crisis? Are there new technologies being developed? What kind of technological development is accelerated? And finally, can we see any patterns and trends in terms of what kind of practices get traction?
The idea was not to try to build a comprehensive dataset that could be used as a solid reference on the players and their tools. Knowing that we were dealing with a temporary but intense crisis, we expected a lot of garbage to deal with: opportunistic ideas, repurposing of existing solutions, people looking for new problems to attract some funding, and so on. The idea from the start was to take a snapshot – an arbitrary, experimental, fluid and even controversial look at something that we felt no one else was looking at from such an angle.
What did we find? As predicted, lots of fluff, randomness and impossible-to-validate promises. We also found lots of repurposing, pivoting and reframing. All of which are explained in the project – so please read the short text and explore all 100 examples we chose to focus on (however, when we closed our dataset, it had over 300 entries).
The main findings were that all the solutions, whatever their purpose, focus or range, explored the use of technology to control the host rather than the virus. The majority were based on various levels of exploitation of personal data, either collected live or from a distance (time, space). We grouped them into specific categories where we think the technology being promoted is trying to answer four different questions:
  • How can we catch a virus before it becomes visible or spreads?
  • When a virus enters the host – how do we know it is there, and how do we know what’s going on inside the host's body?
  • How can we control the movement of the host? There is no way of controlling a virus without controlling the movement of its host – this is how a virus spreads.
  • We still don't know if the virus will mutate and force us to live differently, or if another pandemic may take its place. How can we prepare to live in a potential 'no touch' future?
We observed that, in trying to answer these questions, different technological solutions were circling the wagons in search of investors, users and customers. The questions also highlighted four types of intelligence that these technologies rely on:
  • 1. Ambient Intelligence: solutions are mostly based on large-scale (populations) and largely depersonalised data (this is not to say that it is not possible to de–anonymise these datasets, but the technologies do not require this, per se).
  • 2. Biometric Intelligence: unlike technologies that rely on ambient intelligence, the source is always a specific individual. Data is aggregated and so on, but the link to the subject is the closest, such as in technologies like personal trackers.
  • 3. Mobility Intelligence: from today's point of view, this will not really stand out as a significant category, but at the time the data was collected, this was one of the dominant problems – how to manage mobility in a dangerous environment, that is, when mobility amplifies the spread of the virus.
  • 4. Behavioural Intelligence: this may look like a version of biometric intelligence, but it's not. Here, we're talking about individual and group behaviour and how to control it (such as behaviour in crowded places) as well as how to modify it by training people to develop new behaviour, specifically in places like offices, schools, etc.
As noted above, the purpose of this data snapshot was not to provide a comprehensive overview of all the technologies that are being developed and promoted, but rather to try to capture what they are potentially normalising, as opposed to what would have been difficult to talk about prior to the crisis. One clear example here is location data: many companies would not readily admit that they had access to and processed such data. On the other hand, we also observed a significant push to normalise the use of space and behavioural control tools that would have been considered invasive prior to the pandemic, such as closed circuit camera networks with sensors, or sensor networks that monitor people in spaces. Another important observation we were able to make is that the majority of critical technological capabilities in times of crisis are in the hands of the private sector, not the public sector.
But to make it even more interesting and intriguing, we decided to screenshot every single website promoting the tools in focus and link them to their promotional videos. In many cases we had to use the Internet Archive, as solutions were coming and going very quickly – which we saw as a feature of the project rather than a problem. We found the metaphors and narratives these companies used to promote their solutions fascinating, and we wanted to present them for visitors to explore and draw their own conclusions.
Looking at it post–pandemic, we can see a ring of debris left behind by all these technological solutions, like the ring of defunct satellites littering Earth's orbit. Except in this case, it is not only dangerous because it might fall on our heads but also because it has introduced more surveillance and normalised it.

Grey Data that Is Kept Grey on Purpose, You Might Suspect

We have been trying to work with DensityDesign on a single project for a long time. Usually we didn't have enough good data to work with. If you know DensityDesign's work, you know that they are amazing at working with and visualising complex data sets. But every time we knocked on their door, all we had was something that was certainly complex, but much less computable than the datasets they normally work with. After much discussion, we decided to try to work with the dataset we had collected and improved over many years – the full story of which I wrote about earlier this year in “Adventures in Mapping Digital Empires.”
The project is called GAFAM Empire.
One of the infographics from the GAFAM Empire project – called petri dishes, captured in May 2023
I am not sure which of my arguments convinced the DensityDesign team to agree to work on the project, but in the course of our many conversations it became clear that the data we had (we are talking about 1,210 records of GAFAM (Google, Apple, Facebook, Amazon and Microsoft) acquisitions between 1978 and 2022) had many problems embedded in it.
In one of our conversations, Ángeles Briones of DensityDesign drew my attention to a research paper by Dobrica Savić entitled “When is 'grey' too 'grey'? A case of grey data”, in which he writes:
“‘grey’ indicates partially known and partially unknown information. More specifically, grey data represents small samples and poor information.”
What struck me was the reflection that in our evidence-driven storytelling we constantly encounter what Savić calls ‘grey data’: where for him it represents only a small sample, in our context it represents most of what we deal with on a regular basis.
This is not because the methods we use generate unreliable data sets. Rather, the minimum we are able to gather in order to understand larger processes is poor by default and, on top of that, seems to want to stay poor, frustrating any considerable effort to come to solid conclusions.
In trying to unpack and find ways of translating the data practices and business models of the actors we are looking at, to figure out what impact these have on society, we are flooded with poor information and grey data. Paradoxically, or not, those who are masters of data collection and its monetization have also perfected generating useless data about themselves.
When Ángeles summarised the data that we were trying to clean and unpack in the GAFAM Empire project, it was a great example of grey data, because what we had could be summarised in the following ways. The data was:
  • Incomplete – we can only get what we scrape and what has been made available.
  • Fuzzy – it only offers interpretations: we don't know the details, the exact amounts, the assets involved, or even what exactly was acquired and why.
  • In flux (constantly changing) – much of the information we unearthed has changed over time – before, during and after acquisitions.
  • Multivariate – as with many incomparable variables, this requires researchers to arbitrarily group variables into some meaningful categories.
  • Unverifiable (from unclear sources) – sources were often anecdotal, things mentioned incidentally in odd contexts, etc.
I would be curious to have a broader discussion not only with people and organisations working with technology, data and society, but also those working with big business and the environment, with health and education and more. Is this the case in all those fields? Are we all just data scavengers?

And a little appendix for those who persistently work with data, regardless of the weather

Some readers might interpret this text as exposing the fact that we are not able to get proper data. Maybe you're thinking that, because we are scavenging for unpolished datasets, we may be undermining the quality of our work. Those with a scientific or investigative mindset might believe that you should keep searching for clean data rather than making assumptions. However, I hope that this text shows two important things: a) scavenging is not bad – someone has to deal with messy stuff and recycle it; and b) while you cannot produce reliable results based on unreliable source material, this unreliability, for us, is the actual story. Part of our work is to expose the unreliability of such datasets and to seek narratives that should trigger changes in what information is collected, by whom, who is responsible for it, and who is accountable for it. Rather than hopelessly waiting for someone to hand us a bundle of clean data, we want to expose the narratives hiding in the data that is available. If you want to use our guides on how to access many other forms of evidence and data, please explore our other resources listed below: