Presented by

  • Kathy Reid

    https://kathyreid.com.au

    Kathy Reid works at the intersection of open source, emerging technologies and technical communities. Over the last 20 years, she has held several technical leadership positions, including Digital Platforms and Operations Manager at Deakin University, managing platforms such as WordPress, Drupal, Squiz Matrix and Atlassian Confluence, and technical lead on projects involving digital signage and videoconferencing. She has also worked as a web and application developer. More recently, she has run her own technical consulting micro-business, and has been engaged on a variety of projects involving data visualisation, certification applications and emerging technologies workshops. She was previously Director of Developer Relations at Mycroft.AI, an open source voice assistant startup, and President of Linux Australia, Inc, a not-for-profit organisation which advocates for the use of open source technologies and runs technical events such as Linux Conference Australia. She brought GovHack – the open data hackathon – to Geelong in 2015 and 2016, and in 2011 ran Geelong’s first unconference – BarCampGeelong. Most recently, she worked as a voice open source specialist for Mozilla. Kathy holds Arts and Science undergraduate degrees from Deakin University, an MBA (Computing) from Charles Sturt University and a Master in Applied Cybernetics (MAppCyber) from the Australian National University, as well as several ITIL qualifications. In 2019, she was one of 16 people from across the world chosen to undertake a Masters Program in a brand new branch of engineering at the Australian National University's 3A Institute, where she is now a PhD candidate researching voice data and ways to prevent and respond to bias in machine learning systems that use voice and speech, such as speech recognition.
Kathy recently completed a Research Partnership with Mozilla's Common Voice team, where she used Mozilla Common Voice data to assess the performance of the Whisper speech recognition model on accented English, showing that it was much less accurate for many spoken accents.

Abstract

In recent years, there has been an explosion in generative AI. Most of us are now familiar with tools like ChatGPT, Midjourney, Sora, and others. At the heart of generative AI is a machine learning architecture called the "transformer", which is fed by huge datasets - text, images and videos. Those datasets are "tokenised" - cut up into chunks which the transformer can ingest. Those actors who can obtain the most tokens can generally train the best models (for various values of "best"). We are now witnessing a state of conflict enacted by creators of generative AI models - big tech, grassroots collectives and other corporations - who seek to obtain as much data as possible for tokenisation - while denying access to their competitors. These are the Token Wars. And they're the reason not *everything* should be open.

---

## TOKEN POWER: A technical grounding on transformers, tokens and how they are used to build generative AI

In this part of the presentation, Kathy will provide a technical grounding on generative AI, how the transformer architecture works, and in particular the concept of attention. She will briefly cover the concept of tokenisation for data input into transformer models, and the types of tokenisation that are often used. She will show how the volume of tokens used with transformers directly affects the performance of the models that are trained. This is the Token Power - the advantage that combatants in the Token Wars seek to achieve, and deny to others.

## TOKEN BATTLEFIELD: Where do tokens come from, and why is harvesting them contentious?

Here, Kathy turns attention (hah! pun!) to the processes used to harvest tokens - web scraping, APIs and private data deals. She will cover common web scraping datasets such as Common Crawl, and recent changes to websites included in Common Crawl which constrain its usefulness. She will cover the "state of play" that is emerging, as some websites sell their content under commercial licensing agreements, and others seek to prevent the unauthorised scraping of proprietary content. Moreover, Kathy will cover the shift of the web from the ideal of an open knowledge infrastructure into a bot-filled, AI-generated clickbait-filled void of token noise - the "dead internet". This is the Token Battlefield - the environment and context in which the Token Wars are being fought.

## TOKEN TACTICS: Guarding your token treasure - and why everything shouldn't be open

In the final part of the presentation, Kathy will examine tokens - and data - as a form of capital - showing the choices now faced by organisations in an environment of poor economic conditions, unregulated web and data scraping, and concentrated market power of generative AI companies. She will cover strategies to address the Token Wars - such as protecting tokenisable data through user agent detection, networking heuristics and login-walling, alongside the commercial realities that are leading sites such as Stack Exchange, Reddit and Condé Nast to ink deals with OpenAI. She will distinguish _data_ as something given from _capta_ as something taken. Further, she will show how some forms of data - rare, precious and therefore highly valuable for tokenisation - should be strongly protected for cultural, societal and historical reasons - and how we may hoard our collective token treasure. These are Token Strategies and Tactics - the measures that players in the Token Wars will use on the Token Battlefield to gain Token Power.

To conclude the presentation, Kathy will lay out some possible Token Futures - what might we expect to see in this space over the next 12-18 months?
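
As a rough illustration of the user agent detection mentioned under Token Tactics, the sketch below flags requests whose `User-Agent` header names a known AI crawler. It is not taken from the talk: the function names `is_ai_crawler` and `response_status` are hypothetical, and the short crawler list is an illustrative assumption rather than a complete or maintained blocklist.

```python
from typing import Optional

# Illustrative, incomplete list of crawler tokens (an assumption for this sketch).
# GPTBot is OpenAI's crawler; CCBot is Common Crawl's; ClaudeBot is Anthropic's.
AI_CRAWLER_TOKENS = ("gptbot", "ccbot", "claudebot")


def is_ai_crawler(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent string contains a listed crawler token."""
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(token in lowered for token in AI_CRAWLER_TOKENS)


def response_status(user_agent: Optional[str]) -> int:
    """Hypothetical policy: refuse listed crawlers (403), serve everyone else (200)."""
    return 403 if is_ai_crawler(user_agent) else 200


if __name__ == "__main__":
    for ua in (
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "CCBot/2.0 (https://commoncrawl.org/faq/)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
    ):
        print(response_status(ua), ua)
```

A scraper can send any `User-Agent` string it likes, so a check like this only stops crawlers that identify themselves honestly. That limitation is one reason the abstract lists it alongside networking heuristics and login-walling rather than as a defence on its own.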