atrancon
Latest posts from Álvaro Trancón
-
Doublethink at work
Dec 02 ⎯ Hey you! Yes, you, the person reading this. I'm no psychic, but if I had to guess, there's a high chance you spend most of your workday in front of a screen. Maybe you do it from a steel and glass behemoth located on the outskirts of the city, maybe from a "cool and hip" coworking space with a foosball table or, maybe even, from the coziness of your home. What's your area of work? Are you a data-obsessed analyst unraveling the mysteries of your company's databases? Maybe a creative designer currently preparing the Christmas season marketing assets? Or, like me, a software engineer?

Now I'm going to make a more focused guess. Knowing that most people will reach this article through LinkedIn, below there are two statements, and both are true for 90% of you:

- You belong to a mission-driven, product-oriented company, with a strong set of values but adaptable to change, aligned employees, that focuses on its customers to solve problems and puts people first.
- You need money to live.

Your company needs money to exist. You work for a company, both of you earn money, and everybody's happy. If this is true for you, this article includes some tips on how to improve the way you experience your job. If you belong to the exception (you're a freelancer who has always flown solo, a civil servant, or even a rich heir living off the family money), at least there are funny bits.

Why write this?

Some years ago I studied Psychology, and the area that interested me the most was Organizational Development. This area includes things like Change Management or Employee Engagement; anything related to how people interact in a work setting. I mention those two areas because in these uncertain times a lot of literature has been written about retention, hiring, firing, promoting and engagement. Up until 2024 the main topic was hiring, since we were in a boom, but money is tighter now and the conversation has switched to layoffs.

Regarding individual development, there's the whole "soft skills" topic. These are individual qualities and attitudes that employees have to some degree. Some of them are more useful in certain jobs, like a detail-oriented analyst; others are useful for everybody (teamwork, adaptability).

The idea for this article has been going in and out of my mind for a few years. From my own experience working at different types of companies (start-ups, corporate environments…) and from talking to colleagues, I strongly believe there's a "hidden" soft skill that is very useful for both your mental health and your work experience.

Doublethink

Doublethink is a concept from the novel 1984, by George Orwell. It's one of the most referenced books in media, and most of you know it. There are many concepts we use today that stem from the book, like Big Brother, but for this article I'll focus on doublethink.
The book defines it as:

"To know and not to know, to be conscious of complete truthfulness while telling carefully constructed lies, to hold simultaneously two opinions which cancelled out, knowing them to be contradictory and believing in both of them, to use logic against logic, to repudiate morality while laying claim to it, to believe that democracy was impossible and that the Party was the guardian of democracy, to forget whatever it was necessary to forget, then to draw it back into memory again at the moment when it was needed, and then promptly to forget it again, and above all, to apply the same process to the process itself—that was the ultimate subtlety: consciously to induce unconsciousness, and then, once again, to become unconscious of the act of hypnosis you had just performed."

"Wait a second, Alvaro. I don't tell lies and repudiate morality." I know, I know, don't worry, I'm not accusing you of anything. That definition is just to give context; let's take the Wikipedia one:

"Doublethink is a process of indoctrination in which subjects are expected to simultaneously accept two conflicting beliefs as truth, often at odds with their own memory or sense of reality."

"And how does this apply to my job, Alvaro?" Doublethink comes in handy when you experience things at work that don't seem to make sense. Once you look through the prism of doublethink, things become coherent. Let's see if you have ever experienced any of these situations.

Doublethink at "work" (pun intended)

Examples

This project is gold / This project is a dumpster fire

You get into a new job or a new project and things look great on paper, only to find out that everything is barely holding up. Or someone with enough power has decided to do something that the rest of the team knows is not going to work, but you cannot say no.

We put people first / We need the numbers to go up

You have pep talks, seminars, events with food and drinks if you're lucky… Your company presents itself as a place where its employees are cared for and thrive professionally. At the same time, everybody is expendable, and if investors aren't happy people can be laid off in a single video call without much explanation. A bit unrelated, but it's very funny to me when Spanish companies post job offers saying that one of the benefits is 22 holidays a year, which is the minimum required by law.

This is a priority / Nope, it wasn't

Imagine this: you're in a planning meeting and you agree to take on an extra project, to be delivered within an unspecified timeframe. You all agree that the benefits will be worth the extra workload during this time. But the truth is everybody is quite busy, so nobody pushes for the project and people forget. Two weeks before delivery, the boss asks about the project and suddenly it's an all-hands-on-deck situation. Bonus points if the project gets abandoned after delivery. Extra bonus points if people ask about it 6 months later and the process repeats.

Focus on the important / Administrative work

This one is one of my favorites. Everybody wants to "add value" and do productive work, but sometimes the administrative part takes up more of your day than the real work. Status meetings that could be summarized in an email, lengthy surveys with questions you're not sure how to answer… Just think of all the work you could get done instead.

We got money / We ain't got no money

This is something I've seen most often in startups, but not exclusively there.
On one hand, the company boasts about record profits or organizes lavish events, but on the other there's no money to pay for the team designer's Figma license.

I like my job / I hate my job

Not only does your company engage in doublethink; you probably do too. There are days when successes are celebrated, but on others you just want to quit and go live in a cabin in the woods.

We're making the world a better place / Are we?

This doesn't worry everyone, and that's completely fine, but many of us have this dilemma. Our jobs may be engaging and stimulating, but sometimes one does wonder: what for? There are jobs that are inherently positive, like a doctor who at the end of the day has saved a life; but does "engaging the customer with a personalized shopping feed to increase time spent and revenue" really help anyone?

Tips for managing doublethink

All the examples above hold two opposing ideas and yes, both are true. We perceive reality filtered by our own ideas, and things are not stable in time; they change. In all fairness, doublethink is not limited to work, and some may even say it's an adaptive trait, a mix of adaptability and resilience. So, speaking from my personal experience, here are some tips to deal with these situations:

- Give your work and your company the attention they deserve. Going back to the beginning: you have material needs that must be met. Also, we all want to feel proud of a job well done and have good relationships with colleagues. But by "the attention they deserve" I mean no less, no more. Life has many other things besides work, be it family, friends, hobbies or whatever, and you should reserve time and energy for those.
- Always remember that there are good days and bad days. Self-explanatory.
- Learn which battles to pick. Sometimes you know something is not going to end well and you can't avoid it, while other times there's room for discussion. It's better to spend your effort on things you can change. As Blade would say: "some m******s are always trying to skate uphill". Don't be one of those.
- Empathy. Throughout this article I've been talking about "the company" as if it were an entity with a mind of its own. But companies are made of people, and people do people things. From the youngest hire to the CEO, everybody has a job to do; we all feel shame when we make mistakes, we all answer to someone, and we sometimes disagree with our colleagues. People in different departments have different interests (more or less aligned) and see things differently, so try putting yourself in their shoes once in a while.
- Humor. This one is quite personal, but for me some of the situations described above are funny in a twisted way. The irony of spending 3 months on a project that gets lots of attention from management, only for it to be abandoned one day before release to production and never be talked about again, is well worth a laugh. If your company allows you to laugh about these things (and I've been at places like that), it's a good sign.

Thank you for taking the time to read this, I hope you've enjoyed it.
-
Introduction to Voice Agents
Nov 26 ⎯ First, a few words

Hi, my name is Álvaro Trancón, and I've been building software for over 8 years. After studying Psychology I decided to switch careers and found my passion working with code; after all, machines are much simpler than people and you're guaranteed to get deterministic results. Most of my experience has been in building web applications with a focus on the backend. At this point I want to thank Pau twice: once for giving me the opportunity to work at Factorial, and again for creating this platform! ;-)

Technology evolves, and with the arrival of ChatGPT and LLMs to the general public, a new world of tools and applications has opened up. These last months I've worked at Quintess building voice agents aimed at mechanical operations. Funny that I mentioned before my preference for deterministic results, because now I've been working with things that are never the same twice. And though I was fired recently (just a day ago) for understandable reasons, I want to take the learnings of this short stint and show that the beginning of the learning curve is quite easy for a developer with some experience (though very hard to master). I don't consider myself an expert on AI, but after a few months working with it I feel like I know enough to explain the basic concepts to other people, and this article is my attempt at doing so. Also, with the speed at which everything is changing, maybe in one year this is completely outdated. Anyway, let's get into it.

Table of Contents

1. What's a Voice Agent
   - How inference works
2. The Project
   - Burger Order Taker
   - How to run it
   - Functions
   - State
   - Prompts
   - Event handlers
3. Finishing thoughts

What's a Voice Agent

Simplifying to the extreme, a voice agent is a "ChatGPT" with extras, built on top of Large Language Models (LLMs). This is my own definition, so take it with a grain of salt, but I define it as a program that interacts with one or more users through audio instead of a screen, in real time, uses LLMs for content generation, and may (or may not) interact with other programs or agents. We could argue that the old "Press 1 or say yes to continue" phone bots are voice agents, but since their responses are prerecorded I think it's reasonable to exclude them.

Pipecat is an open source Python framework for easily building voice agents, with a "plug and play" philosophy regarding the different providers (more on this later). At the end of the day, a voice agent is a piece of software (running locally or in the cloud) with the following lifecycle:

(conversation flow diagram)

1. Open a communication channel. This could be via WebRTC, WebSockets or other channels. Daily (the creators of Pipecat) provide easy-to-integrate virtual rooms.
2. Using an initial "context" prompt defined by the developer (setting the agent's persona and instructions), begin the conversation when the user connects. From here on, the loop goes like this:
   - The user speaks, and their audio is captured by the device.
   - The audio is transcribed into text (Speech to Text, or STT).
   - The transcribed text, along with the initial prompt and conversation history, is sent to an LLM. Note: conversations can "degrade" over time since the context window is limited; the more tokens or words, the lower the quality beyond a certain point.
   - The LLM generates a text response. At this point, it can also trigger function calls to execute code, if defined.
   - The text response is converted to audio and delivered to the user (Text to Speech, or TTS). We can configure different parameters for this stage, like the voice used or the speech speed.
   - The loop repeats for each turn.
3. After the conversation ends (user disconnects, program finishes, or another event occurs), custom code may be executed depending on your requirements.

There are many different providers for these stages: for example, Google offers all 3 services, while others like Deepgram excel at one or two things. The good thing about Pipecat is that there's support for many of them, and it's very easy to switch between them.

How inference works

Each time the user speaks, the entire conversation history is sent to the LLM along with the newly transcribed text. The conversation is structured as a list of messages with different roles:

- system: the initial prompt defining the agent's role and instructions
- user: what the customer says, transcribed from audio
- assistant: the agent's previous responses

This message history grows with each turn, and even though current context windows (the maximum amount of text, measured in tokens, that an LLM can process at one time) are quite big, this means we could reach the limit. There are techniques to manipulate the context, but they are out of the scope of this article.

DISCLAIMER: From this point of the article on, I've used gen AI tools to help me write both text and code (not blindly pasting results and letting agents run free). It is my belief that, like any other tool, they are useful, but best used as a "crutch" to deal with some aspects of work. I could extend myself, but I think there's material for another article there.

The Project

To demonstrate how to use Pipecat, we're going to build a very simple voice agent acting as a burger restaurant employee. Its job is to take an order from a customer using only items from a catalog, read back the price to confirm, and create the order. The repository is available to clone here: https://github.com/A-Tr/burger-bot

B.O.T.: Burger Order Taker

The starting point of this agent comes from the official quickstart available in the Pipecat docs. At its core, a Pipecat bot consists of:

- Transport: handles the communication channel (WebRTC, Daily, etc.)
- STT Service: converts speech to text (Deepgram)
- LLM Service: generates responses (I used Google)
- TTS Service: converts text to speech (I used Cartesia; when you sign up you get a playground to test different voices, each suited to different languages)
- Pipeline: connects all components in a processing chain

Here's a simplified version of what a basic Pipecat bot looks like:
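The full version lives in bot.py in the repository. As a minimal sketch of the pattern (based on the public Pipecat quickstart; exact import paths and constructor arguments vary between Pipecat versions, and the transport setup handled by the development runner is omitted):

```python
# Minimal sketch of a Pipecat bot, following the quickstart-style API.
# Import paths and constructor arguments vary between Pipecat versions;
# see the repository's bot.py for the real implementation.
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineParams, PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.cartesia.tts import CartesiaTTSService
from pipecat.services.deepgram.stt import DeepgramSTTService
from pipecat.services.google.llm import GoogleLLMService


async def run_bot(transport):
    # The transport comes from the pipecat.runner development server (next section)
    stt = DeepgramSTTService(api_key=os.getenv("DEEPGRAM_API_KEY"))
    llm = GoogleLLMService(api_key=os.getenv("GOOGLE_API_KEY"))
    tts = CartesiaTTSService(
        api_key=os.getenv("CARTESIA_API_KEY"),
        voice_id=os.getenv("CARTESIA_VOICE_ID"),
    )

    # The context starts with the system prompt only; the aggregators below
    # append the user and assistant messages to it on every turn.
    messages = [{"role": "system", "content": "You are a helpful voice assistant."}]
    context = OpenAILLMContext(messages)
    context_aggregator = llm.create_context_aggregator(context)

    # The pipeline chains everything: audio in -> STT -> LLM -> TTS -> audio out
    pipeline = Pipeline([
        transport.input(),
        stt,
        context_aggregator.user(),
        llm,
        tts,
        transport.output(),
        context_aggregator.assistant(),
    ])

    task = PipelineTask(pipeline, params=PipelineParams(allow_interruptions=True))
    await PipelineRunner().run(task)
```

Swapping providers means changing one service class and an API key; the rest of the pipeline stays the same, which is the "plug and play" philosophy mentioned above.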
This project extends this basic pattern by:

- Adding session state management for tracking orders
- Registering custom functions that the LLM can call
- Making some API calls to load the catalog and create orders
- Loading a system prompt that defines the agent's personality and behavior, injecting the catalog from the previous step
- Registering event handlers to trigger actions during the agent's lifecycle

You can check the full file in bot.py.

Enough chit chat, I want to see it work

For a voice agent to work, you need a way to connect to the user and transmit audio. Thankfully, the Pipecat library provides the package pipecat.runner.run for local development. This package exposes a main method which, under the hood and with the default configuration, creates a WebRTC server and a simple frontend to connect to. The only rule for this to work is that the bot file must have a bot method. Get some API keys, install the dependencies and start the bot with uv run bot.py. Navigate to http://localhost:7860/client and you will be greeted with a simple interface to connect to the agent, plus some interesting info like conversation transcripts.

(screencap of web interface)

Important stuff

Functions

Functions (or tool calls) allow the LLM to execute deterministic code during the conversation. In our burger bot, you can check them in the /functions folder of the project. Each function has two components:

- Schema: defines the function signature (name, description, parameters) that tells the LLM when and how to call it
- Handler: the actual code that executes when called, returning a string response

Every handler takes a Pipecat FunctionCallParams object, which contains the function arguments specified in the schema. We also use the method params.result_callback to send the data we get from our function back to the LLM. Functions need to be added to the tools schema and registered with the LLM service:
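The real functions live in the /functions folder of the repository. As a sketch of the shape they take (the function name, its arguments and the order logic here are invented for illustration), assuming Pipecat's FunctionSchema and ToolsSchema helpers:

```python
# Sketch of one tool call; the name and logic are illustrative, not the repo's.
# Import paths vary between Pipecat versions.
from pipecat.adapters.schemas.function_schema import FunctionSchema
from pipecat.adapters.schemas.tools_schema import ToolsSchema
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.llm_service import FunctionCallParams

# 1. Schema: tells the LLM when and how to call the function
add_item_schema = FunctionSchema(
    name="add_item_to_order",
    description="Add a catalog item to the customer's current order",
    properties={
        "item_name": {"type": "string", "description": "Name of the catalog item"},
        "quantity": {"type": "integer", "description": "How many units to add"},
    },
    required=["item_name", "quantity"],
)

# 2. Handler: deterministic code executed when the LLM calls the function.
# `session_state` stands in for the typed order state described below.
async def add_item_to_order(params: FunctionCallParams):
    item = params.arguments["item_name"]
    quantity = params.arguments["quantity"]
    session_state.add_item(item, quantity)
    # result_callback sends the outcome back to the LLM so it can answer the user
    await params.result_callback(f"Added {quantity} x {item} to the order")

# 3. Registration: expose the schema in the context and bind the handler.
# `messages` and `llm` come from the bot setup shown earlier.
tools = ToolsSchema(standard_tools=[add_item_schema])
context = OpenAILLMContext(messages, tools=tools)
llm.register_function("add_item_to_order", add_item_to_order)
```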
The agent decides when to call these functions based on the instructions in the system prompt and the conversation. This is the "non-deterministic meets deterministic" aspect: the LLM decides when to call functions (guided by the prompt), but the functions themselves execute deterministically.

Session state

We use a typed model, SessionState, to store the order items, since it provides type safety and validation out of the box. This helps a lot when we need to be deterministic (if a user adds a burger, we need to make sure that exactly one burger ends up in the order). Instead of relying on the LLM to remember the order (which could be inconsistent or forgotten), we store it in our typed structure. When the LLM calls read_current_order(), it reads from this state, not from its own memory. This ensures that order totals, item lists, and confirmations are always accurate and consistent, regardless of how the conversation flows or how many tokens have been used. The Pydantic model also provides validation (e.g., ensuring quantities are positive) and helper methods for common operations like calculating totals or clearing the order.

Prompts

A good system prompt defines the agent's identity, the goal to be achieved, and the steps to get there. It's also recommended to include "style" instructions to make the LLM "aware" that its messages will be fed to a TTS model. Our burger bot is an employee named Sam whose task is to take orders, with some functions available to work on the order; you can check the full prompt in the repository.

Why detailed prompts matter: LLMs are non-deterministic; given the same input, they might respond differently each time. A detailed prompt with clear instructions, conversation flow, and function usage guidelines helps guarantee consistency in the conversation. Without explicit guidance, the agent might skip steps, forget to confirm orders, or use functions incorrectly. The prompt acts as a constraint, steering the agent toward the desired behavior pattern while still allowing natural conversation.

Event Handlers

Event handlers allow you to execute code in response to lifecycle events in the voice agent's execution. In Pipecat, you register event handlers using decorators. While the transport object is a common place to register handlers (for connection or disconnection events), other classes in the Pipecat framework may also expose event handlers for different lifecycle events. In our burger bot, we handle two key events.

1. Client Connected

This event fires when a user first connects to the voice agent:
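A sketch of the handler, assuming the transport and task objects from the pipeline setup above:

```python
# Greeting handler; assumes `transport` and `task` from the pipeline setup.
from pipecat.frames.frames import LLMRunFrame

@transport.event_handler("on_client_connected")
async def on_client_connected(transport, client):
    # Queue an inference run so the agent speaks first: with only the
    # system prompt in context, the LLM produces the greeting.
    await task.queue_frames([LLMRunFrame()])
```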
The LLMRunFrame() tells the LLM to generate a message the moment the client connects. Since the only message in context is the system prompt, it will generate the greeting. Without this, the agent would wait for the user to speak first, which might feel awkward.

2. Client Disconnected

This event fires when the user disconnects or the session ends. It's the perfect place for cleanup, such as persisting data or cancelling the pipeline task.

DISCLAIMER: Here I stop using AI.

Finishing thoughts

I hope this has been as entertaining for you to read as it was for me to write, and that you've learned something along the way. My initial idea was to write only this article, but since I'm now going to have much more free time, there are some topics regarding voice agents that deserve their own article:

- How can we "test" voice agents and make sure they work as expected? The wonderful world of evals
- Support for other languages apart from English
- Types of transports and how to connect with them
- Dynamic context switching
- Creating your own frontend to connect to the agent

Thanks for reading this, see you soon.