Offensive AI

Deepfakes & Social Engineering

Synthetic media — video, audio, and text — used to manipulate individuals and undermine trust in communications.

A deepfake is synthetic audio, video, or imagery generated by AI to impersonate a real person. The technology is no longer experimental: voice cloning works from as little as three seconds of source audio, and real-time video deepfakes are convincing enough to fool finance professionals on live video calls. What makes deepfakes dangerous in a security context is not the technology itself but how it slots into existing social-engineering playbooks. The traditional defense against business email compromise was "if it seems suspicious, get the executive on the phone." Attackers who can produce the executive's voice on demand now actively exploit that defense.

What you'll learn

Key takeaways from this topic.
  • Explain how voice cloning and video deepfakes have changed the verification step in social-engineering attacks.
  • Identify the typical attack chain in deepfake-enabled fraud, from reconnaissance to financial loss.
  • Describe the verification controls that still work when audio and video can no longer be trusted as evidence of identity.

At a glance

Fast mental model before you dive in.
Core concepts
  • Voice cloning
  • Video deepfakes
  • Synthetic identity
Techniques
  • Executive impersonation
  • Real-time video fraud
  • Multi-modal attack chains
Defenses
  • Out-of-band verification
  • Shared-secret challenges
  • Process-based controls

Core idea

The technology behind deepfakes matters less than the assumption it breaks. For decades, organizations treated human senses as a verification channel. If you heard your CFO's voice on the phone, that was confirmation. If you saw your CEO on a video call, that was confirmation. Internal fraud-prevention training used to recommend "pick up the phone and call them back" as the gold-standard countermeasure to suspicious email requests. AI-generated audio and video have systematically broken this assumption, which means the entire verification model has to be rebuilt around something other than what the target sees and hears.

The collapse of barriers to entry is the second key fact. Voice cloning from a few seconds of source audio used to require specialized research labs. By 2024, commercial tools made it available to anyone, and open-source models pushed the cost essentially to zero. Real-time video deepfakes, the kind that work in live conversations, were considered years away as recently as 2022. The Arup case in early 2024, in which an employee was tricked into wiring $25 million after a video call where every executive present was a deepfake, demonstrated that "years away" had already arrived.

The conceptually important shift is to treat sensory verification as evidence that can be forged, just as email headers and signatures can be forged. The countermeasures that survive this shift are procedural and out-of-band: verifying through a channel the attacker did not provide, requiring multiple humans to approve high-impact actions, and using shared secrets or callback procedures that the deepfake cannot replicate.

How it works

A deepfake-enabled social-engineering attack follows a recognizable chain. First, the attacker selects a target organization and identifies a high-trust impersonation target, typically an executive or finance officer whose voice and face are publicly available through earnings calls, conference talks, podcast appearances, or company videos. As little as three seconds of clean audio is sufficient to produce a convincing voice clone with modern tools, and video deepfakes require only modest amounts of source footage.

Second, the attacker builds a pretext that fits the impersonated person's role. The most common pattern is a confidential, time-sensitive financial transaction (an acquisition, a settlement, a regulatory matter) that justifies bypassing normal approval procedures and asking the employee not to discuss it with colleagues. This pretext exploits two psychological levers simultaneously: authority (the request comes from someone senior) and urgency (there is no time for normal verification).

Third, the attacker initiates contact, often starting with text-based channels and escalating to voice or video when the target hesitates. The Hong Kong Arup attack and the Singapore variant that followed both used this pattern: an initial email raised suspicion, the target asked for verification, and the attacker offered a video call. The deepfake on the video call provided the verification the target was looking for, which is precisely why it was effective. The defensive instinct that should have caught the attack was the very thing the attacker had prepared for.

Fourth, the actual exploit happens after the deepfake has established trust. The target executes the requested wire transfer, shares credentials, or grants access. Funds move quickly through correspondent banks and money mules, and by the time the fraud is discovered, the money is typically beyond recovery. In the Arup case, 15 separate transactions totaling $25 million were processed before the fraud was identified.

Real-world impact

The data is no longer anecdotal. The FBI's 2024 Internet Crime Report attributed $2.77 billion in losses to business email compromise, the fraud category these attacks extend, across more than 21,000 complaints. Deepfake vishing (voice phishing using cloned audio) surged over 1,600% in Q1 2025 compared to Q4 2024 in US incidents alone. Deloitte projected AI fraud losses against US enterprises to reach roughly $40 billion annually by 2027. Group-IB reported that more than 10% of surveyed financial institutions had suffered deepfake vishing losses exceeding $1 million per incident.

The case studies are equally concrete. In early 2024, an Arup employee wired $25 million after a video conference call in which every participant except the victim was a deepfake. In March 2025, a finance director at a Singapore firm authorized a $499,000 transfer after a similar multi-person deepfake video call. In 2024, a Ferrari executive prevented an attack by asking the caller what book the real CEO had recently recommended; the impersonator could not answer and ended the call. WPP's CEO was impersonated through voice cloning in an attempted credential-harvesting attack. LastPass's CEO was impersonated through WhatsApp voicemails. Each case involved sophisticated multi-channel preparation and exploitation of voice or video as a verification layer the attacker had already broken.

The pattern across these incidents is consistent: the deepfake itself is rarely the only attack vector. It is the verification layer in a multi-stage attack that also includes phishing emails, compromised accounts, and pretexts crafted from OSINT. The defenses that worked did so because of procedural controls (callback to a known number, shared secrets, multi-party authorization) that were independent of what the target saw and heard.

Warning signs

Patterns worth investigating further.
  • A senior executive contacts an employee through a channel that does not match their normal communication pattern (WhatsApp instead of email, personal phone instead of office line) with an urgent financial or access request.
  • A live video call exhibits subtle artifacts: unnatural eye movement, asymmetric facial responses, audio that does not perfectly sync with mouth movement, or reduced video resolution that conveniently hides finer details.
  • An urgent request is accompanied by explicit instructions not to verify with anyone else or to keep the matter confidential within a small group.
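
Two of the warning signs above (an unusual channel and urgency paired with secrecy) can be expressed as a simple triage check. The following is a minimal Python sketch, assuming a hypothetical `Request` record; the field names and rules are illustrative assumptions, not part of any real product, and video artifacts are left out because they cannot be reduced to a boolean field.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of an inbound request; all field names are illustrative."""
    channel: str                    # channel the request arrived on
    usual_channels: frozenset       # channels this sender normally uses
    is_urgent: bool
    asks_for_secrecy: bool
    involves_money_or_access: bool


def warning_signs(req: Request) -> list:
    """Return the warning signs that this request matches."""
    signs = []
    # Sign 1: financial or access request on a channel the sender does not normally use.
    if req.involves_money_or_access and req.channel not in req.usual_channels:
        signs.append("unusual channel for a financial or access request")
    # Sign 3: urgency combined with instructions to keep quiet or skip verification.
    if req.is_urgent and req.asks_for_secrecy:
        signs.append("urgency combined with instructions not to verify")
    return signs


req = Request(
    channel="whatsapp",
    usual_channels=frozenset({"email", "office_phone"}),
    is_urgent=True,
    asks_for_secrecy=True,
    involves_money_or_access=True,
)
for sign in warning_signs(req):
    print("investigate:", sign)
```

A check like this only prompts investigation. It does not establish that a request is genuine, so a request that matches no rule still goes through the normal verification procedure.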

DEEP DIVE

Why voice cloning broke the callback defense

For most of the 2010s, business email compromise prevention training said the same thing: when in doubt, call them back. The reasoning was sound. Even if an attacker could spoof an email, they could not produce the executive's voice in real time. The phone call was a reliable second channel, and most organizations built their wire-transfer approval procedures around it.

Voice cloning broke this defense, and it broke it quickly. The technology timeline is worth understanding. In 2019, the first publicly documented voice-clone fraud, the case of a UK energy firm losing about $243,000, required hours of source audio and produced output that was passable but not seamless. By 2022, commercial tools could clone voices from minutes of audio. By 2024, McAfee and others demonstrated that three to five seconds of source material was sufficient to produce clones with 85% acoustic match to the original speaker.

The implication is that the callback defense now has two failure modes. First, the executive's voice can be cloned and used to answer a callback that the attacker has arranged to have redirected to a line they control. Second, the attacker can initiate the voice call themselves, knowing the target will perceive a familiar voice as confirmation. Either way, hearing the voice no longer constitutes verification.

What survives this shift is callback to a known number using a phone book the recipient maintains independently of the message that prompted the call. If the email asks for a transfer and provides a callback number, that number is part of the attack. If the recipient calls the executive's known office number from memory or a verified internal directory, the attack chain breaks because the attacker does not control that line. This is a small but important distinction that many organizations still get wrong.
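
The distinction can be made concrete with a short Python sketch. It assumes a hypothetical `TRUSTED_DIRECTORY` mapping that stands in for a verified internal directory; the addresses and phone numbers are made up. The point it illustrates is that the number supplied in the message is never used, even when one is present.

```python
from typing import Optional

# Hypothetical internal directory, maintained independently of any inbound message.
TRUSTED_DIRECTORY = {
    "cfo@example.com": "+1-555-0100",
}


def callback_number(claimed_identity: str, number_in_message: Optional[str] = None) -> str:
    """Return the number to dial when verifying a request.

    The number supplied in the message is deliberately ignored: it arrived
    through a channel the attacker may control.
    """
    try:
        return TRUSTED_DIRECTORY[claimed_identity]
    except KeyError:
        # No independently verified number: do not fall back to the message.
        raise LookupError(
            f"no verified number for {claimed_identity}; escalate instead of calling"
        ) from None


# The message offered +1-555-0199, but the directory entry is what gets dialed.
print(callback_number("cfo@example.com", number_in_message="+1-555-0199"))
```

When the directory has no entry, the function refuses to return a number at all. That is a design choice in this sketch: failing closed forces an escalation instead of a call to an attacker-supplied line.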

Real-time video deepfakes

Video deepfakes followed the same trajectory as voice clones, with about a two-year lag. The 2017 academic demonstrations required hours of compute time and produced output suitable only for prerecorded video. By 2023, real-time face-swap tools were available as commodity software. By early 2024, real-time multi-person deepfakes worked well enough to fool a finance worker on a multi-participant Zoom call, as the Arup case demonstrated.

The technical mechanism is conceptually straightforward. A model trained on enough images and video of the target can generate frames of the target's face making any expression, with any mouth movement, in any lighting. When combined with voice cloning and driven by a live human actor in real time, the result is a video stream where the actor's expressions and speech are mapped onto the target's appearance. The artifacts that betray the deepfake to a careful viewer (slightly inconsistent eye movement, asymmetric facial expressions, occasional desync between voice and lips) are subtle enough that most viewers in a typical work meeting will not notice them, especially under time pressure.

The Arup attack is worth studying in detail because it shows the attacker's full operational maturity. The victim was contacted first by email, raised suspicion, and asked to verify. The attacker then arranged a video call with multiple deepfaked executives, not just one, which provided social proof that overrode the initial doubt. The conversation continued long enough to seem authentic. By the time the call ended, the victim had been convinced thoroughly enough to process 15 separate transactions totaling $25 million. No single technical control could have caught this. The defense had to be procedural: a separate approval channel that was independent of the call itself.

The OSINT-to-deepfake pipeline

Deepfakes are not generated in a vacuum. They require source material, and that source material is almost always publicly available. Executives' voices are on earnings calls, conference talks, podcast appearances, and YouTube interviews. Their faces are in corporate biographies, news photos, and social media. Their writing style is in published articles and LinkedIn posts. Their typical phrases, mannerisms, and references are documented in years of public communication.

This is the same OSINT pipeline that powers AI-assisted phishing, repurposed to feed deepfake generation models. The Wiz CEO's voice was cloned from a publicly available conference talk. The Ferrari CEO's accent was reproduced from his many public appearances. The Hong Kong Arup executives were synthesized from corporate video material. Every executive who appears publicly is also providing training data for their own future impersonation.

This does not mean executives should disappear from public communication; that would be neither practical nor desirable. It means the defensive baseline must assume that any public-facing person can be deepfaked, and that the protections around access to that person's authority (financial signing power, system access, communication channels) must be designed accordingly. The right question is not "how do we prevent the deepfake" but "what process makes a deepfake insufficient to authorize the action."

What still works

Three categories of defense survive the deepfake shift, and they all share a property: they do not depend on the recipient distinguishing real from fake. The first is out-of-band verification: confirming any unusual or high-value request through a channel the recipient controls. If a phone call asks for a transfer, the recipient ends the call and dials the executive's known number, not a number provided in the call. If an email asks for credentials, the recipient confirms through a chat platform where the executive's identity is independently verified. The channel must be one the attacker did not provide.

The second is process-based controls: removing single-person authority over high-impact actions. Wire transfers above a threshold require two independent approvers. New vendor accounts require independent verification. Access escalations require a written request through a ticketing system rather than a verbal approval. These controls work because they require the attacker to compromise multiple independent humans simultaneously, which is significantly harder than producing a convincing deepfake of one.
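
The two-approver rule for wire transfers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference implementation: the `WireTransfer` class, the `APPROVAL_THRESHOLD` value, and the rule that smaller transfers need one approver are all hypothetical and would be set by an organization's own policy.

```python
from dataclasses import dataclass, field

APPROVAL_THRESHOLD = 10_000  # assumed policy threshold, in account currency


@dataclass
class WireTransfer:
    amount: float
    requester: str
    approvers: set = field(default_factory=set)

    def approve(self, approver: str) -> None:
        # Independence: the person who requested the transfer cannot approve it.
        if approver == self.requester:
            raise PermissionError("requester cannot approve their own transfer")
        # A set means the same person approving twice still counts once.
        self.approvers.add(approver)

    def can_execute(self) -> bool:
        # Above the threshold, two distinct approvers are required (assumed: one otherwise).
        required = 2 if self.amount > APPROVAL_THRESHOLD else 1
        return len(self.approvers) >= required


transfer = WireTransfer(amount=250_000, requester="alice")
transfer.approve("bob")
print(transfer.can_execute())  # False: only one approver so far
transfer.approve("carol")
print(transfer.can_execute())  # True: two distinct approvers
```

In this sketch a deepfake that convinces the requester is not enough on its own, because execution depends on approvers who were not on the call.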

The third is shared-secret verification: using information that only the real person would know and that is not present in public OSINT. A Ferrari executive stopped the attack on that company by asking the caller about a recent book recommendation. More formal variants include pre-shared verification phrases between executives and their direct reports, or callback procedures that use information from internal systems the attacker cannot access. The principle is the same: introduce a question whose correct answer cannot be derived from public material.
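
A pre-shared verification phrase can be checked without storing the phrase itself. The Python sketch below is one possible approach, assuming hypothetical `enroll` and `verify` helpers; it uses the standard library's `hashlib.pbkdf2_hmac` for a salted hash and `hmac.compare_digest` for a constant-time comparison. The iteration count and the normalization rule are assumptions.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune to local policy


def _normalize(phrase: str) -> bytes:
    # Ignore case and extra whitespace so a spoken phrase typed by the verifier still matches.
    return " ".join(phrase.lower().split()).encode("utf-8")


def enroll(phrase: str) -> tuple:
    """Return (salt, digest) to store; the phrase itself is never stored."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", _normalize(phrase), salt, ITERATIONS)
    return salt, digest


def verify(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Check a candidate phrase against the stored salted hash in constant time."""
    attempt = hashlib.pbkdf2_hmac("sha256", _normalize(candidate), salt, ITERATIONS)
    return hmac.compare_digest(attempt, digest)


salt, digest = enroll("blue heron at noon")
print(verify("Blue  Heron at noon", salt, digest))
print(verify("something from a podcast", salt, digest))
```

A static phrase is only as good as its secrecy: once it has been spoken on a call that may be compromised, it should be rotated.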

Detection technology (deepfake-detection AI, voice-print analysis, video-artifact scanning) is improving rapidly but should not be relied on as the primary defense. The arms race between deepfake generation and deepfake detection currently favors generation, and the gap is not closing. The detection tools are useful as supplementary controls and as forensic aids after an incident, but the operational defense against deepfake-enabled fraud is procedural, not technical. Organizations that have not redesigned their high-trust approval workflows for a world where audio and video can be forged remain exposed regardless of which detection tools they buy.