A few decades ago, we could not have imagined that we would soon be listening to machines singing and performing, and yet that is our current reality. When we exclaim that “Technology is taking over the world!”, we rarely consider the authenticity of the statement. It might be humorous to imagine a scenario where humans take the backseat in terms of performance and creativity, and watch robots perform theatrics. People in the 90s would undoubtedly have dismissed such a statement as some inane postmodern fantasy, and yet the 21st century has proved that such conceptions are truly possible. Hatsune Miku has made it possible.
Voice and Speech Synthesis
Speech synthesis, in rudimentary terms, is the artificial production and reproduction of human speech. One would be familiar with this technology through the text-to-speech features embedded in our electronic devices and, in a more familiar example, through Siri and Google Assistant. Speech synthesis is a deeply linguistic process: the input text is analysed in context, converted into phonemes, and then mapped onto comprehensible sounds in sequence, traditionally by drawing on recorded human voices. While the concept may seem dystopian, speech-synthesising programs are a necessary component of our quotidian lives, especially owing to their usage in accessibility features such as screen readers and translation tools.
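To make that pipeline concrete, here is a minimal, illustrative sketch in Python. The tiny lexicon and the final synthesis step are hypothetical stand-ins for what real engines derive from large pronunciation dictionaries and recorded or modelled voice data.

```python
# Toy sketch of a text-to-speech pipeline: text -> phonemes -> audio units.
# The lexicon and the synthesis step are illustrative placeholders.

# A tiny grapheme-to-phoneme lexicon (ARPAbet-style symbols).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text: str) -> list[str]:
    """Normalise the text and map each word to a phoneme sequence."""
    phonemes = []
    for word in text.lower().split():
        # Real systems fall back on letter-to-sound rules for unknown words.
        phonemes.extend(LEXICON.get(word, ["<UNK>"]))
    return phonemes

def synthesise(phonemes: list[str]) -> None:
    """Stand-in for the final stage: a real engine selects and concatenates
    (or neurally generates) short audio units for each phoneme."""
    print("Playing units:", " ".join(phonemes))

synthesise(text_to_phonemes("Hello world"))
```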
Vocal synthesisers, commonly known as vocal synths, are musical programs which enable the user to generate vocals near-instantaneously by combining the various aspects of human speech. The process closely resembles ordinary applications of speech synthesis: the user inputs a string of lyrical text along with musical specifications such as pitch and duration, then shapes and warps the generated tone to fit their preferences. It is a more modern rendition of speech synthesis, one that places the emphasis on performance and entertainment rather than accessibility.
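The inputs a user supplies can be pictured as a sequence of note events. The data model below is a hypothetical simplification; commercial editors such as VOCALOID and Synthesizer V expose comparable fields (lyric, pitch, timing, expressive controls) through their piano-roll interfaces.

```python
from dataclasses import dataclass

@dataclass
class VocalNote:
    """One sung syllable with its musical parameters (illustrative model)."""
    lyric: str              # syllable to be sung
    pitch: int              # MIDI note number (60 = middle C)
    start_beats: float      # position in the phrase
    length_beats: float     # duration
    vibrato_depth: float = 0.0  # one of many expressive controls a user can warp

melody = [
    VocalNote("la", 67, 0.0, 1.0),
    VocalNote("la", 69, 1.0, 1.0, vibrato_depth=0.3),
    VocalNote("ti", 71, 2.0, 2.0, vibrato_depth=0.5),
]

for note in melody:
    print(f"{note.lyric}: MIDI {note.pitch} for {note.length_beats} beats")
```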
Several programs are dedicated to vocal synthesis, such as Synthesizer V (SynthV) and Plogue’s Alter Ego. Yet despite these modern offerings, with their improved artificial augmentation and AI features, Yamaha’s Vocaloid continues to be hailed as the most widely used synthesising software. Initially created as a collaborative project between the Yamaha Corporation and the Music Technology Group, Vocaloid has become the most commercially successful synthesising software due to the inherent mascotification and “moe anthropomorphism” of its voice-providers. Vocaloid employs a unique tactic of crafting distinct, individualistic identities for the voicebanks it offers, such that these additions function less as mere voice libraries and more as brand mascots. For instance, Hatsune Miku, the face of Vocaloid, inseparable from her blue-haired, twin-tailed image, borrows her voice from the Japanese voice actress Saki Fujita. Mascot and icon culture has proven to be a successful business model, so Vocaloid’s quick success is certainly understandable. Moreover, the program has advanced significantly since its conception, extending to voicebanks in languages such as Mandarin, Korean and Spanish.

Death of the Singer?
The primary point of contention arises over ownership rights to one’s voice and the ethics of synthesising music with these programs. Whether a person knowingly licenses recordings of their voice as data for these voicebanks or remains unaware of such use, there must exist transparency between provider and user. Unless there exists a formal agreement or contract between the developers, users and voice-providers, synthesised works can be considered infringements of intellectual property rights and individual rights.
In ordinary circumstances, a company cannot sample an employee’s voice without the employee’s consent. In the case of Vocaloid, there does exist a formal agreement with the voice-providers. Such an arrangement also sits well with Vocaloid’s inherent creation of mascots, which separates the art from the original provider. For instance, a song sung by Hatsune Miku would be credited to the producer of the song and to Hatsune Miku, not to Saki Fujita. Such groundwork can guard against malicious assaults on a person’s dignity. On the other hand, it raises another crucial question: who is the singer?
Once the character of Hatsune Miku has been created, does the original voice-provider drop out of every equation pertaining to the composition of music? In that case, how does one allocate the revenue earned from these musical scores and performances among the parties? Again, royalty payments do exist where software companies have established agreements in advance, but not so much where tools bypass such arrangements entirely. For instance, several users have created AI remixes and compositions of K-Pop (Korean pop) music, which emerge primarily in fandom spaces, not official ones. Such renditions amass large view counts and revenue even though the voice-providers are seemingly unaware of the compositions. The same has been done with the voices of deceased artists.
The main issue with the AI generation of music using software that has no transparency or legal agreement with the musicians whose voices are sampled is that it negates the creativity of human composers and musicians, especially when such works are monetised. Besides being a daunting legal issue that could entail potential lawsuits, it also feeds misinformation that is readily absorbed into the collective consciousness of the public. In 2023, it was discovered that the rapper Jay-Z’s voice was being cloned by AI music-generation tools, and the question of the singer’s role in such an AI cesspool was once again brought to the fore. At the end of it all, all we can ask ourselves is: who is truly singing? The man or the machine?

Can It Be Controlled?
Having discussed the ethical dilemmas that typically arise from the nature of the phenomenon, it would be wrong to sideline the positive possibilities that AI music software can offer budding musicians. As with any AI tool, the prospects for collaboration are endless.
From the very outset, features such as Auto-Tune have been used so extensively that mainstream music can scarcely function without them, and they have certainly proven successful. There is nothing inherently wrong with modifying one’s own voice for creative purposes. Moreover, Auto-Tune and AI music-generation software enable creators who are more inclined towards other creative pursuits, such as game development and filmmaking, to incorporate unique auditory elements into their works, provided the legalities behind the process are transparent. For instance, several Vocaloid producers “collaborate” with the virtual musicians, combining human vocals with AI ones. In fact, an entire game, Project Sekai, is built around this concept. Perhaps it is Vocaloid’s framing of its voices as animated virtual singers, with identities distinct from their voice-providers, that enables such collaborations.
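For the curious, the arithmetic at the heart of pitch correction is simple to sketch. The example below shows only the pitch-snapping step on the twelve-tone equal-tempered scale; real correctors also track pitch continuously and preserve formants, and the detected frequency here is a hypothetical value standing in for a pitch detector’s output.

```python
import math

# Naive sketch of the core idea behind "autotune": take an estimated
# pitch and snap it to the nearest equal-tempered semitone.

def nearest_semitone_hz(freq_hz: float, a4: float = 440.0) -> float:
    """Snap a frequency to the closest note on the 12-tone scale."""
    semitones_from_a4 = 12 * math.log2(freq_hz / a4)
    return a4 * 2 ** (round(semitones_from_a4) / 12)

detected = 452.3  # slightly sharp A4, as if reported by a pitch detector
corrected = nearest_semitone_hz(detected)
print(f"{detected} Hz -> {corrected:.1f} Hz")  # snaps to 440.0 Hz
```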
Naturally, when one listens to music hermeneutically, it becomes difficult not to focus on the socio-cultural nuances of using AI in creative pursuits, especially when music has historically been a means of expression and of forging deeper cultural ties. However, it also becomes imperative to consider the heightened accessibility that these innovations offer to creatives. Songmaking and composition have become streamlined, cost-effective and widespread, without relying upon older, elitist conventions. Human input is still required to a large degree to produce these songs, especially in songwriting and programming, and compositions generated instantaneously tend to fall short in quality.
Policymakers can make efforts to safeguard the intellectual rights to a person’s voice and prevent unlawful applications of it. Discourses on copyright law and artificial intelligence have persisted for decades, and such discussions will not dissipate any time soon unless we begin redefining our policies.
Conclusion
One of the biggest challenges we face as human beings is the possibility of change and the need to adapt. While we tend to dream big, of flashing images of robots becoming as humanlike as possible, our apprehension of the uncanny supersedes our awe. AI music is certainly one of those fields where we have ideated greatly but are hesitant to move forward, especially with the surge in virtual entertainment that accompanies it. However, as with any discourse on AI, collaboration and the cautious utilisation of these tools are generally the key to technological development. Innovations will persist and continue to reach greater heights, but unless we adapt ourselves, our society and our laws to the changing times, can we truly lament the existence of these new forms of entertainment?