VCTK stands for Voice Cloning Toolkit. The CSTR VCTK Corpus is a multi-speaker English speech corpus from the Centre for Speech Technology Research (CSTR) at the University of Edinburgh.

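1. Introduction

The corpus contains speech from 109 native English speakers with various accents (110 in version 0.92). Each speaker reads about 400 sentences, most of them selected from newspaper text, plus the Rainbow Passage and an elicitation paragraph, both reproduced below.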


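1.1 The Rainbow Passage

This passage is useful as a reading text for several reasons:

  • Phonemes: it covers a wide range of English phonemes (the smallest units of speech), including vowels, consonants, and many of their combinations.

  • Pronunciation variation: the sentence structure and content require the speaker to use different articulation patterns, which brings out variety in the speech.

  • Grammar: the passage contains complex syntactic structures, such as compound sentences and subordinate clauses, which helps in studying speech across different grammatical contexts.

  • Vocabulary: it covers a variety of words and expressions, making it suitable for testing fluency and accuracy.

  • Content: it covers the rainbow as a natural phenomenon, along with cultural legends, historical explanations, and metaphor.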


When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow. The rainbow is a division of white light into many beautiful colors. These take the shape of a long round arch, with its path high above, and its two ends apparently beyond the horizon. There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it. When a man looks for something beyond his reach, his friends say he is looking for the pot of gold at the end of the rainbow.

Throughout the centuries people have explained the rainbow in various ways. Some have accepted it as a miracle without physical explanation. To the Hebrews it was a token that there would be no more universal floods. The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain. The Norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky. Others have tried to explain the phenomenon physically. Aristotle thought that the rainbow was caused by reflection of the sun's rays by the rain. Since then physicists have found that it is not reflection, but refraction by the raindrops which causes the rainbows. Many complicated ideas about the rainbow have been formed.

The difference in the rainbow depends considerably upon the size of the drops, and the width of the colored band increases as the size of the drops increases. The actual primary rainbow observed is said to be the effect of super-imposition of a number of bows. If the red of the second bow falls upon the green of the first, the result is to give a bow with an abnormally wide yellow band, since red and green light when mixed form yellow. This is a very common type of bow, one showing mainly red and yellow, with little or no green or blue.

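1.2 Elicitation Paragraph

This paragraph is intended to reveal the speaker's accent. It provides:

  • Specific words and contractions whose pronunciation is a marker of particular accents.

  • Variation in pitch and stress that reflects the characteristics of a particular accent.

  • Specific grammatical structures, phrases, and informal expressions, used to assess how an accent or dialect behaves in different contexts.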


Please call Stella. Ask her to bring these things with her from the store: six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.

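2. Data details

2.1 Data format

  • Recording

    • Two microphones were used: an omnidirectional microphone (DPA 4035) and a small-diaphragm condenser microphone with very wide bandwidth (Sennheiser MKH 800).

    • Recordings were made at a 96 kHz sampling rate and 24-bit depth in a hemi-anechoic chamber at the University of Edinburgh.

    • Exception: two speakers (p280 and p315) had technical problems with their MKH 800 recordings.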

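  • Conversion

    • All recordings were converted to 16 bits and downsampled to 48 kHz.

    • The recordings were manually end-pointed, that is, the silence at the start and end of each recording was removed.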
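
The conversion described above (24-bit to 16-bit, 96 kHz to 48 kHz, end-pointing) can be sketched in plain Python. This is only an illustration under assumptions: the tools actually used for the corpus are not stated here, the end-pointing was done manually rather than with a threshold, and the function names, filter length, cutoff, and threshold below are all hypothetical choices.

```python
import math


def lowpass_taps(num_taps=63, cutoff=0.23):
    """Windowed-sinc low-pass FIR filter (hypothetical design).

    cutoff is in cycles per input sample; 0.23 sits just below 0.25,
    the Nyquist limit after halving the sampling rate.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        hamming = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * hamming)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unit gain at DC


def downsample_by_2(samples, taps=None):
    """Low-pass filter, then keep every second sample (96 kHz -> 48 kHz)."""
    taps = taps or lowpass_taps()
    half = len(taps) // 2
    out = []
    for i in range(0, len(samples), 2):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + half - k
            if 0 <= j < len(samples):  # treat samples outside the signal as zero
                acc += h * samples[j]
        out.append(acc)
    return out


def to_int16(sample):
    """Map a 24-bit-scaled value to a 16-bit integer (no dithering)."""
    value = int(round(sample)) >> 8  # drop the 8 least significant bits
    return max(-32768, min(32767, value))


def trim_silence(samples, threshold):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]
```

A typical call chain would be `trim_silence([to_int16(s) for s in downsample_by_2(raw)], threshold)`, where `raw` is a list of 24-bit integer samples. For real audio, a dedicated resampling library is the more practical choice.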

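  • Transcripts

    • Text files (transcripts) are provided for 109 of the 110 speakers and are stored in the '/txt' folder.

    • Exception: the text for 'p315' was lost because of a hard disk error.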
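
When loading the corpus, the missing transcripts have to be handled explicitly. The sketch below pairs each audio file with its transcript and collects the files that have none. The directory layout is an assumption, not something stated above: it supposes audio at `wav48/<speaker>/<utterance>.wav` and text at `txt/<speaker>/<utterance>.txt`. The function name and its parameters are hypothetical and should be adjusted to the release actually downloaded.

```python
from pathlib import Path


def pair_audio_with_text(root, audio_dir="wav48", audio_ext=".wav"):
    """Return (pairs, missing) for a corpus rooted at `root`.

    pairs   -- list of (audio_path, transcript_string)
    missing -- audio files without a transcript (expected for p315)
    The folder and extension defaults are assumptions about the layout.
    """
    root = Path(root)
    pairs, missing = [], []
    for audio in sorted((root / audio_dir).glob(f"*/*{audio_ext}")):
        speaker = audio.parent.name
        text = root / "txt" / speaker / (audio.stem + ".txt")
        if text.is_file():
            pairs.append((audio, text.read_text(encoding="utf-8").strip()))
        else:
            missing.append(audio)
    return pairs, missing
```

Returning the unmatched files rather than silently skipping them makes it easy to confirm that only the expected speaker is affected.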

2.2 Derived versions

  • Original VCTK (2019-11-13)

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)

version 0.92: 10.94 GB

  • Device Recorded VCTK (Small subset version, 2018-03-06)

DR-VCTK, 1.671 GB

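  • Noisy Reverberant Speech Database (2017-09-14)

The noisy, reverberant speech is produced in three steps:

  1. Convolve the clean speech signal with a room impulse response (RIR). This simulates how speech propagates and reflects in a particular room, producing reverberation.

  2. Convolve the noise signal with an RIR. This simulates how the noise propagates and reverberates in the room.

  3. Add the reverberated speech signal and the reverberated noise signal to produce the final noisy, reverberant speech signal.
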
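
The three steps above can be sketched with a direct convolution in plain Python. This is a minimal illustration, not the database's actual generation pipeline: the function names are hypothetical, signals are plain lists of floats, and no level scaling between speech and noise is applied because none is described here.

```python
def convolve(signal, rir):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


def noisy_reverberant(speech, noise, speech_rir, noise_rir):
    """Reverberate speech and noise separately, then add them."""
    rev_speech = convolve(speech, speech_rir)  # step 1
    rev_noise = convolve(noise, noise_rir)     # step 2
    # Zero-pad the shorter signal so both have the same length.
    length = max(len(rev_speech), len(rev_noise))
    rev_speech += [0.0] * (length - len(rev_speech))
    rev_noise += [0.0] * (length - len(rev_noise))
    return [a + b for a, b in zip(rev_speech, rev_noise)]  # step 3
```

For signals of realistic length, an FFT-based convolution from a numerical library would be much faster than this quadratic loop.
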
  • Noisy speech database


  • Reverberant speech database


  • 96kHz version of the CSTR VCTK Corpus



