So, I just attended a talk by Carol Fowler, an academic who works in articulatory/gestural phonology. She is awesome and her talk was awesome - I was more impressed by it than I've been by most other talks, even though it's only barely related to my usual interests. I'm now going to try to capture as much of it as I can remember, and since it's about cogsci-ish stuff I'm putting it up here where other people with similar interests can read it, if they want. It's quite rambly though, and I don't know enough about phonetics to really explain most of it clearly, so if you want the cool/easier-to-understand bits, skip down to the embodied cognition/mirror neuron section.
Relevant jargon:
gestures: when applied to speech production, these refer to movements of the tongue, jaw, lips, etc.
formants: the frequencies of speech sounds. More specifically, most speech sounds show up on a spectrogram/graph/whatever as 2 or more lines at various heights which our brain combines to produce a sound. (There's a toy sketch of this just after the list.)
VOT - voice onset time: I'm not really sure what this is since it's outside my area, but it's one of the qualities of the speech signal that lets you distinguish between different sounds. VOTs vary across demographics in much the way you would expect, which is to say they vary between individuals but there are also broad gender/cultural trends.
TMS: transcranial magnetic stimulation - it's gotten popular recently as a way to shut off areas of the brain but apparently it's also what they use for stimulating specific muscles directly.
Stop - a consonant that involves full stoppage of the vocal tract. The most consonanty consonants that exist. Sounds like /p/, /t/, /g/ are stops.
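Not from the talk, just my attempt to make the "formants as lines on a spectrogram" idea concrete: a minimal Python sketch that builds a fake vowel out of two formant-like resonances and checks that the energy clumps into bands at those frequencies. The frequency values are made up (only roughly /i/-ish), so treat it as an illustration rather than real phonetics.

```python
# Toy illustration (not from the talk): a fake "vowel" built from two
# formant-like resonances, plus its spectrogram. Frequency values are
# rough, made-up numbers for something /i/-like.
import numpy as np
from scipy.signal import spectrogram

fs = 16000                      # sample rate (Hz)
t = np.arange(0, 0.5, 1 / fs)   # half a second of "speech"

f0 = 120                        # fake fundamental (voice pitch)
formants = [300, 2300]          # invented F1/F2 values

# Build a buzzy source (harmonics of f0) and boost the harmonics nearest
# each formant, which is roughly what vocal-tract resonances do.
signal = np.zeros_like(t)
for k in range(1, 60):
    h = k * f0
    gain = sum(np.exp(-((h - f) / 150.0) ** 2) for f in formants)
    signal += gain * np.sin(2 * np.pi * h * t)

# On the spectrogram the energy clumps into horizontal bands near 300 Hz
# and 2300 Hz -- those bands are the "2 or more lines" (formants).
freqs, times, power = spectrogram(signal, fs=fs, nperseg=512)
for f in formants:
    band = power[np.abs(freqs - f).argmin()]
    print(f"energy near {f} Hz: {band.mean():.2f}")
```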
Overall argument: a really big part of language perception relies on embodied cognition type stuff, because trying to reconstruct actual sounds from a continuous stream is really really hard.
Phonetics stuff: Speech signals are not consistent. For example, in /gi/ versus /gu/, the formants are what you would expect for the vowel parts (/i/ is high, /u/ is lower), but the onsets that make people hear the /g/ look like a little downtick on the first one and a little uptick on the second. The short noisy burst that is a stop looks almost exactly the same regardless of whether it's a /p/, /t/ or whatever, so Liberman guessed that listeners must be using articulatory information to disambiguate them. Evidence for this can be seen in the McGurk effect, where you listen to one syllable while seeing another being mouthed, and what you perceive is usually somewhere between the two but closer to what you see. The McGurk effect has been replicated in multiple modalities, including one Carol recounted where the listener put their hand over the speaker's face while she mouthed various syllables, and another where people got a puff of air on their necks to mimic the aspiration of the /p/ in /pa/ versus the non-aspiration in /ba/. Writing is one of the few modalities that doesn't show an effect. When you put other syllables such as /ar/ or /al/ in front of an ambiguous /pa/ or /ga/, people hear it in a way that indicates they're overcompensating for the effects of coarticulation of the two consonants. (The other theory had to do with formants, but a Tamil linguist finally found a minimal pair to test both theories and it came out in favour of overcompensation.)
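To make the downtick/uptick point a bit more concrete, here's a toy synthesis of my own (not Carol's, and with invented frequencies): a single-formant "syllable" whose onset glides down into a high vowel in one case and up into a lower vowel in the other, mirroring the description above. Acoustically the two onsets look quite different, even though listeners would hear the same /g/ in both.

```python
# Toy sketch (mine, not from the talk): the onset of "the same" /g/ looks
# acoustically different depending on the vowel that follows. Frequency
# values are invented; the shapes just mirror the description above.
import numpy as np
from scipy.signal import chirp

fs = 16000
trans = np.arange(0, 0.05, 1 / fs)    # 50 ms consonant-to-vowel transition
vowel = np.arange(0, 0.25, 1 / fs)    # 250 ms vowel steady state

def syllable(f2_onset, f2_vowel):
    """A crude one-formant 'syllable': a glide from the onset frequency
    into the vowel's steady frequency."""
    onset = chirp(trans, f0=f2_onset, t1=trans[-1], f1=f2_vowel, method='linear')
    steady = np.sin(2 * np.pi * f2_vowel * vowel)
    return np.concatenate([onset, steady])

gi_like = syllable(f2_onset=2600, f2_vowel=2300)  # little downtick into high /i/
gu_like = syllable(f2_onset=700, f2_vowel=900)    # little uptick into lower /u/

for name, (start, end) in [("gi-ish", (2600, 2300)), ("gu-ish", (700, 900))]:
    print(f"{name}: onset {start} Hz -> {end} Hz "
          f"({'falls' if end < start else 'rises'}), same consonant percept")
```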
The perception-by-synthesis argument: Someone (Liberman?) thinks people understand speech by modelling possible gestures until they find the correct one. Carol disagrees with this, because in perception you only have a very limited time to work out what was said, and your brain doesn't enjoy being wrong because that means more work, so trial and error seems unlikely. Also, no one actually speaks identically, so there's no way I can model what you said accurately in any case (although this one's kind of a weak argument, because it's a matter of getting close enough).
Mirror neurons and embodied cognition stuff: People primed with thoughts about old people moved more slowly on their way to the lift afterwards. Subjects made to hold a pen between their teeth, forcing their mouth into a 'smile', were more likely to perceive other faces as smiling. In speech perception, people who had a machine pulling their mouths up or down, to mimic the mouth shape for forming various vowels, were more likely to hear ambiguous vowels as the one corresponding to their mouth shape, even though they weren't making that mouth shape deliberately. Similarly, when TMS was used to stimulate subjects' lips or tongue while they heard various consonants, they were more likely to perceive the consonant made with that part of the mouth. People watching other people walk show short bursts of neural activity corresponding to leg muscle movement; this kind of thing doesn't happen when watching a movement that isn't humanly possible, like wagging a tail. Similar effects show up in speech perception. When TMS was used to knock out part of the articulatory apparatus, people's speech perception suffered. When subjects had their jaw moved in a specific way by a machine, such that it didn't actually affect how they produced a specific vowel, it still affected how that vowel was perceived by those subjects later.
The chinchilla experiment: chinchillas kept in a US lab were successfully taught to distinguish between /pa/ and /ba/. More interestingly, the acoustic properties they were picking up on were specific to English - apparently the way English speakers distinguish between the two sounds (something to do with VOTs) is fairly unusual; most languages put the boundary somewhere slightly different. So that's evidence against there being a special human phonetic module for speech perception. Other animals can do it too; chinchillas are just the silliest and therefore one of the strongest examples of it not being anything special. But Carol mentioned that she's skeptical of this experiment and would like to see it replicated.
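For what "putting the boundary somewhere different" means, here's a minimal sketch of my own (not from the talk): the same physical stimulus gets categorised as /ba/ or /pa/ depending on which language's VOT boundary you apply. The boundary values and the "other-lang" label are rough illustrations I made up, not measurements.

```python
# Toy sketch (mine, not from the talk): categorising a syllable as /ba/ or
# /pa/ purely by where its voice onset time falls relative to a language's
# category boundary. Boundary values are rough illustrations, not data.
BOUNDARY_MS = {
    "English-ish": 30,   # roughly where English listeners start hearing /pa/
    "other-lang": 10,    # made-up example of a language with a shorter boundary
}

def categorise(vot_ms: float, language: str) -> str:
    """Return '/ba/' or '/pa/' for a given VOT under a language's boundary."""
    return "/pa/" if vot_ms >= BOUNDARY_MS[language] else "/ba/"

# The same stimulus can land on different sides of the category boundary
# depending on which language's categories you bring to it.
for vot in (5, 20, 45):
    labels = {lang: categorise(vot, lang) for lang in BOUNDARY_MS}
    print(f"VOT {vot:>2} ms -> {labels}")
```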
Questions: Do blind people have correspondingly worse speech perception, since they lack a lot of the cross-modal information which is apparently so important? And have there been studies of the mirror neuron/embodied cognition stuff in signed languages?