Post by hydrophilic on Oct 10, 2015 8:07:00 GMT
Using SAM in machine/assembly language is very easy. In both the original C64 version and my Alpha 128 version, all you need to do is copy a $9b-terminated string into a buffer at $9415 (bank 1 on the C128) and then JSR $9a03 (bank 1 on the C128). The buffer is 256 bytes. Because you need a $9b terminator, the max string length is 255 characters.
The string must be in SAM's phonetic language. (On the C64 version, you can use English text, but you must JSR $9a09 ... not yet working on C128.)
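To make the calling convention concrete, here is a minimal Python sketch (not 6502 code) of the bytes you would copy into the buffer before the JSR. The addresses, terminator, and buffer size come from the description above; the function name and the plain-ASCII character encoding are assumptions for illustration only.

```python
SAM_BUFFER_ADDR = 0x9415   # input buffer (bank 1 on the C128), per the post
SAM_ENTRY_ADDR = 0x9A03    # entry point for phonetic input, per the post
TERMINATOR = 0x9B          # end-of-string marker, per the post
BUFFER_SIZE = 256          # buffer length in bytes, per the post


def build_sam_buffer(phonetic_text):
    """Return the bytes to place in the buffer, terminator included.

    Assumption: characters are stored as plain ASCII codes.
    """
    data = phonetic_text.encode("ascii")
    if len(data) > BUFFER_SIZE - 1:
        raise ValueError("at most 255 characters fit before the terminator")
    return data + bytes([TERMINATOR])
```

A 255-character string fills the buffer exactly; anything longer is rejected here, which is a check the caller has to do for itself in ML.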
Well that explains how to USE it, but what I want to discuss is how it WORKS. It is quite sophisticated, which is why I made this forum post (hopefully other hackers can comment?)
PART I - Preprocessing
This is a VERY long discussion about transformation from input (phonetic) text into audio playback. Skip to PART II for details about audio rendering.
SAM makes multiple passes over the input string... so many, in fact, that I am surprised it operates at acceptable speeds on a lowly 8-bit / 1MHz CPU! It does the following things:
1. Initialize (setup SID, save ZP, disable IRQ, and blank VIC)
2. "Crunch" the SAM text into a pair of phoneme codes and stress codes
3. Expand phonemes
4. Apply stress
5. Initialize pitch/duration (per phoneme)
6. Expand "plosives"
7. Validate phoneme string
8. Expand punctuation
9. Speak all phrases in string
Item #1 is constant; items 2 through 9 require parsing the entire input string!
For #2, you should know that most of SAM's phonetic language relies on two characters per phoneme... Step 2 reduces every phoneme to a one-byte "phoneme code" (and also creates a new "stress table"). BONUS! Just for reading this, I give you my table extracted from SAM(64) for your education/amusement (it follows the examples).
An example would probably be helpful... a "space" translates into code $00, "LX" into code $13, and "UM" into code $4f. Although most codes need 2 characters (like LX or UM), some only need 1 character (like P*). Finally, note there are many missing/secret codes (shown by **). These are not available to the user. They are generated internally for "plosive" phonemes (see step #6).
; 'PHONEMES'
; 0 1 2 3 4 5 6 7 8 9 a b c d e f
;00 ' ' '.' '?' ',' '-' IY, IH, EH, AE, AA, AH, AO, UH, AX, IX, ER
;10 UX, OH, RX, LX, WX, YX, WH, R*, L*, W*, Y*, M*, N*, NX, DX, Q*
;20 S*, SH, F*, TH, /H, /X, Z*, ZH, V*, DH, CH, **, J*, **, **, **
;30 EY, AY, OY, AW, OW, UW, B*, **, **, D*, **, **, G*, **, **, GX
;40 **, **, P*, **, **, T*, **, **, K*, **, **, KX, **, **, UL, UM
;50 UN
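The table above can be turned into a lookup. The following is a rough Python sketch of the "crunch" idea, not SAM's actual routine: the greedy "try two characters, then one" matching order and the function names are assumptions, and stress markers are left out entirely.

```python
# Rows transcribed from the table above; "**" marks internal-only codes.
ROWS = {
    0x00: [" ", ".", "?", ",", "-", "IY", "IH", "EH", "AE", "AA", "AH", "AO", "UH", "AX", "IX", "ER"],
    0x10: ["UX", "OH", "RX", "LX", "WX", "YX", "WH", "R*", "L*", "W*", "Y*", "M*", "N*", "NX", "DX", "Q*"],
    0x20: ["S*", "SH", "F*", "TH", "/H", "/X", "Z*", "ZH", "V*", "DH", "CH", "**", "J*", "**", "**", "**"],
    0x30: ["EY", "AY", "OY", "AW", "OW", "UW", "B*", "**", "**", "D*", "**", "**", "G*", "**", "**", "GX"],
    0x40: ["**", "**", "P*", "**", "**", "T*", "**", "**", "K*", "**", "**", "KX", "**", "**", "UL", "UM"],
    0x50: ["UN"],
}


def build_code_table():
    """Map each user-visible phoneme string to its one-byte code."""
    table = {}
    for base, names in ROWS.items():
        for offset, name in enumerate(names):
            if name != "**":
                # "P*" in the table means the one-character phoneme "P".
                table[name.rstrip("*")] = base + offset
    return table


CODES = build_code_table()


def crunch(text):
    """Convert phonetic text to a list of codes (assumed greedy matching)."""
    codes = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in CODES:
            codes.append(CODES[pair])
            i += 2
        elif text[i] in CODES:
            codes.append(CODES[text[i]])
            i += 1
        else:
            raise ValueError("unknown phoneme at position %d" % i)
    return codes
```

With this table, crunch("LX") gives [0x13] and crunch("UM") gives [0x4f], matching the examples above.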
In #3, a few rare phonemes (UL, UM, UN) are expanded from a one-byte code into two codes (each one-byte).
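As a sketch of what step 3 might look like, here is a small Python version. The post does not say which two codes each phoneme becomes; the mapping below (AX followed by L, M, or N) is an assumption, and the function name is made up for illustration. The code values themselves come from the table above.

```python
AX, L, M, N = 0x0D, 0x18, 0x1B, 0x1C   # codes from the phoneme table
UL, UM, UN = 0x4E, 0x4F, 0x50          # the three "rare" phonemes

# Assumed expansions: each rare phoneme becomes AX plus a consonant.
EXPANSIONS = {UL: (AX, L), UM: (AX, M), UN: (AX, N)}


def expand_phonemes(codes):
    """Replace each UL/UM/UN code with its (assumed) two-code expansion."""
    out = []
    for code in codes:
        out.extend(EXPANSIONS.get(code, (code,)))
    return out
```

Note the output can be longer than the input, so in the real program the stress entries have to be shifted along with the codes.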
In #4, any "stress codes" in the input are applied to the new "stress table".
In #5, yet another table is created which I call "pitch", for lack of a better name (it may actually be 1/frequency = time... umm duration?)
At this point, there are 3 tables derived from the input string: phoneme codes, pitch/duration, and stress.
Step #6 scares me... it expands "plosive characters" (such as T, P, K) from 1-byte character into 3-byte "pseudo-phonemes". It scares me because there is no check for buffer overflow! Anyway, just know that "plosive" sounds are 3x more complex than vowel-sounds (A, E, I, O, U, etc.)
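Here is a hedged Python sketch of the plosive expansion, with the bounds check that SAM lacks. The assumptions: each plosive code is followed by the two hidden (**) codes that sit right after it in the table, only the three plosives named above (P, T, K) are handled, and the 256-entry limit and function name are hypothetical.

```python
PLOSIVE_CODES = {0x42, 0x45, 0x48}   # P, T, K from the phoneme table


def expand_plosives(codes, limit=256):
    """Expand each plosive into three codes, refusing to overflow.

    Assumption: the extra codes are code+1 and code+2 (the "**" slots).
    The limit of 256 entries is a placeholder, not a value from SAM.
    """
    out = []
    for code in codes:
        if code in PLOSIVE_CODES:
            out.extend((code, code + 1, code + 2))
        else:
            out.append(code)
        if len(out) > limit:
            raise OverflowError("expanded phoneme string exceeds %d codes" % limit)
    return out
```

A string of 100 P's would expand to 300 codes, which is exactly the kind of input that worries me in the original.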
Step #7 just checks that all "codes" are valid (and it is sloppy = not bullet-proof).
Step #8 just adds delays between phonemes: spaces are ignored, comma(,) and dash(-) delay speech, and special processing is done for period(.) and question(?). The last two (period, question) modify the stress and pitch of preceding phoneme(s).
Now, are you lost? If you are like me, your head is spinning about this time!
Now things get MORE complex!
So now loop through each "phrase". For each phrase we do this:
i. Copy phoneme "code", "pitch", and "stress" to a "phrase buffer"
ii. For each phoneme in the "phrase buffer" do this:
    - Generate 8 tables
    - Oscillate between "phoneme conjunction" and "sample playback":
    a. Conjunction: the current phoneme is compared with next... a "weight" is assigned to the current phoneme
    b. "Playback" converts a byte into 8 PCM audio samples (variable volume)
    c. "Playback" will continue with the next bytes in table (back to step b) if the duration is long enough (more than 8 samples)
    d. Advance to next phoneme (restart at phase "a")
OK, I must confess/apologize if that is not very clear... this is mainly because I do not fully understand SAM's ML code!! Please analyze the executable code and FLAME ME where I am wrong.
If all that didn't blow your mind, also note that in step "ii.b" (playback of PCM codes) SAM will play 1 of 5 different "PCM" tables. I really don't understand how SAM can generate over 50 phonemes from only 5 tables... I assume it has to do with amplitude modulation and/or frequency modulation.
In terms of frequency modulation, SAM could (not saying it does for sure) step through the PCM tables "faster" than 1-byte per sample (for example, if SAM skipped every odd address [2-byte per sample], then it would play at 2x frequency).
In terms of amplitude modulation, SAM definitely sets two different "amplitude" values for each PCM bit. For example, a phoneme may be rendered with "4,10,10,10,4,10,4,10" in one case, but with "4,13,13,13,4,13,4,13" in another case. This is discussed more below.
Part II - Rendering
Once all the pre-processing is done, the SID chip is finally ready to make some sound!
I always thought/assumed that SAM used a combination of SID waveforms + filtering + volume control to speak to us. After analyzing the code, I was shocked to learn that it only uses simple "bit banging" on the SID's volume register, a technique often called "digi" playback in Commodore literature.
I don't think "digi playback" is officially defined for CBM, but usually it involves sending a "random" sequence of 4-bit values to the SID volume register ($d418). This is (my opinion) 4-bit audio. However, SAM does not send "random" 4-bit codes to SID's $d418 register. Instead, it alternates between 2 fixed values while playing an "8-bit" sample...
For example: 4,10,10,10,4,10,4,10
Note this is just an 8-bit pattern (0,1,1,1,0,1,0,1) of two 4-bit values (4 and 10). If you do the math, you will see there are 16 bits involved: an 8-bit pattern, and two 4-bit levels. In other words, 16 bits are used to create 8 samples. This equates to 2 bits/sample (the same as Media Player 128 in simple videos).
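The pattern-plus-two-levels idea can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not SAM's playback loop: the function name is made up, and the bit order (most significant bit first) is an assumption.

```python
def render_pattern(pattern, low, high):
    """Expand one pattern byte into eight 4-bit volume values.

    Each 0 bit selects `low`, each 1 bit selects `high`.
    Assumption: bits are consumed most-significant first.
    """
    if not 0 <= pattern <= 0xFF:
        raise ValueError("pattern must fit in one byte")
    if not (0 <= low <= 15 and 0 <= high <= 15):
        raise ValueError("levels must be 4-bit values")
    return [high if (pattern >> bit) & 1 else low for bit in range(7, -1, -1)]


# 8 pattern bits + two 4-bit levels, spread over 8 samples
BITS_PER_SAMPLE = (8 + 4 + 4) / 8
```

With the pattern 0,1,1,1,0,1,0,1 (that is %01110101) and levels 4 and 10, this yields the sequence from the example above; swapping the high level to 13 yields the louder variant.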
It is important to note that the delay between "binary" 1/0 (might actually be 4/10, see above) is controlled by (fixed) software delays. Because the delays are in software, SAM will sound like a chipmunk if run at 2MHz on the C128. I can only think of two ways to fix this: re-write SAM to use hardware timers (yuck), or continue to use software delays (but adapt for 2MHz).
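To show why the chipmunk effect happens, here is a tiny Python sketch of the arithmetic. The cycle count per sample is a made-up placeholder (I have not measured SAM's loop), the clock values are the nominal 1MHz and 2MHz, and both function names are hypothetical.

```python
def sample_rate_hz(clock_hz, cycles_per_sample):
    """Playback rate when each sample costs a fixed number of CPU cycles."""
    return clock_hz / cycles_per_sample


def scaled_delay(cycles_at_base, clock_hz, base_clock_hz=1_000_000):
    """Cycles a delay must burn at clock_hz to last as long as at the base clock."""
    return cycles_at_base * clock_hz // base_clock_hz


CYCLES_PER_SAMPLE = 100  # placeholder value, not taken from SAM's code
slow = sample_rate_hz(1_000_000, CYCLES_PER_SAMPLE)
fast = sample_rate_hz(2_000_000, CYCLES_PER_SAMPLE)
```

The same loop runs twice as fast at 2MHz, so the second fix amounts to doubling every software delay when the fast clock is active.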
On a related note, the delay between samples can be changed... this is SAM's tone/pitch/frequency. It is a global variable and is easy to change with a simple POKE to $9a0f. (You can also change global speed with a POKE to $9a0e.)
Once again, sorry if that doesn't make a lot of sense... SAM is more complex than I originally thought.... I don't understand it all yet, but hope you hackers have something to say about it. For reference, here is the sam128.bin (9.53 KB), and here are my Sam128-Notes.txt (26.48 KB) on this project.
I hope this gets your creative juices flowing, and as always, appreciate any feedback. Anyway, happy Halloween