LATEST VERSION 1.2 (24.11.2014)

Description

The UTextTool module provides various text-processing capabilities, such as encoding conversion, HTML markup removal, reading and saving files, and creating grammar files.
 
Text Encoding
 
For historical reasons, international text is often encoded using a language- or country-dependent character encoding. With the advent of the internet and the frequent exchange of text across countries (even viewing a web page from a foreign country is a "text exchange" in this context), conversions between these encodings have become important. They have also become a problem, because many characters that are present in one encoding are absent from many others. Unicode was created to solve this problem: it is a super-encoding of all the others and is therefore the default encoding for new text formats such as XML.
The UTextTool module uses the libiconv library. Libiconv provides an implementation of the iconv() function and the iconv program for character set conversion, for use on systems which don't have one or whose implementation cannot convert from/to Unicode. The text encoding function converts a byte sequence from the character encoding fromcode to the character encoding tocode. The libiconv library supports the following encodings, in all combinations.
European languages
ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U, KOI8-RU, CP{1250,1251,1252,1253,1254,1257}, CP{850,866,1131}, Mac{Roman,CentralEurope,Iceland,Croatian,Romania}, Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh
Semitic languages
ISO-8859-{6,8}, CP{1255,1256}, CP862, Mac{Hebrew,Arabic}
Japanese
EUC-JP, SHIFT_JIS, CP932, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-1
Chinese
EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT
Korean
EUC-KR, CP949, ISO-2022-KR, JOHAB
Armenian
ARMSCII-8
Georgian
Georgian-Academy, Georgian-PS
Tajik
KOI8-T
Kazakh
PT154, RK1048
Thai
TIS-620, CP874, MacThai
Laotian
MuleLao-1, CP1133
Vietnamese
VISCII, TCVN, CP1258
Platform specifics
HP-ROMAN8, NEXTSTEP
Full Unicode
UTF-8, UCS-2, UCS-2BE, UCS-2LE, UCS-4, UCS-4BE, UCS-4LE, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-7, C99, JAVA
Full Unicode, in terms of uint16_t or uint32_t (with machine dependent endianness and alignment)
UCS-2-INTERNAL, UCS-4-INTERNAL
Locale dependent, in terms of char or wchar_t (with machine dependent endianness and alignment, and with semantics depending on the OS and the current LC_CTYPE locale facet)
char, wchar_t
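
As a rough illustration outside Urbi, the fromcode/tocode conversion described above can be sketched in Python, whose built-in codecs cover many of the encodings listed. This is not UTextTool's or libiconv's implementation, and the helper name convert_encoding is hypothetical. It only shows the idea of decoding bytes from one encoding and re-encoding them in another.

```python
def convert_encoding(data: bytes, from_encoding: str, to_encoding: str) -> bytes:
    """Re-encode a byte sequence from one character encoding to another.

    Hypothetical helper for illustration; raises UnicodeError if a byte
    sequence is invalid or a character is missing from the target encoding.
    """
    text = data.decode(from_encoding)   # bytes -> Unicode text
    return text.encode(to_encoding)     # Unicode text -> bytes in the target encoding


utf8_bytes = "Małgorzata Kożuchowska".encode("utf-8")
latin2_bytes = convert_encoding(utf8_bytes, "utf-8", "iso-8859-2")
print(latin2_bytes)  # b'Ma\xb3gorzata Ko\xbfuchowska'
```

The two-byte UTF-8 sequences for the Polish letters become single bytes in ISO-8859-2.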
 
Creating simple grammar files
 
UTextTool can also create simple grammar files (grxml). This is very useful for modules that use the Microsoft Speech Platform for speech recognition. The Microsoft Speech Platform SDK provides programmatic processes for authoring speech recognition grammars and also supports XML-format grammars authored in compliance with industry standards. Grammars are at the core of speech recognition and are perhaps the most important component under the control of the speech application developer that affects recognition accuracy. Grammars work in conjunction with the speech recognition engine and its lexicons and speech models to define the factors that affect speech recognition performance.
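
To give a feel for what a simple phrase-list grammar looks like, the following Python sketch builds a minimal XML grammar in the W3C SRGS format, which is the standard behind grxml files. The exact content written by UTextTool is not documented here, so the rule name "phrases", the chosen attributes, and the helper build_simple_grxml are assumptions made for illustration only.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace defined by the W3C Speech Recognition Grammar Specification.
SRGS_NS = "http://www.w3.org/2001/06/grammar"


def build_simple_grxml(phrases, lang):
    """Return a minimal SRGS XML grammar that accepts any one of the phrases.

    Hypothetical helper; the rule id "phrases" is an arbitrary choice.
    """
    items = "\n".join(f"      <item>{escape(p)}</item>" for p in phrases)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<grammar version="1.0" xml:lang={quoteattr(lang)} root="phrases"\n'
        f'         xmlns="{SRGS_NS}">\n'
        '  <rule id="phrases" scope="public">\n'
        '    <one-of>\n'
        f'{items}\n'
        '    </one-of>\n'
        '  </rule>\n'
        '</grammar>\n'
    )


print(build_simple_grxml(["red color", "green color", "blue color"], "en-US"))
```

Each phrase becomes one item inside a single one-of alternative, so the recognizer matches exactly one of the listed phrases.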
 

Requirements

UTextTool was compiled against the shared libiconv library (included in the utexttool1x.zip package). Copy the library to the uobjects folder or add its location to the PATH system environment variable.
  • libiconv.dll

Module functions

UTextTool.new() - create an instance of the module,
UTextTool.encodingMode - set the encoding mode: "", "translit" or "ignore".
When the "" (empty) mode is set, the encoding function reports an error if a character cannot be represented in the target encoding.
When the "translit" mode is set, a character that cannot be represented in the target character set may be approximated by one or several characters that look similar to the original.
When the "ignore" mode is set, characters that cannot be represented in the target character set are silently discarded.
"encoded_text" = TextTool.ConvertEncoding("text","from_encoding","to_encoding"); - convert text from one encoding to another,
"cleaned_text" = TextTool.RemoveHTML("text"); - remove HTML markup,
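
The difference between the three encoding modes can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: Python has no built-in equivalent of iconv's transliteration, so the "translit" branch below merely strips accents via Unicode decomposition and falls back to "?" for anything else. The function name convert is hypothetical and this is not how UTextTool itself is implemented.

```python
import unicodedata


def convert(text, to_encoding, mode=""):
    """Encode text, handling unrepresentable characters according to mode.

    mode "":         raise an error (UnicodeEncodeError)
    mode "ignore":   silently drop the character
    mode "translit": approximate it (here: strip accents, else use "?")
    """
    if mode == "ignore":
        return text.encode(to_encoding, errors="ignore")
    if mode == "translit":
        out = bytearray()
        for ch in text:
            try:
                out += ch.encode(to_encoding)
            except UnicodeEncodeError:
                # Decompose the character and keep only its base letters.
                decomposed = unicodedata.normalize("NFKD", ch)
                approx = "".join(c for c in decomposed if not unicodedata.combining(c))
                out += approx.encode(to_encoding, errors="replace")
        return bytes(out)
    return text.encode(to_encoding)  # strict: errors are reported


for mode in ("ignore", "translit"):
    print(mode, convert("café", "ascii", mode))
```

With the empty mode, the same call raises an error because "é" has no ASCII representation.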

UTextTool.SaveToFile("text...","file.txt") - save text to a file,
["line1","line2",...] = UTextTool.ReadFromFile("file.txt") - read a file as a list of lines,

UTextTool.CreateSimpleGrxml(["phrase1","phrase2",...],"lang","file.grxml") - create a simple grammar file,

UTextTool.DayOfTheWeek("2014-11-28") - get the day of the week from an Urbi date.
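
The day-of-week lookup can be sketched in Python as follows. The return format of UTextTool.DayOfTheWeek is not stated above, so returning an English day name is an assumption, and day_of_the_week is a hypothetical helper rather than the module's code.

```python
from datetime import date

# Fixed English names, so the result does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def day_of_the_week(date_string):
    """Return the English day name for a date given as YYYY-MM-DD."""
    d = date.fromisoformat(date_string)
    return DAY_NAMES[d.weekday()]  # weekday(): Monday is 0, Sunday is 6


print(day_of_the_week("2014-11-28"))  # Friday
```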

Urbiscript example

// Load the module and create a globally accessible instance.
loadModule("UTextTool");
var Global.TextTool = UTextTool.new();
// Approximate characters that the target encoding cannot represent.
TextTool.encodingMode = "translit";
// Convert a UTF-8 string to ISO-8859-2.
TextTool.ConvertEncoding("Na ok\xc5\x82adce magazynu znajdzie si\xc4\x99 Ma\xc5\x82gorzata Ko\xc5\xbcuchowska.","utf-8","iso-8859-2");
[0001443532]"Na ok\xb3adce magazynu znajdzie si\xea Ma\xb3gorzata Ko\xbfuchowska."
// Strip the HTML markup from a string.
TextTool.RemoveHTML("<h1 align=center>ICONV_OPEN</h1><a href=\"#NAME\">NAME</a><br><a href=\"#SEE ALSO\">SEE ALSO</a><br>");
[0000186900]"ICONV_OPENNAME\r\nSEE ALSO\r\n"
// Save text to a file, then create a grammar file from three phrases.
TextTool.SaveToFile("Hello world!","text.txt");
TextTool.CreateSimpleGrxml(["red color","green color","blue color"],"en-EN","test.grxml");
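
The RemoveHTML output in the example above suggests that tags are dropped and each <br> becomes a "\r\n" line break. A minimal Python sketch of that behavior, using the standard html.parser module, could look like this. The class TagStripper and the function remove_html are hypothetical names, and the br rule is inferred from the single example output rather than from the module's source.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect text content, turning <br> into a CRLF line break."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: only <br> produces output; all other tags are dropped.
        if tag == "br":
            self.parts.append("\r\n")

    def handle_data(self, data):
        self.parts.append(data)


def remove_html(markup):
    stripper = TagStripper()
    stripper.feed(markup)
    stripper.close()  # flush any text still buffered by the parser
    return "".join(stripper.parts)


html = ('<h1 align=center>ICONV_OPEN</h1><a href="#NAME">NAME</a><br>'
        '<a href="#SEE ALSO">SEE ALSO</a><br>')
print(repr(remove_html(html)))  # 'ICONV_OPENNAME\r\nSEE ALSO\r\n'
```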

Download

LINK

 

 

EMYS and FLASH are Open Source and distributed according to the GPL v2.0 © Rev. 0.8.0, 27.04.2016
