R internally allows strings to be represented in the current native encoding, in UTF-8 and in Latin 1. When interacting with the operating system or external libraries, all these representations have to be converted to native encoding. On Linux and macOS today this is not a problem, because the native encoding is UTF-8, so all Unicode characters are supported. On Windows, the native encoding cannot be UTF-8 nor any other that could represent all Unicode characters.
Windows sometimes replaces characters by similar-looking representable ones (“best fit”), which often works well but sometimes has surprising results, e.g. the Greek letter alpha becomes the letter “a”. In other cases, Windows may substitute non-representable characters by question marks or other characters, and R may substitute by \UXXXXXXXX or other escapes. A number of functions accessing the OS consequently have complicated semantics and implementation on Windows. For example,
normalizePath for a valid path tries to also return a valid path, which is a path to the same file. In a naive implementation, the normalized path could be non-existent or point to a different file due to best fit, even when the original path is perfectly representable and valid.
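A minimal sketch of the contract described above, using a temporary directory as an example path: for an existing path, normalizePath() returns a valid path to the same file.

```r
# For an existing path, normalizePath() must return another valid path
# that still refers to the same file or directory.
d <- tempdir()
p <- normalizePath(d, winslash = "/", mustWork = TRUE)
file.exists(p)   # TRUE: the normalized path still points to the same place
```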
This limitation of R on Windows is a source of pain for users who need to work with characters not representable in their native encoding. R provides “shortcuts” that sometimes bypass the conversion, e.g. when reading UTF-8 text files via readLines, but these work only for selected cases, when external software is not involved, and their use is difficult.
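The readLines shortcut can be sketched as follows; the encoding argument declares the file's encoding and keeps the resulting strings flagged as UTF-8 instead of converting them to the native encoding (a temporary file is used here for illustration):

```r
# Write a small UTF-8 text file for the demonstration.
f <- tempfile(fileext = ".txt")
con <- file(f, open = "w", encoding = "UTF-8")
writeLines("\u011b\u0161\u010d\u0159\u017e", con)  # "ěščřž"
close(con)

# encoding = "UTF-8" declares (does not convert) the encoding of the input,
# so the strings stay UTF-8 inside R even on a non-UTF-8 Windows locale.
x <- readLines(f, encoding = "UTF-8")
Encoding(x)  # "UTF-8"
```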
Finally, the latest Windows 10 allows setting UTF-8 as the native encoding. R has been modified to allow this setting, which wasn’t hard as this has been supported on Unix/macOS for years.
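One way to check the native encoding from within R is l10n_info(); on a UTF-8 build (or on Unix/macOS) it reports UTF-8 as the current native encoding:

```r
# l10n_info() reports, among other things, whether the current native
# encoding is UTF-8 or Latin-1, and (on Windows) the code page.
info <- l10n_info()
str(info)
# On a UTF-8 locale, info$`UTF-8` is TRUE.
```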
The bad news is that the UTF-8 support on Windows requires Universal C Runtime (UCRT), a new C runtime. We need a new compiler toolchain and have to rebuild all external libraries for R and packages: no object files built using the older toolchains (RTools 4 and older) can be re-used.
UCRT can be installed on older versions of Windows, but UTF-8 support will only work on Windows 10 (November 2019 update) and newer.
The rest of this text explains in more detail what native UTF-8 support would bring to Windows users of R. This text simplifies away a number of details in order to be accessible to R users who are not package developers. An additional text for package developers and maintainers of infrastructures to build R on Windows is provided here, with details on how to build R using different infrastructures and how to build R with UCRT.
A demo binary installer for R and recommended packages is available (a link appears later in this text), as well as a demo toolchain, which has a number of libraries and headers for many but not all CRAN/BIOC packages.
Implications for RGui
RGui (RStudio is similar as it uses the same interface to R) is a Windows-only application implemented using the Windows API and UTF-16LE. In R 4.0 and earlier, RGui can already work with all Unicode characters.
RGui can print UTF-8 R strings. When running with RGui, R escapes UTF-8 strings and embeds them into strings otherwise in native encoding on output. RGui understands this proprietary encoding and converts to UTF-16LE before printing. This is intended to be used in all outputs R produces for RGui, but the approach has its limits: it becomes complicated when formatting output at a point where R does not yet know where it will be printed. Many corner cases have been fixed, some recently, but likely some remain.
RGui can already pass Unicode strings to R. This is implemented by another semi-proprietary embedding: RGui converts UTF-16LE strings to the native encoding, but replaces the non-representable characters by \u escapes that are understood by the parser. The parser will then turn these into R UTF-8 strings. This means that non-representable characters can be used only where \u escapes are allowed by R, which includes R string literals, where it is most important, but such characters cannot be used even in comments.
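A small sketch of that distinction: \u escapes are resolved by the parser inside string literals, so a non-representable character can be entered this way, while comments have no escape mechanism at all.

```r
# Inside a string literal, the parser turns the escape into a UTF-8 string:
x <- "\u011b"        # the Czech character "ě"
Encoding(x)          # "UTF-8"
nchar(x)             # 1: a single character, not the six characters "\u011b"
# There is no analogous escape inside comments, so a non-representable
# character simply cannot survive in a comment.
```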
As a side note here, I believe that to keep international collaboration on software development possible, all code should be in ASCII, definitely all symbols, and I would say even in English, including comments. But still, R is also used interactively, and this is a technical limitation, not an intentionally enforced requirement.
For example, one can paste these Czech characters into RGui: ěščřžýáíé. On Windows running in an English locale, this works fine. But a comment is already a problem: some characters are fine, some are not.
In the experimental build of R, UTF-8 is the native encoding, so RGui will not use any \u escapes when sending text to R, and R will not embed any UTF-8 strings, because the native encoding is already UTF-8. The example above then works fine.
UTF-8 is selected automatically as the encoding for the current locale in the experimental build.
Note that RGui still needs to use fonts that can correctly represent the characters. Similarly, not all fonts are expected to correctly display examples in this text.
Implications for RTerm
RTerm is a Windows application not using Unicode; like most of R, it is implemented using the standard C library, assuming that the encoding-specific operations will work according to the C locale. In R 4.0 and earlier, RTerm cannot handle non-representable characters.
We cannot even paste non-representable characters into R; they will be converted automatically to the native encoding. Pasting “ěščřžýáíé” results in the diacritics being stripped. For Czech text on Windows running in an English locale, this is not so bad (only some diacritic marks are removed), but it is still not the exact representation. For Asian languages on Windows running in an English locale, the result is unusable.
In principle, we can use \u sequences manually to input strings, but they still cannot be printed correctly. The output shows that the string is correct inside R; it just cannot be printed on RTerm.
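One can verify that a string entered via \u escapes is stored correctly inside R even when the terminal cannot display it, e.g. by inspecting its code points:

```r
x <- "\u011b"     # "ě", entered via an escape sequence
utf8ToInt(x)      # 283 (0x11B): the correct Unicode code point is stored,
                  # even if RTerm renders the character incorrectly
```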
In the experimental build of R, if we run cmd.exe and then change the code page to UTF-8 via “chcp 65001” before running RTerm, this works as it should.
This text does not go into details about where exactly the characters get converted or best-fitted, but the key point is that with the UTF-8 build, and when running cmd.exe in the UTF-8 code page (65001), RTerm works with Unicode characters without any modification of its code.
As with RGui, the terminal also needs appropriate fonts. Consider the same example with a Japanese text.
This example works fine with the experimental build on my system, but with the default font (Consolas), the characters are replaced by a question mark in a square. Still, just by switching to another font, e.g. FangSong, in the cmd.exe menu, the characters appear correctly even in already printed text. The characters will also be correct when one pastes them into an application that uses the right font.
Implications for interaction with the OS
R on Windows already uses the Windows API in many cases instead of the standard C library, to avoid the conversion or to get access to Windows-specific functionality. More specifically, R tries to always do this when passing strings to the OS; e.g. creating a file with a non-representable name already works: R converts UTF-8 strings to UTF-16LE, which Windows understands. However, R packages or external libraries often do not have such Windows-specific code and are not able to do that. With the experimental build, these problems disappear, because the standard C functions, which in turn usually call the non-Unicode Windows API, will use UTF-8.
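For example, creating a file whose name falls outside a typical Windows native code page already works in current R, because R hands the name to Windows as UTF-16LE (sketched here with a Greek file name in a temporary directory):

```r
# "αβ.txt": representable in UTF-8/UTF-16LE, but not in most Windows
# native code pages.
f <- file.path(tempdir(), "\u03b1\u03b2.txt")
file.create(f)
ok <- file.exists(f)   # TRUE even on a non-UTF-8 Windows locale
unlink(f)              # clean up
```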
A different situation is when getting strings from the operating system, for example when listing files in a directory. R on Windows in such cases uses the C, non-Unicode API, or converts to the native encoding, unless this is a direct transformation of inputs that are already UTF-8. Please see the R documentation for details; this text provides a simplification of the technical details.
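The other direction can be sketched as follows: when listing a directory, the names come back from the OS, and with UTF-8 as the native encoding they survive unchanged (a temporary directory is used here for illustration):

```r
# Create a directory containing a file with a non-ASCII name...
d <- file.path(tempdir(), "utf8-demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, "\u011b.txt"))   # "ě.txt"

# ...and read the names back from the OS.
fn <- list.files(d)
fn   # contains "ě.txt" when the native encoding can represent it
```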
In principle, R could also have used Windows-specific UTF-16LE calls and converted the strings to UTF-8, which R can represent. It would not be that much more work, given how much effort has been spent on the functions passing strings to Windows.
However, R has been careful not to introduce UTF-8 strings for things the user has not already intentionally made UTF-8, because of problems that this would cause for packages not handling encodings correctly. Such packages would mysteriously start failing when incorrectly using strings in UTF-8 while thinking they were in native encoding. Such problems will not be found by automated testing, because tests don’t use such unusual inputs and are often run in English or similar locales.
This precaution came at the price of increased complexity. For example, the normalizePath implementation could be half the code size or less if we allowed introducing UTF-8 strings. R instead normalizes “less”, e.g. it does not follow a symlink even when that would help, but it produces a representable path name for one that is in the native encoding.
With UTF-8 as the native encoding, these considerations are no longer needed. Listing files in a directory with non-representable names is no longer an issue (when they are valid Unicode), and it works in the experimental build without any code change.
Another issue is with external libraries that started solving this problem their own way, long before Windows 10. Some libraries bypass any external code and the C library for strings, and perform string operations in UTF-8 or UTF-16LE, sometimes with the help of external libraries, typically ICU.
When R interacts with such libraries, it needs to know which encoding those libraries expect, and that sometimes changes from native encoding to UTF-8 as the libraries evolve. For example, Cairo switched to UTF-8, so R had to notice, and had to convert strings to UTF-8 for newer Cairo versions but to native encoding for older versions.
Such a change is sometimes hard to notice, because the type remains the same, char *. Also, handling these situations increases code complexity. One has to read the change logs for external libraries carefully, otherwise one runs into bugs that are hard to debug and almost impossible to detect by tests, as tests don’t use unusual characters. Such transitions of external libraries will no longer be an issue with UTF-8 as the native encoding.
Implications for internal functionality
R allows multiple encodings of strings in R character objects, with a flag saying whether a string is UTF-8, Latin 1 or native. But eventually, strings have to be converted to char * when interacting with the C library, the operating system and other external libraries, or with external code incorporated into R.
Historically, the assumption was that once typed char *, the strings are always in one encoding, which then needs to be the native encoding. This makes a lot of sense, as otherwise maintaining the code becomes difficult, but R made a number of exceptions, e.g. for the shortcut in readLines, and sometimes it helps to keep strings longer as R character objects. Still, sometimes the conversion to native encoding is done just to have a char * representation of the string, even though not yet interacting with C/external code. All these conversions disappear when UTF-8 becomes the native encoding.
One related example is R symbols. They need to have a unique representation defined by a pointer stored in the R symbol table. For any efficient implementation, they need to be in the same encoding, which is now the native encoding. A logical improvement would be converting to UTF-8 instead, but that would have a potentially non-trivial performance overhead. These concerns are no longer necessary when UTF-8 becomes the native encoding.
In R 4.0, this limitation has an undesirable impact on hash maps:
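The example here is reconstructed as a minimal sketch (the exact original code is assumed): an R environment used as a hash map, with one ASCII key and one Greek key.

```r
h <- new.env(hash = TRUE)    # an environment used as a hash map
h[["a"]] <- 1
h[["\u03b1"]] <- 2           # the Greek letter alpha as a key
length(ls(h))                # 2 on Unix/macOS and on the UTF-8 build
```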
On Windows, this produces a hash map with just a single element named “a”, because “α” gets best-fitted by Windows to the letter “a”. With the experimental build, this works fine as it does on Unix/macOS, adding two elements to the hash map. Even though using non-ASCII variable names is probably not the right thing to do, a hash map really should be able to support arbitrary Unicode keys.
The experimental build of R can be downloaded from here. It has base and recommended packages, but will not work with other packages that use native code. The experimental toolchain allows testing more packages (but not all CRAN/BIOC); more information is available here and may be updated without notice (there is always the SVN history of it). Not for production use.