Sometimes I need to tweak one or two of my extension methods. Either the needs of the project at hand change or somebody suggests a better way to do something. In light of this, here is my latest collection of String extension methods. For the most part, they are simple one-liners with a few statements that allow string manipulation operations that are typically required when parsing fully-qualified file paths, text files, data files and code files.
Let us start with two of the most popular string manipulation methods: converting a string into an array and reversing a string.
toArray converts the string into an array object, providing indexed access to its characters. Yes, there is the charAt function, but there are advantages to turning a string into an array: it can save some typing, it improves code readability, and it allows the use of intrinsic (and extension) array methods, such as the reverse method, next.
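A minimal sketch of what toArray might look like, assuming it is a thin wrapper over split with an empty separator (only the method name comes from the text; the body is an assumption):

```javascript
// Sketch only: assumes toArray simply splits on the empty string.
String.prototype.toArray = function () {
  // One array element per UTF-16 code unit of the string.
  return this.split("");
};

var chars = "hello".toArray();
console.log(chars[1]);      // "e"
console.log(chars.length);  // 5
```

Once the characters are in an array, every Array method (reverse, slice, join and so on) becomes available.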
reverse turns the string into an array, calls the Array object's reverse method and glues the result back together with join. Think about getting file names and extensions from fully-qualified paths.
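As a rough sketch (the body and the sample path are assumptions), reverse could be written and applied to a path like this:

```javascript
// Sketch only: split into characters, reverse the array, join it back.
String.prototype.reverse = function () {
  return this.split("").reverse().join("");
};

// Hypothetical path used for illustration.
var path = "C:\\temp\\report.final.txt";

// Reversing puts the extension first, so the first "." ends it.
var extension = path.reverse().split(".")[0].reverse();
// The same trick with the path separator yields the file name.
var fileName = path.reverse().split("\\")[0].reverse();

console.log(extension); // "txt"
console.log(fileName);  // "report.final.txt"
```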
Moving on, the words method returns an array using white space as the split separator. Do notice that punctuation marks become part of a "word".
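A possible shape for words, assuming a split on runs of white space (the body is an assumption, and it does not strip leading white space):

```javascript
// Sketch only: split on one or more white space characters.
String.prototype.words = function () {
  return this.split(/\s+/);
};

var parts = "Hello, world!  Bye.".words();
console.log(parts); // ["Hello,", "world!", "Bye."]
```

The comma, the exclamation mark and the period stay attached to their "words", as noted above.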
In turn, compact converts the words array into a string separated by a single space character, effectively "compacting" white space found in the string.
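A sketch of compact under the same assumption (split on white space runs, then join with one space); the split is inlined here so the block stands alone:

```javascript
// Sketch only: collapse every run of white space into a single space.
String.prototype.compact = function () {
  return this.split(/\s+/).join(" ");
};

console.log("a   b\t\tc\n d".compact()); // "a b c d"
```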
As a side note, in the engine I tested, using /\s/ as the split's argument is the same as using /\s+/: the '+' seems to be implied when using a regular expression, ergo two or more contiguous separators are considered a single separator. If you use a string as the argument for the split, though, two or more contiguous separators are considered two or more separators, and "holes" (empty strings) are created in the returned array. This behavior is engine-specific: a standards-conforming engine leaves the holes for /\s/ as well, and only /\s+/ collapses contiguous separators.
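A quick way to check how a given engine treats the three separators. The results in the comments are those of a standards-conforming engine (for instance a current Node.js), so they may differ from the behavior described above:

```javascript
var s = "a  b"; // two spaces between the letters

var byString = s.split(" ");      // string separator
var byRegex = s.split(/\s/);      // regular expression, no quantifier
var byRegexPlus = s.split(/\s+/); // regular expression with '+'

console.log(byString);    // ["a", "", "b"]  (a hole between the two spaces)
console.log(byRegex);     // ["a", "", "b"]  on a conforming engine
console.log(byRegexPlus); // ["a", "b"]
```

Writing the '+' explicitly gives the same result on every engine, which is one more reason to keep it.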
Personally, I would rather have both /a/ and "a" return the holes and /a+/ eat contiguous a's. I will keep using the plus sign as a matter of rigor and make sure I tell someone at the ECMA about this, but let us get back on track with some lighter matters.
Sometimes it might be too costly to turn the string into an array, e.g., when all we want is the first or last character. Hence fChar and lChar, which are just handy shortcuts to the charAt function.
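These two shortcuts could be as small as the following sketch (bodies assumed from the description):

```javascript
// Sketch only: first and last character via charAt.
String.prototype.fChar = function () {
  return this.charAt(0);
};

String.prototype.lChar = function () {
  // charAt returns "" for an out-of-range index, so "" is safe here.
  return this.charAt(this.length - 1);
};

console.log("hello".fChar()); // "h"
console.log("hello".lChar()); // "o"
```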
The times method replicates a string a given number of times. There have been questions about the performance of this method with large strings. I have put it to the test, and the results are in: replicating a 1 MB file about 700 times (the most I could handle before running out of memory) takes about 0.7 seconds on average. Some other processing and outer loops may cost you some performance when calling this function, but unless you are parsing an entire directory tree with thousands of files looking for things to replicate a thousand times, the times extension method should do fine!
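One common way to write such a method is the Array join idiom sketched below. This is an assumption about the approach, not the implementation that was measured, and the argument clamping is added for illustration:

```javascript
// Sketch only: joining an array of n + 1 empty slots with the string
// as separator produces the string n times.
String.prototype.times = function (n) {
  // Clamp to a non-negative integer so new Array never throws.
  var count = Math.max(0, Math.floor(n) || 0);
  return new Array(count + 1).join(this);
};

console.log("ab".times(3)); // "ababab"
console.log("-".times(0));  // ""
```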
Next, the box method wraps a string and every newline character found in it with the passed-in string (or the default '|' symbol). I use it to detect white space when testing and debugging my scripts.
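A sketch of box, assuming the delimiter is placed at both ends of the string and on both sides of every line break (the parameter name edge is hypothetical):

```javascript
// Sketch only: wrap the string and each newline with a visible delimiter.
String.prototype.box = function (edge) {
  if (edge === undefined) edge = "|";
  // A replacer function avoids "$" in edge being read as a replacement pattern.
  var boxed = this.replace(/\r?\n/g, function (newline) {
    return edge + newline + edge;
  });
  return edge + boxed + edge;
};

console.log(" a \nb ".box());
// | a |
// |b |
```

Leading and trailing white space shows up immediately between the delimiters.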
l[eft]trim and r[ight]trim build a regular expression object with the passed-in string and flags. The string has a default value of "\s+" (white space), and the flags default to none (only 'g', 'm' and 'i', or any combination thereof, are allowed, but this is not checked). With no arguments, ltrim and rtrim eliminate both horizontal and vertical white space at the beginning and the end of a string, respectively.
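A minimal sketch of the two methods as described, assuming the pattern is anchored with ^ or $ and wrapped in a non-capturing group (the parameter names what and flags are hypothetical):

```javascript
// Sketch only: build the expression from a pattern string and flags.
String.prototype.ltrim = function (what, flags) {
  var pattern = "^(?:" + (what || "\\s+") + ")";
  return this.replace(new RegExp(pattern, flags || ""), "");
};

String.prototype.rtrim = function (what, flags) {
  var pattern = "(?:" + (what || "\\s+") + ")$";
  return this.replace(new RegExp(pattern, flags || ""), "");
};

console.log(JSON.stringify("\n  hi  \n".ltrim())); // "hi  \n"
console.log(JSON.stringify("\n  hi  \n".rtrim())); // "\n  hi"
console.log("..hi".ltrim("[.,]+"));                // "hi"
```

Note the doubled backslash in "\\s+": the pattern travels as a string literal before it becomes a regular expression.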
m[ultiline]trim (or m[iddle]trim if you will) is a trim variant that trims both horizontal and vertical white space by default, by forcing the global and multiline flags to ltrim and rtrim. This way it effectively eliminates empty lines in a multiline string, as shown below in hello3. This may be important in some scenarios, e.g., when parsing a code file: empty lines can be eliminated to save some memory or skipped in a loop to improve efficiency.
On the other hand, you can pass in a string or a regular expression (making sure you double-escape the backslash character). In the example, hello4 passes in a request for trimming horizontal white space only, effectively preserving empty lines. This may be important in some other scenarios, e.g., when highlighting the syntax of a code file: empty lines are important visual cues. hello5 trims some punctuation marks and parentheses along with white space for each line.
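The three cases can be sketched as follows. The sample strings and variable names are assumptions that play the roles of hello3, hello4 and hello5, and the compact ltrim and rtrim are repeated so the block runs on its own:

```javascript
// Sketch only: anchored trims built from a pattern string and flags.
String.prototype.ltrim = function (what, flags) {
  return this.replace(new RegExp("^(?:" + (what || "\\s+") + ")", flags || ""), "");
};
String.prototype.rtrim = function (what, flags) {
  return this.replace(new RegExp("(?:" + (what || "\\s+") + ")$", flags || ""), "");
};

// mtrim forces the global and multiline flags, so ^ and $ match per line.
String.prototype.mtrim = function (what) {
  return this.ltrim(what, "gm").rtrim(what, "gm");
};

var text = "  a  \n\n  b  ";

var noEmptyLines = text.mtrim();             // default: all white space
var keptEmptyLines = text.mtrim("[ \\t]+");  // horizontal white space only
var noPunctuation = "(a),\n  (b). ".mtrim("[\\s.,;()]+");

console.log(JSON.stringify(noEmptyLines));   // "a\nb"
console.log(JSON.stringify(keptEmptyLines)); // "a\n\nb"
console.log(JSON.stringify(noPunctuation));  // "a\nb"
```

With the default pattern, the newline of an empty line is itself leading white space on the next line, which is why empty lines disappear.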
Finally, trim simply calls both ltrim and rtrim with the default flags and passes on what to trim.
So far we have a character split and a "word" split; how about a line split? Sometimes you need to process a line of text at a time, e.g. when the operation on a line depends on the state of the previous line.
Unfortunately, as explained in the side note about the split method, splitting both hello3 and hello4 above with /\n/ results in all contiguous newline characters becoming a single separator. This is sometimes desirable, so the method takes an argument that will compress vertical white space, but it defaults to preserving it with a negative lookahead expression.
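A sketch of a line split with an optional compress flag. It targets a standards-conforming engine, where a plain /\r?\n/ already preserves empty lines, so it does not reproduce the lookahead workaround; the method name lines and the parameter name compress are assumptions:

```javascript
// Sketch only: split into lines, optionally collapsing empty ones.
String.prototype.lines = function (compress) {
  // \r? makes the carriage return of Windows line endings optional.
  return compress ? this.split(/(?:\r?\n)+/) : this.split(/\r?\n/);
};

var text = "a\r\n\nb";

console.log(text.lines().join("|"));     // "a||b"  (empty line preserved)
console.log(text.lines(true).join("|")); // "a|b"   (vertical space compressed)
```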
Notice that the Array join method is used above for illustration purposes only. Also, I have added provisions for carriage return characters.
If you are still with me, you can probably see how a white space replace (such as the one in trim) and a split/join could be used to achieve the same results (in fact, compact was my original multiline trim). The replace approach should perform slightly better, though not noticeably so in most applications, but in general, I like to avoid using split unless it is absolutely necessary.
As an aside, I myself cannot find a good reason to preserve trailing white space in a line or before the EOF; I seriously believe that banning it (or at least pre-processing it) would give the Internet a boost. Think of all the trailing white space that gets emailed each day.
Text parsing and formatting (a.k.a. beautifying) usually require some indentation methods. The indentation method returns the number of tabs at the beginning of the string: it matches the leading tabs with a capturing group and returns the length of RegExp.$1, which gets updated by the match method call.
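A sketch of indentation that reads the capture from the array returned by match rather than from the legacy RegExp.$1 property (a deliberate substitution; the result is the same):

```javascript
// Sketch only: count the tabs at the start of the string.
String.prototype.indentation = function () {
  // (\t*) always matches, possibly with an empty capture.
  var found = this.match(/^(\t*)/);
  return found[1].length;
};

console.log("\t\tfoo".indentation()); // 2
console.log("foo".indentation());     // 0
```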
The unindent method simply wraps a left trim with no arguments, but this is one of those occasions in which defining a method improves code readability.
The indent method returns the unindented string indented by a passed-in number of tabs.
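Both methods might look like the sketch below. The left trim is inlined as a plain replace so the block runs on its own, and treating a missing argument as zero tabs is an assumption:

```javascript
// Sketch only: strip leading white space.
String.prototype.unindent = function () {
  return this.replace(/^\s+/, "");
};

// Sketch only: re-indent with n tabs (n defaults to 0 here).
String.prototype.indent = function (n) {
  var tabs = new Array((n || 0) + 1).join("\t");
  return tabs + this.unindent();
};

console.log(JSON.stringify("  \tfoo".unindent())); // "foo"
console.log(JSON.stringify("  \tfoo".indent(2)));  // "\t\tfoo"
```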
Incidentally, you can define methods to be synonyms of each other using this syntax instead:
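The synonym presumably amounts to assigning one prototype property to another, as in this sketch (the ltrim body is an assumed stand-in so the block runs on its own):

```javascript
// Sketch only: an assumed ltrim built from a pattern string and flags.
String.prototype.ltrim = function (what, flags) {
  return this.replace(new RegExp("^(?:" + (what || "\\s+") + ")", flags || ""), "");
};

// Both names now refer to the very same function object.
String.prototype.unindent = String.prototype.ltrim;

console.log("\t\tfoo".unindent());           // "foo"
console.log("This".unindent("[th]", "i"));   // "his"
```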
But then again calling unindent with arguments (e.g., unindent("[th]", "i")) can deteriorate code readability.
Finally, syntax highlighting involves tagging keywords and blocks of code (such as string literals or numbers) so that a rendering engine (such as HTML) can change the formatting of such keywords and blocks. A problem usually arises when the input code uses the highlighter's tag characters. In such cases the tag characters must be "escaped". Another problem arises when you literally need to read an escaped tag, in which case you must first "unescape" the input:
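A sketch for the HTML case, assuming the tag characters are &, < and > and that they map to the usual entities; the method names escapeTags and unescapeTags are hypothetical:

```javascript
// Sketch only: "&" must be escaped first so the entities added
// afterwards are not escaped a second time.
String.prototype.escapeTags = function () {
  return this.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
};

// Reverse order: "&amp;" is restored last, so "&amp;lt;" becomes "&lt;".
String.prototype.unescapeTags = function () {
  return this.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&amp;/g, "&");
};

var code = "if (a < b && c > d)";
var escaped = code.escapeTags();

console.log(escaped);                // "if (a &lt; b &amp;&amp; c &gt; d)"
console.log(escaped.unescapeTags()); // "if (a < b && c > d)"
```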
So there you have it. These methods, along with my array object extension methods, are essential to the file and directory parsing operations that I have to deal with on a day-to-day basis, whether it is cleaning up temporary files, migrating data files, parsing code files, crawling and scraping web pages, highlighting text found in files, smart indenting, justifying and tabifying scripts... you name it. Some of the methods have been with me since the last millennium, some have been revamped as recently as yesterday, all of them are very useful to me. I hope you find them useful as well!