June 09, 2012

Using String.split(String) vs. Using a StringTokenizer

Scenario 1:
Scanner is designed for cases where you need to parse a string, pulling out
data of different types. It's very flexible, but arguably doesn't give you
the simplest API if all you want is an array of strings delimited by a
particular expression.
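
As a rough illustration of that mixed-type parsing, here is a minimal sketch. The class name ScannerSketch, the helper describe() and the sample record are made up for this example, and Locale.ROOT is set only so that the decimal point parses the same way on every machine.

```java
import java.util.Locale;
import java.util.Scanner;

class ScannerSketch {
    // Hypothetical helper: reads a name, a count and a price from one record.
    static String describe(String record) {
        // Locale.ROOT keeps decimal parsing independent of the default locale.
        try (Scanner scanner = new Scanner(record).useLocale(Locale.ROOT)) {
            String name = scanner.next();        // next whitespace-delimited token, as text
            int count = scanner.nextInt();       // next token, parsed as an int
            double price = scanner.nextDouble(); // next token, parsed as a double
            return name + " x" + count + " @ " + price;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("widget 3 4.5")); // widget x3 @ 4.5
    }
}
```

Getting the same result from split() would mean splitting first and then calling Integer.parseInt() and Double.parseDouble() on the pieces yourself.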

String.split() and Pattern.split() give you an easy syntax for doing the
latter, but that's essentially all that they do. If you want to parse the
resulting strings, or change the delimiter halfway through depending on a
particular token, they won't help you with that.
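
A small sketch of both calls, assuming a plain comma delimiter. The class and method names are invented for the example. The only real difference shown is that Pattern.split() lets you compile the expression once and reuse it.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitSketch {
    // Compiled once so the same pattern can be reused for many inputs.
    private static final Pattern COMMA = Pattern.compile(",");

    // String.split() takes the regular expression as a string on each call.
    static String[] viaString(String line) {
        return line.split(",");
    }

    // Pattern.split() uses the precompiled pattern.
    static String[] viaPattern(String line) {
        return COMMA.split(line);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(viaString("red,green,blue")));  // [red, green, blue]
        System.out.println(Arrays.toString(viaPattern("red,green,blue"))); // [red, green, blue]
    }
}
```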

StringTokenizer is even more restrictive than String.split(), and also a
bit fiddlier to use. It is essentially designed for pulling out tokens
delimited by fixed substrings. Because of this restriction, it's about
twice as fast as String.split(). (See my comparison of String.split() and
StringTokenizer.) It also predates the regular expressions API, of which
String.split() is a part.

You'll note from my timings that String.split() can still tokenize
thousands of strings in a few milliseconds on a typical machine. In
addition, it has the advantage over StringTokenizer that it gives you the
output as a string array, which is usually what you want. Using an
Enumeration, as provided by StringTokenizer, is too "syntactically fussy"
most of the time. From this point of view, StringTokenizer is a bit of a
waste of space nowadays, and you may as well just use String.split().
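
To make the "fussiness" concrete, here is a sketch that tokenizes the same sentence both ways. The class and method names are hypothetical, and a single space is assumed as the delimiter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerVersusSplit {
    // StringTokenizer hands tokens out one at a time, so you collect them yourself.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, " ");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // split() returns the whole array in one call.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.split(" "));
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        System.out.println(withTokenizer(text)); // [the, quick, brown, fox]
        System.out.println(withSplit(text));     // [the, quick, brown, fox]
    }
}
```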


Scenario 2:
I think the biggest difference is: with a StringTokenizer, the delimiter is
just one character long. You supply a list of characters that count as
delimiters, but in that list, each character is a single delimiter. With
split(), the delimiter is a regular expression, which is something much
more powerful (and more complicated to understand). It can be any length.
Regular expressions may be harder to understand at first, but when you
learn how to use them, they're much more useful.
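
The difference shows up as soon as the delimiter is longer than one character. In this sketch (class name and sample input are made up for illustration), both approaches are given "::" as the delimiter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterLengthSketch {
    // StringTokenizer treats each character of "::" as a separate delimiter.
    static List<String> withTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, "::");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    // split() treats "::" as one two-character delimiter.
    static List<String> withSplit(String text) {
        return Arrays.asList(text.split("::"));
    }

    public static void main(String[] args) {
        String text = "one::two:three";
        System.out.println(withTokenizer(text)); // [one, two, three]
        System.out.println(withSplit(text));     // [one, two:three]
    }
}
```

The tokenizer breaks on the lone colon as well, while split() only breaks on the full "::" sequence.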

Also, if you need to parse empty tokens, e.g. a comma-separated line like

one,,three,,,six

where the field values are "one", "", "three", "", "" and "six", with the
three empty strings indicated by the commas with nothing between them,
that's a lot more work with a StringTokenizer. By default it gives you just
"one", "three", "six" and skips the empties. You can use a special
constructor that takes a boolean to tell the StringTokenizer to return
delimiters, but that gets complicated too. I'll skip the details. It's much
easier to use split(","), which immediately returns {"one", "", "three",
"", "", "six"}, exactly right. The short version is: StringTokenizer
doesn't handle empty strings well, but split() does.
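
Here is a sketch that runs the comma-separated line above through both approaches. The class and method names are hypothetical. It also shows one caveat: split(",") drops empty strings at the very end of the result, and passing a negative limit keeps them.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

class EmptyFieldSketch {
    // Counts the tokens StringTokenizer reports; empty fields are skipped.
    static int tokenizerFieldCount(String line) {
        return new StringTokenizer(line, ",").countTokens();
    }

    // Keeps empty fields in the middle, but drops trailing empty strings.
    static String[] splitFields(String line) {
        return line.split(",");
    }

    // A negative limit keeps trailing empty strings too.
    static String[] splitKeepingTrailing(String line) {
        return line.split(",", -1);
    }

    public static void main(String[] args) {
        String line = "one,,three,,,six";
        System.out.println(tokenizerFieldCount(line));               // 3
        System.out.println(Arrays.toString(splitFields(line)));      // [one, , three, , , six]
        System.out.println(splitFields("one,,").length);             // 1
        System.out.println(splitKeepingTrailing("one,,").length);    // 3
    }
}
```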


Scenario 3:

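Sometimes you want split() to behave the same as StringTokenizer. You can
use regular expressions to achieve the same effect, for example
split(",+"), which treats a single comma or a run of several commas as one
delimiter. (A pattern like ",*" is not suitable, because it also matches
the empty string, so the input gets split between individual characters
too.) If you want the delimiter to be either a comma or a space, you can
use split("[ ,]"). And if you have a comma-separated string and want the
spaces around each comma trimmed away, you can use split(" *, *").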

Scenario 4:
Most programmers use the String.split(String) method to convert a String to
a String array, specifying a delimiter. However, I feel it's unsafe to rely
on the split() method in some cases, because it doesn't always behave the
way you'd expect: its argument is treated as a regular expression, not as a
literal string. For example, sometimes after calling split() the first
array element holds an empty string, which looks like a stray blank when
printed, even though the string contains no leading space; this happens
when the input begins with a match for the delimiter. Here's an example
where split() fails:

    public class StringTest {
        public static void main(String[] args) {
            // Intended as a literal caret, but split() reads it as a regex.
            final String SPLIT_STR = "^";
            final String mainStr = "Token-1^Token-2^Token-3";
            final String[] splitStr = mainStr.split(SPLIT_STR);
            // indexOf() takes a literal string, so it does find the caret.
            System.out.println("First Index Of ^ : " +
                    mainStr.indexOf(SPLIT_STR));
            for (int index = 0; index < splitStr.length; index++) {
                System.out.println("Split : " + splitStr[index]);
            }
        }
    }

This program outputs:

First Index Of ^ : 7
Split : Token-1^Token-2^Token-3

But the expected output would be:

First Index Of ^ : 7
Split : Token-1
Split : Token-2
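Split : Token-3

In this case, the split doesn't work because split() treats its argument as
a regular expression, and in a regular expression the caret is an anchor
that matches the start of the input rather than a literal "^" character.
The workaround is to escape the caret in the pattern passed to split(),
i.e. mainStr.split("\\^"). Leave SPLIT_STR itself as "^" for the indexOf()
call, because indexOf() takes a literal string and would return -1 for
"\\^". With that change, the output matches the expected output.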
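
If you'd rather not escape delimiters by hand, one option is Pattern.quote(), which wraps any string so the regex engine reads it literally. This is a sketch of that alternative, not something the example above uses; the class name and the splitLiteral() helper are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LiteralSplitSketch {
    // Hypothetical helper: splits on the delimiter taken as literal text.
    static String[] splitLiteral(String text, String delimiter) {
        // Pattern.quote() turns the delimiter into a literal pattern.
        return text.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        String mainStr = "Token-1^Token-2^Token-3";
        System.out.println(mainStr.indexOf("^"));                        // 7
        System.out.println(Arrays.toString(splitLiteral(mainStr, "^"))); // [Token-1, Token-2, Token-3]
        System.out.println(Arrays.toString(splitLiteral("a.b.c", "."))); // [a, b, c]
    }
}
```

This way the same unescaped delimiter string can be shared between split() and indexOf().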

A safer way to split the string would be by using the StringTokenizer API,
which treats its delimiter argument as literal characters rather than as a
regular expression. Here's an example:

    import java.util.StringTokenizer;

    public class StringTest {
        public static void main(String[] args) {
            // The delimiter is taken literally; no escaping is needed.
            final String SPLIT_STR = "^";
            final String mainStr = "Token-1^Token-2^Token-3";
            final StringTokenizer stToken = new StringTokenizer(
                    mainStr, SPLIT_STR);
            // countTokens() reports how many tokens are still to be read.
            final String[] splitStr = new String[stToken.countTokens()];
            int index = 0;
            while (stToken.hasMoreTokens()) {
                splitStr[index++] = stToken.nextToken();
            }
            for (index = 0; index < splitStr.length; index++) {
                System.out.println("Tokenizer : " + splitStr[index]);
            }
        }
    }

The output of the preceding program is:

Tokenizer : Token-1
Tokenizer : Token-2
Tokenizer : Token-3
