Questions tagged [tokenize]

Tokenizing is the act of splitting a string into discrete elements called tokens.

Tokenizing is the act of splitting a stream of text into discrete elements called tokens, using a delimiter present in the stream. The tokens can then be processed further, for example searched for a value or stored in an array for looping.

Example (VBA):

Dim tokens As Variant
Dim sampleString As String
Dim i As Long

sampleString = "The quick brown fox jumps over the lazy dog."

' tokenize string based on space delimiter
tokens = Split(sampleString, " ")

' list tokens
For i = LBound(tokens) To UBound(tokens)
  MsgBox tokens(i)
Next i
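
The same idea can be sketched in Python, this time also covering the "search for a value" use mentioned in the description. This is only a minimal illustration and not part of the original tag description: the helper name `tokenize` and the choice to drop empty tokens are assumptions made for the example.

```python
def tokenize(text, delimiter=" "):
    """Split text on a delimiter and drop empty tokens (an assumed policy)."""
    return [token for token in text.split(delimiter) if token]


sample_string = "The quick brown fox jumps over the lazy dog."

# Tokenize the string on the space delimiter
tokens = tokenize(sample_string)

# List the tokens, one per line
for token in tokens:
    print(token)

# Search the tokens for a value
print("fox" in tokens)
```

Note that punctuation stays attached to its token here (the last token is "dog."), because a plain delimiter split knows nothing about word boundaries.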

2964 questions
644 votes, 35 answers

Parse (split) a string in C++ using string delimiter (standard C++)

I am parsing a string in C++ using the following: using namespace std; string parsed,input="text to be parsed"; stringstream input_stringstream(input); if (getline(input_stringstream,parsed,' ')) { // do some processing. } Parsing with a…
TheCrazyProgrammer
475 votes, 37 answers

How do I tokenize a string in C++?

Java has a convenient split method: String str = "The quick brown fox"; String[] results = str.split(" "); Is there an easy way to do this in C++?
Bill the Lizard
470 votes, 17 answers

What is the easiest/best/most correct way to iterate through the characters of a string in Java?

Some ways to iterate through the characters of a string in Java are: Using StringTokenizer? Converting the String to a char[] and iterating over that. What is the easiest/best/most correct way to iterate?
Paul Wicks
389 votes, 17 answers

How to split a string in shell and get the last field

Suppose I have the string 1:2:3:4:5 and I want to get its last field (5 in this case). How do I do that using Bash? I tried cut, but I don't know how to specify the last field with -f.
cd1
191 votes, 4 answers

Looking for a clear definition of what a "tokenizer", "parser" and "lexers" are and how they are related to each other and used?

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a program that will go through c/h source files to extract…
lordhog
170 votes, 2 answers

How to use stringstream to separate comma separated strings

I've got the following code: std::string str = "abc def,ghi"; std::stringstream ss(str); string token; while (ss >> token) { printf("%s\n", token.c_str()); } The output is: abc def,ghi So the stringstream::>> operator can separate strings…
B Faley
168 votes, 10 answers

Scanner vs. StringTokenizer vs. String.Split

I just learned about Java's Scanner class and now I'm wondering how it compares/competes with the StringTokenizer and String.Split. I know that the StringTokenizer and String.Split only work on Strings, so why would I want to use the Scanner for a…
Dave
162 votes, 12 answers

How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also…
lizarisk
152 votes, 5 answers

Can a line of Python code know its indentation nesting level?

From something like this: print(get_indentation_level()) print(get_indentation_level()) print(get_indentation_level()) I would like to get something like this: 1 2 3 Can the code read itself in this way? All I want is the output from…
146 votes, 9 answers

NSString tokenize in Objective-C

What is the best way to tokenize/split a NSString in Objective-C?
ggarber
130 votes, 14 answers

Splitting string into multiple rows in Oracle

I know this has been answered to some degree with PHP and MySQL, but I was wondering if someone could teach me the simplest approach to splitting a string (comma delimited) into multiple rows in Oracle 10g (preferably) and 11g. The table is as…
marshalllaw
76 votes, 4 answers

How to get a Token from a Lucene TokenStream?

I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream. The worst part is that I'm looking at the comments in the JavaDocs that address my…
Eric Wilson
68 votes, 4 answers

Split string with PowerShell and do something with each token

I want to split each line of a pipe on spaces, and then print each token on its own line. I realise that I can get this result using: (cat someFileInsteadOfAPipe).split(" ") But I want more flexibility. I want to be able to do just about anything…
Pieter Müller
67 votes, 1 answer

Google Sites API full-text search does not work for non-Western languages

In my JavaEE application, I'm using the Atom-based Google Sites API to retrieve content from a non-public Google Site. In essence, we're using the Google Site as a lightweight CMS, and from within the application I use the API to retrieve the site…
Robby Cornelissen
66 votes, 2 answers

How do I tokenize a string sentence in NLTK?

I am using nltk, so I want to create my own custom texts just like the default ones on nltk.books. However, I've just got up to the method like my_text = ['This', 'is', 'my', 'text'] I'd like to discover any way to input my "text" as: my_text =…
diegoaguilar