Questions tagged [text-parsing]

Text parsing is a variation of parsing which refers to the action of breaking a stream of text into different components, and capturing the relationship between those components.

When the stream of text is arbitrary, parsing is often used to mean breaking the stream into constituent atoms (words or lexemes).
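As a minimal sketch of this kind of splitting, assuming that "atoms" means runs of word characters plus single punctuation marks (the pattern and the name `split_lexemes` are illustrative, not a standard definition):

```python
import re

# Illustrative pattern (an assumption, not a standard): a lexeme is either
# a run of word characters or one non-space, non-word character.
LEXEME_RE = re.compile(r"\w+|[^\w\s]")

def split_lexemes(text: str) -> list[str]:
    """Break arbitrary text into word and punctuation atoms."""
    return LEXEME_RE.findall(text)

print(split_lexemes("x = foo(42), bar!"))
# → ['x', '=', 'foo', '(', '42', ')', ',', 'bar', '!']
```

Whitespace is discarded here; a different application might keep it as its own kind of atom.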

When the stream of text corresponds to natural language, parsing is used to mean breaking the stream into natural language elements (words and punctuation) and discovering the structure of the text as phrases or sentences.

When the stream of text corresponds to a computer source language (or other formal language), parsing consists of applying one of a variety of parsing algorithms (ad hoc, recursive descent, LL, LR, Packrat, Earley, or others) to the source text, which is often first broken into lexemes by a lower-level parser called a "lexer". The goal is to verify that the text is valid in the source language and, often, to construct a parse tree representing the grammar productions used to tile the text.
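To make the lexer-plus-parser split concrete, here is a small recursive descent sketch for a toy arithmetic grammar. The grammar, the tuple-based tree shape, and the names `lex`, `Parser`, and `parse` are all assumptions chosen for illustration; real languages need considerably more.

```python
import re

# Toy grammar (assumed for this sketch):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")

def lex(source):
    """Lexer: break the source text into number and operator lexemes."""
    tokens, pos = [], 0
    source = source.rstrip()
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    """Recursive descent: one method per grammar production."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok.isdigit():
            return int(tok)
        if tok == "(":
            node = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

def parse(source):
    """Validate the source and return a parse tree as nested tuples."""
    parser = Parser(lex(source))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("1 + 2 * (3 - 4)"))
# → ('+', 1, ('*', 2, ('-', 3, 4)))
```

Invalid input such as `"1 +"` raises `SyntaxError`, which covers the validity check; the nested tuples stand in for the parse tree.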

1268 questions
2301
votes
21 answers

How to delete from a text file, all lines that contain a specific string?

How would I use sed to delete all lines in a text file that contain a specific string?
A Clockwork Orange
  • 23,913
  • 7
  • 25
  • 28
873
votes
21 answers

How to convert string representation of list to a list

I was wondering what the simplest way is to convert a string representation of a list like the following to a list: x = '[ "A","B","C" , " D"]' Even in cases where the user puts spaces in between the commas, and spaces inside of the quotes, I need…
harijay
  • 11,303
  • 12
  • 38
  • 52
102
votes
26 answers

Split string containing command-line parameters into string[] in C#

I have a single string that contains the command-line parameters to be passed to another executable and I need to extract the string[] containing the individual parameters in the same way that C# would if the commands had been specified on the…
Anton
  • 6,860
  • 12
  • 30
  • 26
80
votes
4 answers

Get integer value from malformed query string

I'm looking for a way to parse a substring using PHP, and have come across preg_match; however, I can't seem to work out the rule that I need. I am parsing a web page and need to grab a numeric value from the string, the string is like…
MonkeyBlue
  • 2,234
  • 6
  • 31
  • 41
76
votes
43 answers

Evaluating a string of simple mathematical expressions

Challenge Here is the challenge (of my own invention, though I wouldn't be surprised if it has previously appeared elsewhere on the web). Write a function that takes a single argument that is a string representation of a simple mathematical…
Noldorin
  • 144,213
  • 56
  • 264
  • 302
72
votes
4 answers

Difference between parsing a text file in r and rb mode

What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode? Especially when the text file in question may contain non-ASCII characters.
MxLDevs
  • 19,048
  • 36
  • 123
  • 194
67
votes
2 answers

What is CoNLL data format?

I am using an open source jar (Mate Parser) which outputs in the CoNLL 2009 format after dependency parsing. I want to use the dependency parsing results for Information Extraction, however, I only understand part of the output in the CoNLL data…
52
votes
12 answers

Replace a whole line where a particular word is found in a text file

How can I replace a particular line of text in a file using PHP? I don't know the line number. I want to replace a line containing a particular word.
kishore
  • 1,017
  • 3
  • 12
  • 21
46
votes
6 answers

How to get the first column of every line from a CSV file?

How do I get the first column of every line in an input CSV file and output it to a new file? I am thinking of using awk but am not sure how.
Junba Tester
  • 801
  • 2
  • 9
  • 15
36
votes
9 answers

PHP - parsing a txt file

I have a .txt file that has the following details: ID^NAME^DESCRIPTION^IMAGES 123^test^Some text goes here^image_1.jpg,image_2.jpg 133^hello^some other test^image_3456.jpg,image_89.jpg What I'd like to do, is parse this and get the values into a…
terrid25
  • 1,926
  • 8
  • 46
  • 87
35
votes
12 answers

Can awk deal with CSV file that contains comma inside a quoted field?

I am using awk to compute the sum of one column in a CSV file. The data format is something like: id, name, value 1, foo, 17 2, bar, 76 3, "I am the, question", 99 I was using this awk script to compute the sum: awk -F, '{sum+=$3} END…
maguschen
  • 765
  • 2
  • 8
  • 12
34
votes
9 answers

Python parsing bracketed blocks

What would be the best way in Python to parse out chunks of text contained in matching brackets? "{ { a } { b } { { { c } } } }" should initially return: [ "{ a } { b } { { { c } } }" ] putting that as an input should return: [ "a", "b", "{ { c }…
Martin
  • 12,469
  • 13
  • 64
  • 128
33
votes
4 answers

What does NN VBD IN DT NNS RB mean in NLTK?

When I chunk text, I get lots of codes in the output like NN, VBD, IN, DT, NNS, RB. Is there a list documented somewhere which tells me the meaning of these? I have tried googling nltk chunk code nltk chunk grammar nltk chunk tokens. But I am not…
Knows Not Much
  • 30,395
  • 60
  • 197
  • 373
29
votes
5 answers

Best way to get all digits from a string

Is there any better way to take a string such as "(123) 455-2344" and get "1234552344" from it than doing this: var matches = Regex.Matches(input, @"[0-9]+", RegexOptions.Compiled); return String.Join(string.Empty, matches.Cast() …
Chris Marisic
  • 32,487
  • 24
  • 164
  • 258
25
votes
13 answers

How should I detect which delimiter is used in a text file?

I need to be able to parse both CSV and TSV files. I can't rely on the users to know the difference, so I would like to avoid asking the user to select the type. Is there a simple way to detect which delimiter is in use? One way would be to read in…
samiz
  • 1,043
  • 1
  • 13
  • 21