Should I use Regex to parse my file, or is there a better way?

Question

I have a file with over 2000 lines that I need to parse. I want to make sure I get 100% accurate results, which will then be imported into my MariaDB database.

The file looks like this:

line 0: #start#
line 1: 20111211\200000
line 2: n=john|l=smith,131_stree_apt#102_city_state_11111_country,19989989988|17771112222,user%64domain.com,12,21,551|626|23\r
...
line 2156: #end#

Line 1 is the date and time in 24-hour format. Line 2 shows the record format:

n = name
l = last name
full address
phone + cell phone
email
total goals
total passes
time on ice + time on bench
penalty minutes

I can't figure out the regular expression. My other idea was to split the file into lines, then split each line on commas, then on pipes, and so on, but I think this approach is slower and less accurate than regex. Am I right?

"which i think is slow and less accurate then regexp" --- how can you compare 2 implementations if you haven't even made any of them yet? — zerkms, Dec 09 '11 at 00:39
@zerkms sorry if i'm not as smart as you, i'm just trying to understand and make my code as good as possible — Xin Qian Ch'ang, Dec 09 '11 at 00:42
learning is an iterative process. So start with just something **you can implement** and show us the result to review (there is even a stackexchange for it: http://codereview.stackexchange.com/) — zerkms, Dec 09 '11 at 00:43
The explode approach will be more flimsy than a specific validation regex, of course. -- What regex have you tried? Where did it fail? — mario, Dec 09 '11 at 00:43
@XinQianCh'ang Part of the process of making your code "as good as possible" is exploring cases like this by building a parser using delimiters, and a parser using regex, and comparing the results. If you haven't compared both cases, what is the basis for your "I think regex is faster and more accurate" theory? — Farray, Dec 09 '11 at 00:45

score 7 · Accepted Answer · edited May 23 '17 at 12:11


i can't figure out the regular expression so my idea was to parse each line and then parse each comma, then each pipe then .... which i think is slow and less accurate then regexp

Why don't you go and try it out? Don't let this intimidate you, be bold. In general, I'd do the following if I were you:

Make a straightforward implementation
Test it
Tune it

~2000 records is not that many, so the third step might not even be required (particularly if this is a migration that only runs once: so what if it takes 2 minutes?).

BTW: This is general programming advice and applies to a lot of problems. Don't optimize prematurely.
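A straightforward first implementation could be as small as the sketch below. It is only an illustration: the function name `parseRecordByDelimiters`, the array keys, and the assumption that every record has exactly 7 comma-separated fields (with the last one holding time on ice, time on bench and penalty minutes) are mine, inferred from the single sample line in the question.

```php
<?php
// Hypothetical helper, not from the answer: parse one record with explode() only.
// The field order and the array keys are assumptions based on the sample line.
function parseRecordByDelimiters(string $line): ?array
{
    $fields = explode(',', rtrim($line, "\r\n"));
    if (count($fields) !== 7) {
        return null; // wrong number of comma-separated fields
    }
    [$names, $address, $phones, $email, $goals, $passes, $times] = $fields;

    $nameParts  = explode('|', $names);
    $phoneParts = explode('|', $phones);
    $timeParts  = explode('|', $times);
    if (count($nameParts) !== 2 || count($phoneParts) !== 2 || count($timeParts) !== 3) {
        return null; // wrong number of pipe-separated sub-fields
    }
    if (strncmp($nameParts[0], 'n=', 2) !== 0 || strncmp($nameParts[1], 'l=', 2) !== 0) {
        return null; // missing n= or l= prefix
    }
    if (!ctype_digit($goals) || !ctype_digit($passes)) {
        return null; // counters must be plain digits
    }

    return [
        'first_name'      => substr($nameParts[0], 2),
        'last_name'       => substr($nameParts[1], 2),
        'address'         => $address,
        'phone'           => $phoneParts[0],
        'cell'            => $phoneParts[1],
        'email'           => $email,
        'goals'           => (int) $goals,
        'passes'          => (int) $passes,
        'time_on_ice'     => $timeParts[0],
        'time_on_bench'   => $timeParts[1],
        'penalty_minutes' => $timeParts[2],
    ];
}
```

For the "test it" step, run it over the whole file and log every line for which it returns `null`. Any field that can itself contain a comma would break this approach, so those rejected lines are worth reading by hand.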


answered Dec 09 '11 at 00:52

middus


+1 It will be as accurate as a regular expression if you understand the format fully and write correct code. As for speed, I think you'd have to try pretty hard to write an implementation that takes longer than a few seconds for 2000 records. – grossvogel Dec 09 '11 at 01:16

Halcyon · Answer 2 · 2011-12-09T00:49:13.797

2

Write a parser. Parsers are more powerful than regular expressions, and much easier to write and reason about.

Read the file character by character, and for each character decide what you want to do with it.

Initially you're reading the 'date', then when you find a newline you know you're done parsing the date.
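For the date line, a minimal sketch might look like this. It assumes the header always has the shape `YYYYMMDD\HHMMSS` with a literal backslash, as in the sample, and it assumes UTC because the question does not state a time zone. The function name `parseHeaderTimestamp` is made up for illustration.

```php
<?php
// Hypothetical helper: turn the header line (e.g. 20111211\200000) into a date object.
// Assumptions: literal backslash separator, 24h time, UTC time zone.
function parseHeaderTimestamp(string $line): ?DateTimeImmutable
{
    // Swap the backslash for a space so the format string stays simple.
    $normalized = str_replace('\\', ' ', rtrim($line, "\r\n"));

    $dt = DateTimeImmutable::createFromFormat(
        '!Ymd His',
        $normalized,
        new DateTimeZone('UTC')
    );

    // Formatting the result back must reproduce the input; this rejects
    // values that were silently rolled over (month 13, hour 25, ...).
    if ($dt === false || $dt->format('Ymd His') !== $normalized) {
        return null;
    }
    return $dt;
}
```

A `null` result means the line is not a valid timestamp, which is exactly the "something you didn't expect" case: stop and look at the data.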

Then you parse each record. First you expect to see an n, and you keep reading until you get a |. Then you expect an l, and you keep reading until you find a comma, and so on. If you ever find something you didn't expect, you know there's either a bug in your parser or an error in the data file.
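The expect-then-read-until idea can be sketched with a small cursor class. Everything here is hypothetical naming (`RecordReader`, `expect`, `readUntil`, `parseNames`), and for brevity it jumps to the next delimiter with `strpos` instead of looping over single characters; the control flow is the same. Only the first two fields are parsed, the rest would follow the same pattern.

```php
<?php
// Hypothetical cursor-based reader; class and method names are illustrative only.
final class RecordReader
{
    private string $input;
    private int $pos = 0;

    public function __construct(string $input)
    {
        $this->input = $input;
    }

    // Consume an exact literal, or fail loudly: a mismatch means a parser bug or bad data.
    public function expect(string $literal): void
    {
        if (substr($this->input, $this->pos, strlen($literal)) !== $literal) {
            throw new UnexpectedValueException("Expected '$literal' at offset {$this->pos}");
        }
        $this->pos += strlen($literal);
    }

    // Read up to (not including) the delimiter, then step over the delimiter.
    public function readUntil(string $delimiter): string
    {
        $end = strpos($this->input, $delimiter, $this->pos);
        if ($end === false) {
            throw new UnexpectedValueException("Missing '$delimiter' after offset {$this->pos}");
        }
        $value = substr($this->input, $this->pos, $end - $this->pos);
        $this->pos = $end + strlen($delimiter);
        return $value;
    }

    // Whatever has not been consumed yet.
    public function rest(): string
    {
        return substr($this->input, $this->pos);
    }
}

// Parse only the n=...|l=..., prefix of a record.
function parseNames(string $line): array
{
    $reader = new RecordReader($line);
    $reader->expect('n=');
    $first = $reader->readUntil('|');
    $reader->expect('l=');
    $last = $reader->readUntil(',');
    return ['first_name' => $first, 'last_name' => $last, 'rest' => $reader->rest()];
}
```

The exceptions carry the offset, so a bad record points you to the exact position where the data stopped matching your expectations.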

You will never know if you read the file perfectly; there is no 100%. There is only ever 'good enough'. This is a general law in Computer Science.

edited Dec 09 '11 at 00:49

answered Dec 09 '11 at 00:43

Halcyon


In addition to this, I'd like to say that parsers like this are commonly known as `Finite State Automata` – zerkms Dec 09 '11 at 00:44
Yea, but I feel adding jargon isn't going to help in this case ;) – Halcyon Dec 09 '11 at 00:45
Well, but it might help others finding your answer. So kudos to @zerkms for naming the child ;). – middus Dec 09 '11 at 00:53

score 1 · Answer 3 · answered Dec 09 '11 at 00:56


Obviously I won't give you the complete codez. But as a placeholder answer, and to showcase the basic approach:

preg_match('/
   ^
     n=(\w+)       # just alphanumerics
     \|
     l=(\w+)
     ,
     ([\w\h\#]+)    # mixture of letters and space and #
     ,
     ([^,]*)       # anything but commas
     ...
   $
  /x', $line, $match);

It just needs as many character classes and capture groups as you have fields in your pseudo-CSV line. \d+ might also be useful for fields that contain only digits.
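For illustration, one way the pattern could be filled out for the sample line is sketched here. This is a guess, not the definitive format: the field patterns, the named groups, the constant `RECORD_PATTERN` and the function `matchRecord` are all assumptions derived from the one example record in the question, so they will need loosening or tightening against the real data.

```php
<?php
// Field patterns are assumptions inferred from the single sample line in the question.
const RECORD_PATTERN = '/
    ^
      n=(?<first_name>\w+)
      \|
      l=(?<last_name>\w+)
      ,
      (?<address>[\w\h\#]+)              # letters, digits, underscore, space, hash
      ,
      (?<phone>\d+) \| (?<cell>\d+)
      ,
      (?<email>[^,]+)                    # anything but commas
      ,
      (?<goals>\d+)
      ,
      (?<passes>\d+)
      ,
      (?<time_on_ice>\d+) \| (?<time_on_bench>\d+) \| (?<penalty_minutes>\d+)
      \r?                                # tolerate a trailing carriage return
    $
/x';

// Return the named fields of one record, or null if the line does not match.
function matchRecord(string $line): ?array
{
    if (preg_match(RECORD_PATTERN, $line, $match) !== 1) {
        return null;
    }
    // Keep only the named groups; drop the numeric duplicates preg_match adds.
    return array_filter($match, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

Lines where `matchRecord` returns `null` are the ones to inspect: either the pattern is too strict or the record is malformed.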

Using basic string functions to write a fake parser is obviously not sensible here, when a regex can do exactly that more reliably and with less code.

mario


The OP does not give the impression of knowing his way around regex, though. So this might be a case of worse is better. – middus Dec 09 '11 at 01:10
Validation is still the best technical approach, even if unsuitable for the OP's experience. – mario Dec 09 '11 at 01:13
I didn't question that, I +1d. – middus Dec 09 '11 at 01:27
