3

I have a file with over 2000 lines that I need to parse. I want to make sure I get 100% accurate results, which will then be imported into my MariaDB database.

The file looks like this:

line 0: #start#
line 1: 20111211\200000
line 2: n=john|l=smith,131_stree_apt#102_city_state_11111_country,19989989988|17771112222,user%64domain.com,12,21,551|626|23\r
...
line 2156: #end#

So line 1 is the date and time in 24-hour format, and line 2 shows the record format:

  • n = name
  • l = last name
  • full address
  • phone + cell phone
  • email
  • total goals
  • total passes
  • time on ice + time on bench
  • penalty minutes

I can't figure out the regular expression. My other idea was to split each line on commas, then split each field on pipes, and so on, but I think this approach is slower and less accurate than a regex. Am I right?

Farray
Xin Qian Ch'ang
  • "which i think is slow and less accurate then regexp" --- how can you compare 2 implementations if you haven't even made any of them yet? – zerkms Dec 09 '11 at 00:39
  • @zerkms Sorry if I'm not as smart as you; I'm just trying to understand and make my code as good as possible. – Xin Qian Ch'ang Dec 09 '11 at 00:42
  • Learning is an iterative process. So start with just something **you can implement** and show us the result to review (there is even a stackexchange for it: http://codereview.stackexchange.com/) – zerkms Dec 09 '11 at 00:43
  • The explode approach will be more flimsy than a specific validation regex, of course. -- What regex have you tried? Where did it fail? – mario Dec 09 '11 at 00:43
  • @XinQianCh'ang Part of the process of making your code "as good as possible" is exploring cases like this by building a parser using delimiters, and a parser using regex, and comparing the results. If you haven't compared both cases, what is the basis for your "I think regex is faster and more accurate" theory? – Farray Dec 09 '11 at 00:45

3 Answers

7

i can't figure out the regular expression so my idea was to parse each line and then parse each comma, then each pipe then .... which i think is slow and less accurate then regexp

Why don't you go and try it out? Don't let this intimidate you; be bold. In general, I'd do the following if I were you:

  1. Make a straightforward implementation
  2. Test it
  3. Tune it

~2000 records is not that many, so the third step might not even be required, especially if this is a migration that only runs once. So what if it takes 2 minutes?

BTW: This is general programming advice and applies to a lot of problems. Don't optimize prematurely.
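As a rough illustration of step 1, here is a minimal sketch of the split-on-delimiters idea in PHP. The function name `parseRecordWithExplode`, the array keys, and the field order are assumptions read off the sample line in the question; in particular, treating the last field as time on ice, time on bench, and penalty minutes is a guess.

```php
<?php
// Hypothetical helper: split one record line on commas, then on pipes.
// The field layout is assumed from the sample line, not from a spec.
function parseRecordWithExplode(string $line): ?array
{
    // Drop the trailing "\r" (and "\n") before splitting.
    $fields = explode(',', rtrim($line, "\r\n"));
    if (count($fields) !== 7) {
        return null; // unexpected shape: report it instead of guessing
    }

    $name   = explode('|', $fields[0]);
    $phones = explode('|', $fields[2]);
    $times  = explode('|', $fields[6]);
    if (count($name) !== 2 || count($phones) !== 2 || count($times) !== 3) {
        return null;
    }
    // The name parts must carry their "n=" and "l=" prefixes.
    if (strpos($name[0], 'n=') !== 0 || strpos($name[1], 'l=') !== 0) {
        return null;
    }

    return [
        'first_name'      => substr($name[0], 2),
        'last_name'       => substr($name[1], 2),
        'address'         => $fields[1],
        'phone'           => $phones[0],
        'cell'            => $phones[1],
        'email'           => $fields[3],
        'goals'           => (int) $fields[4],
        'passes'          => (int) $fields[5],
        'ice_time'        => (int) $times[0],
        'bench_time'      => (int) $times[1],
        'penalty_minutes' => (int) $times[2],
    ];
}
```

Returning `null` for any line that does not have the expected shape helps with step 2: after a run you can count the rejected lines and inspect them by hand before anything reaches the database.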

middus
  • +1 It will be as accurate as a regular expression if you understand the format fully and write correct code. As for speed, I think you'd have to try pretty hard to write an implementation that takes longer than a few seconds for 2000 records. – grossvogel Dec 09 '11 at 01:16
2

Write a parser. Parsers are more powerful than regular expressions, and much easier to write and reason about.

Read the file character by character, and for each character decide what you want to do with it.

Initially you're reading the 'date', then when you find a newline you know you're done parsing the date.

Then you parse each record. First you expect to see an n, and you keep reading till you get a |; then you expect an l, and you keep reading till you find a comma, and so on. If you ever find something you didn't expect, you know there's either a bug in your parser or an error in the data file.
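The record-parsing step described above could be sketched roughly as follows in PHP. The class `LineReader`, its methods, and the function `parseRecordByCharacter` are hypothetical names made up for this illustration, and the field order is assumed from the sample line in the question. The sketch only checks the structure of a line; it does not validate what is inside each field.

```php
<?php
// Hypothetical cursor over one line; all names here are illustrative.
final class LineReader
{
    private $text;
    private $pos = 0;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    // Consume an exact literal, or fail loudly with the offset.
    public function expect(string $literal): void
    {
        if (substr($this->text, $this->pos, strlen($literal)) !== $literal) {
            throw new UnexpectedValueException(
                "Expected '$literal' at offset {$this->pos}"
            );
        }
        $this->pos += strlen($literal);
    }

    // Read one character at a time up to the delimiter, which is consumed too.
    public function readUntil(string $delimiter): string
    {
        $value = '';
        $length = strlen($this->text);
        while ($this->pos < $length) {
            $char = $this->text[$this->pos++];
            if ($char === $delimiter) {
                return $value;
            }
            $value .= $char;
        }
        throw new UnexpectedValueException("Missing '$delimiter' before end of line");
    }

    // Everything from the cursor to the end of the line.
    public function rest(): string
    {
        $rest = (string) substr($this->text, $this->pos);
        $this->pos = strlen($this->text);
        return $rest;
    }
}

// Assumed layout: n=..|l=..,address,phone|cell,email,goals,passes,ice|bench|penalty
function parseRecordByCharacter(string $line): array
{
    $reader = new LineReader(rtrim($line, "\r\n"));
    $record = [];
    $reader->expect('n=');
    $record['first_name'] = $reader->readUntil('|');
    $reader->expect('l=');
    $record['last_name'] = $reader->readUntil(',');
    $record['address'] = $reader->readUntil(',');
    $record['phone'] = $reader->readUntil('|');
    $record['cell'] = $reader->readUntil(',');
    $record['email'] = $reader->readUntil(',');
    $record['goals'] = (int) $reader->readUntil(',');
    $record['passes'] = (int) $reader->readUntil(',');
    $record['ice_time'] = (int) $reader->readUntil('|');
    $record['bench_time'] = (int) $reader->readUntil('|');
    $record['penalty_minutes'] = (int) $reader->rest();
    return $record;
}
```

Because every unexpected character raises an exception that names the offset, a failing line tells you where to look, whether the fault is in the parser or in the data.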

You will never know whether you read the file perfectly; there is no 100%. There is only ever 'good enough'. This is a general law in computer science.

Halcyon
1

Obviously I won't give you the complete codez. But as a placeholder answer, and to showcase the basic approach:

preg_match('/
   ^
     n=(\w+)       # just alphanumerics
     \|
     l=(\w+)
     ,
     ([\w\h\#]+)    # mixture of letters and space and #
     ,
     ([^,]*)       # anything but commas
     ...
   $
  /x', $line, $match);

It just needs as many character classes and capture groups as you have fields in your pseudo-CSV line. \d+ for matching just decimal digits might also be useful.
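For illustration only, here is one way such a pattern might be filled out for the sample line in the question. It is a sketch under assumptions, not necessarily what the elided part above stood for: the constant `RECORD_PATTERN`, the function `parseRecordWithRegex`, the group names, and the guess that the last field holds three pipe-separated numbers are all made up for this example.

```php
<?php
// Assumed field layout, taken from the sample line in the question.
const RECORD_PATTERN = '/
    ^
    n=(?<first_name>\w+)        # first name, word characters only
    \|
    l=(?<last_name>\w+)         # last name
    ,
    (?<address>[\w\h\#]+)       # letters, digits, underscore, blanks and hash
    ,
    (?<phone>\d+) \| (?<cell>\d+)
    ,
    (?<email>[^,]+)             # anything but commas
    ,
    (?<goals>\d+)
    ,
    (?<passes>\d+)
    ,
    (?<ice_time>\d+) \| (?<bench_time>\d+) \| (?<penalty_minutes>\d+)
    \r?                         # tolerate the trailing carriage return
    $
/x';

// Hypothetical wrapper: returns the named fields as strings, or null on mismatch.
function parseRecordWithRegex(string $line): ?array
{
    if (preg_match(RECORD_PATTERN, $line, $match) !== 1) {
        return null;
    }
    // Keep only the named groups, dropping the numeric duplicates.
    return array_filter($match, 'is_string', ARRAY_FILTER_USE_KEY);
}
```

The `\r?` before `$` matters for this file, since the sample line ends in a carriage return that would otherwise make every record fail to match.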

Using basic string functions to write a fake parser is obviously not sensible here, when a regex can do exactly that more reliably and with less code.

mario