
I wrote a small Python script to parse every row of a large data document.

I collected a few types of rows:

LLNNNLL [Mixed Data and Numbers] 1.650,00

NNNNNN-LNN [Mixed Data and Numbers] 49,00

LLNNNL [Mixed Data and Numbers] 208,00

LLNNNLLL [Mixed Data and Numbers] 3,00

This is my regex pattern:

pattern = "^([A-Z\-0-9]){4,10}.*\d+,\d{2}"

Is there a more accurate way to do this? E.g., how can I specify that each row must have at least numbers AND letters?

bit
  • Need more clarification on what you are currently doing. What's not working Or What's your expected output. –  Apr 07 '16 at 14:32
  • [It seems working well](https://regex101.com/r/mT0nQ1/1). What don't you like about the pattern? – Wiktor Stribiżew Apr 07 '16 at 14:33
  • This file contains more than 400 pages of this type. I only want a more accurate regex pattern, because with this one I extracted 1400 rows. Otherwise I just want to know whether this pattern is correct – bit Apr 07 '16 at 14:35
  • It seems to work, OK, but it also matches rows that have only numbers or only letters. I want to match only rows starting with an alphanumeric code, which can contain a '-' character – bit Apr 07 '16 at 14:36
  • So, the starting "word" should have both a letter and a digit? Use [`^(?=[\w-]*[A-Z])(?=[\w-]*[0-9])[A-Z0-9-]{4,10}.*\d+,\d{2}`](https://regex101.com/r/mT0nQ1/3) – Wiktor Stribiżew Apr 07 '16 at 14:47
  • @WiktorStribiżew: That looks like an answer to me, and it would be nice to have an explanation of it. – Scott Hunter Apr 07 '16 at 14:48
  • @ScottHunter: I am sorry, I do not feel I understand the question. Maybe the whole line should contain a digit and a letter? No idea. – Wiktor Stribiżew Apr 07 '16 at 14:50
  • Ah, I see. The `[\w-]` should actually be `.` – Wiktor Stribiżew Apr 07 '16 at 14:53

1 Answer


> how can I specify that each row must have at least numbers AND letters?

That can be done with the help of positive lookaheads.

pattern = "^(?=[^A-Z]*[A-Z])(?=\D*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}"

The `(?=[^A-Z]*[A-Z])` lookahead is triggered at the start of the string and requires at least one A-Z letter somewhere in the string. The `(?=\D*\d)` lookahead is triggered next (once the preceding lookahead has succeeded) and requires at least one digit. If the string contains no digit, the match fails.
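As a rough illustration of the two lookaheads, here is a minimal sketch. The sample rows are made up for this example (they are not from the asker's file), and `rows` and `matched` are just illustrative names.

```python
import re

# Pattern from the answer: both lookaheads run at the start of the line
# and scan the whole line for an uppercase letter and for a digit.
pattern = re.compile(r"^(?=[^A-Z]*[A-Z])(?=\D*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}")

# Hypothetical sample rows (assumed, not taken from the original document).
rows = [
    "AB123CD widget 1.650,00",   # letters and digits
    "123456-A12 widget 49,00",   # digits, hyphen, then a letter
    "1234567 999 12,00",         # no uppercase letter anywhere in the line
]

matched = [row for row in rows if pattern.search(row)]
for row in matched:
    print(row)
```

The third row is rejected by the first lookahead, because the line contains no A-Z letter at all.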

Also, if the number must be at the end of the row, append a `$` anchor (end of string).

Besides, note that `.*` is greedy, so it will "eat up" the digits that `\d+,\d{2}` is supposed to match, leaving only the single digit before the comma. This makes no difference here unless you want to capture the number; in that case, use lazy matching: `.*?`.
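The difference only shows up once a capturing group is placed around the number. The sketch below uses made-up rows, drops the lookaheads for brevity, and adds a capturing group that is not in the answer's pattern. The last variant, which lets the group include `.` thousands separators, is an assumption of mine rather than something the answer proposes.

```python
import re

row = "AB123C widget 208,00"  # hypothetical sample row

# Greedy .* leaves only one digit before the comma for the group.
greedy = re.search(r"^[A-Z0-9-]{4,10}.*(\d+,\d{2})", row)
# Lazy .*? stops as early as possible, so the group gets the whole number.
lazy = re.search(r"^[A-Z0-9-]{4,10}.*?(\d+,\d{2})", row)

greedy_amount = greedy.group(1)
lazy_amount = lazy.group(1)

# Assumption (not from the answer): allow "." inside the captured number
# so that amounts such as 1.650,00 are captured in full.
row_thousands = "AB123CD widget 1.650,00"
with_dots = re.search(r"^[A-Z0-9-]{4,10}.*?(\d[\d.]*,\d{2})", row_thousands)
thousands_amount = with_dots.group(1)

print(greedy_amount, lazy_amount, thousands_amount)
```

Without the extra `[\d.]*`, the lazy version would capture only `650,00` from the second row, since `\d+` cannot cross the dot.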

If the pattern should be case-insensitive, pass the `re.I` flag when compiling the pattern, or add the `(?i)` inline modifier at the start of the pattern.
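Both options can be sketched as follows, assuming a lowercase row invented for this example; `base`, `row`, and the three result names are illustrative.

```python
import re

base = r"^(?=[^A-Z]*[A-Z])(?=\D*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}"
row = "ab123cd widget 3,00"  # hypothetical lowercase row

case_sensitive = re.search(base, row)      # no uppercase letters: no match
with_flag = re.search(base, row, re.I)     # flag passed to the call
with_inline = re.search("(?i)" + base, row)  # inline modifier at pattern start

print(case_sensitive is None, with_flag is not None, with_inline is not None)
```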

UPDATE

If you want to limit the condition to the first non-whitespace chunk, you can use

^(?=[0-9-]*[A-Z])(?=[A-Z-]*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}
    ^^^^^^^         ^^^^^^^

which checks for a letter after zero or more digits or hyphens, and for a digit after zero or more letters or hyphens (see demo), or

^(?=\S*[A-Z])(?=\S*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}

which checks for a letter and for a digit, each after zero or more non-whitespace characters (`\S*`). See another demo
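To see why restricting the lookaheads to the first chunk matters, the following sketch compares the whole-line pattern with the two first-chunk variants. The rows are invented for this comparison, and the names `whole_line`, `first_chunk`, and `first_chunk_loose` are only labels for the three patterns above.

```python
import re

tail = r"[A-Z0-9-]{4,10}.*\d+,\d{2}"
whole_line = re.compile(r"^(?=[^A-Z]*[A-Z])(?=\D*\d)" + tail)
first_chunk = re.compile(r"^(?=[0-9-]*[A-Z])(?=[A-Z-]*\d)" + tail)
first_chunk_loose = re.compile(r"^(?=\S*[A-Z])(?=\S*\d)" + tail)

# Hypothetical rows (assumed for illustration only).
rows = [
    "AB123CD widget 1.650,00",  # mixed code
    "ABCDEFG widget 12,00",     # letters-only code, digits only in the amount
    "1234567 Widget 12,00",     # digits-only code, a letter later in the line
]

results = {
    row: (
        bool(whole_line.search(row)),
        bool(first_chunk.search(row)),
        bool(first_chunk_loose.search(row)),
    )
    for row in rows
}
for row, flags in results.items():
    print(row, flags)
```

The whole-line lookaheads accept all three rows, because the amount always supplies a digit and any uppercase letter later in the line supplies the letter. The two first-chunk variants accept only the first row.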

Wiktor Stribiżew
  • I also have rows which starts with Letter followed by numbers (NNNNNN-LNN [Mixed Data and Numbers] 49,00). Can you pattern match this type of row? – bit Apr 07 '16 at 15:02
  • Well, I do not understand what `N` and `L` stand for, let me try to help judging by the *starts with Letter followed by numbers*: [`^[A-Z][0-9]+.*\d+,\d{2}`](https://regex101.com/r/oL5mL8/1) – Wiktor Stribiżew Apr 07 '16 at 15:11
  • Does that help? Work? – Wiktor Stribiżew Apr 08 '16 at 07:29
  • Sorry for not writing for so long. Yes, L = letters, N = numbers. A code can start with letters OR numbers, but each code has to contain both letters and numbers – bit Apr 11 '16 at 11:11
  • Ok, so that condition only applies to the first non-whitespace chunk, right? Then `pattern = "^(?=[0-9-]*[A-Z])(?=[A-Z-]*\d)[A-Z0-9-]{4,10}.*\d+,\d{2}"` should work well. – Wiktor Stribiżew Apr 11 '16 at 11:14