How to match accented characters but not tab

Question

I'm trying to match the company name in this string delimited with tabs.
Below table does not have tabs when you copy it, but I have replaced tabs with two spaces, which I assume will work fine for testing.

1025164  HERBEX IBERIA, S.L.U.  KY01  4600292091    
1016379  DRISCOLL´S OF EUROPE B.V.  KY01  4600322589    
1008809  LANDGARD NORD OBST & GEMÜSE GM  KY01  4600347315   
1008835  C.A.S.I. : COOPERATIVA PROVINC  KY01  4600348112   
1019258  SYDGRÖNT EKONOMISK FÖRENING  KY02  4600343422

(The second column of the above, between 7 digit number and KY0 above)
In real life the columns are not always in the same order since it's a user preference.

I just took a few examples but names could also include /éèáà()´, pretty much anything (sadly).

I found another question here Concrete Javascript Regex for Accented Characters (Diacritics)
When I use the regex patterns in that thread, example: "\t([A-zÀ-ÿ0-9\s\.\,\_\-\'\&]+)\t" (I know some characters are still missing) to match between two tabs it becomes greedy and matches the whole line.

Is there any pattern that could match any character in a company name between tabs (or two spaces as the example above)?

If your company names don't have numbers try this pattern `(\t)(\D*?)(\t)` https://regex101.com/r/KweS9d/1 — Pᴇʜ, May 25 '21 at 12:48

JvdV · Accepted Answer · 2021-05-25T13:26:28.370

2

Instead of returning a matched part, I matched everything and replaced it with the 1st capture group. Hope it helps.

Sub Test()

Dim str As String: str = "1025164" & vbTab & "HERBEX IBERIA, S.L.U." & vbTab & "KY01" & vbTab & "4600292091"

With CreateObject("vbscript.regexp")
    .Global = True
    .Pattern = "(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)"
    Debug.Print .Replace(str, "$1")
End With

End Sub

Have a look at this online demo to test the pattern:

(?:^|\t) - Match either start line anchor or a tab. Unfortunately the VBA-regex object does not support lookbehinds.
(?: - Open a non-capture group to start matching all parts you don't want to capture first:
- \d+ - match 1+ digits;
- | - Or:
- KY\d+ - Match "KY" followed by 1+ digits;
- | - Or:
- ([^\t]+) - nest a capture group to capture 1+ non-tabs.
- ) - Close non-capture group.
(?=\t|$) - Positive lookahead to assert captured text is followed by either a tab or end-line anchor.

edited May 25 '21 at 13:26

answered May 25 '21 at 12:46

JvdV

70,606
8
39
70

OP said: *"In real life the columns are not always in the same order since it's a user preference."*. Therefore this code won't work if columns change (yes data example does not reflect that properly). • He actually wants that part `HERBEX IBERIA, S.L.U.` and the only thing he knows it is in between 2 tabs. – Pᴇʜ May 25 '21 at 12:49
I suggested `(\t)(\D*?)(\t)` if company names have no numbers. – Pᴇʜ May 25 '21 at 12:52
@Pᴇʜ Sadly I found `3 M SVENSKA AB` :-/ – Andreas May 25 '21 at 12:53
@Andreas well then it might get a bit hard because you would have to exclude all the other stuff that is also between 2 tabs. – Pᴇʜ May 25 '21 at 12:55
I know it's hard. That is why I asked the qustion ;-) – Andreas May 25 '21 at 12:56
But Pᴇʜ is correct, I can't know for sure what columns and/or what order they are in. – Andreas May 25 '21 at 12:57
Try this `(\t)(?!KY)(.*?)(\t)` https://regex101.com/r/ATWw8B/1 it worked for my test batch. – Pᴇʜ May 25 '21 at 12:59
1

@Pᴇʜ, that is not right if the 1st column contains the info to be pulled I guess (or the last) since there won't be leading/trailing tabs I assume. – JvdV May 25 '21 at 13:00
No there shouldn't be any spaces or something like that. I can't be sure 100% sure, but I'd say 99.9% sanitized. – Andreas May 25 '21 at 13:03
1

Right, this is a shot in the dark but try: `(?:^|\t)(?:\d+|KY\d+|([^\t]+))(?=\t|$)` and retrieve the 1st submatch – JvdV May 25 '21 at 13:04
@JvdV That is a good observation. If it's the first column then it can't be `\t(...)\t` – Andreas May 25 '21 at 13:05
@JvdV yes your last one looks like a good catch https://regex101.com/r/0SpKuA/1 – Pᴇʜ May 25 '21 at 13:06
@JvdV first match of that regex returned the seven digit number with the same layout as above – Andreas May 25 '21 at 13:06
Huh... why didn't that work in SAP then.. Let me try again. – Andreas May 25 '21 at 13:07
@Andreas because RegEx != RexEx ^^ There are different falvors. – Pᴇʜ May 25 '21 at 13:07
I know there are differences but I always thought regex101 and Excel used the same. Sorry about the image but here is the result: https://i.stack.imgur.com/KJbwd.png – Andreas May 25 '21 at 13:12
Sorry I need to go and pick up my kids now, will get back on this tomorrow morning. – Andreas May 25 '21 at 13:15
1

@JvdV Edit looks good to me, I'm just not sure which RexEx flavor Excel actually uses, so it might be different, but it looks like it should work in most flavors. – Pᴇʜ May 25 '21 at 13:25
@Pᴇʜ, VBScript regular expressions uses the IE 5.5 implementation which is a JavaScript regular expression flavor but lacking quite a bit of features otherwise supported. A lookbehind being one of the more known ones. – JvdV May 25 '21 at 13:30
@JvdV I knew there was something why I can't remember the flavor of VBA. ^^ You always have to try it to be sure it works :/ • Thx for the Info! – Pᴇʜ May 25 '21 at 13:43
2

I can work with this! Only if I move a column that I highly doubt anyone uses as the first column does it "not work", see yellow highlighted https://i.stack.imgur.com/B4DbL.png but I believe I can just replace(str, "Planerad", "") and that too is solved. The excess can be trimmed with split on tab. Thank you both I would not have gotten here without you. – Andreas May 26 '21 at 05:10

score 1 · Answer 2 · answered May 25 '21 at 12:57

1

I would use a different attempt using the split-command. The following code assumes that you have Tabs as separator and that the company name is found if the column is not numeric (only digits) and does not start with 'KY'.

Function getCompanyName(line As String) As String
    Const separator = vbTab ' Replace with "  " if you need that.
    
    Dim tokens() As String, i As Integer
    tokens = Split(line, separator)
    For i = 0 To UBound(tokens)
        If Not IsNumeric(tokens(i)) And Left(tokens(i) <> "KY") Then
            getCompanyName = tokens(i)
            Exit Function
        End If
    Next
End Function

answered May 25 '21 at 12:57

FunThomas

23,043
3
18
34

Interesting approach. It will not work in the way you posted but it could work with some more checks. I trimmed the table a bit, it has 20-ish columns where two of them is custom text. It seems these custom text fields are only used to (misspell) a few words. So perhaps I can replace with nothing and if len() is less than 3 characters then it's not a company name. I will look at this further if there is no other perfect regex coming. – Andreas May 25 '21 at 13:02
No matter which way you go, you need to be able to tell if (or if not) you have found the company name. – FunThomas May 25 '21 at 14:05

How to match accented characters but not tab

2 Answers2