2

I have a big .csv file (~600k lines, 56Mo), and inside there is database lines (on each line, there's an id, a client name, a client address, a client birthday date, etc). The problem is that, on some lines, some data is written badly (commas not supposed to be there, that mess up the columns).

I guessed that I had to do some RegEx to detect the problematic lines, and to replace the wrong commas with a dash or anything. I followed this article, and, after some tries, I got him to detect the messed-up lines.

Private Sub simpleRegex()
Dim strPattern As String: strPattern = "[^a-zA-Z0-9_,\-]([A-z]+)\,[^a-zA-Z0-9_,\-]([A-z]+)"

Dim strReplace As String: strReplace = "[^a-zA-Z0-9_,\-][A-z]+\-[^a-zA-Z0-9_,\-][A-z]"

Dim regEx As Object
Set regEx = CreateObject("VBScript.RegExp")
Dim strInput As String
Dim Myrange As Range

Set Myrange = ActiveSheet.Range("A1:A2000")

For Each cell In Myrange
    If strPattern <> "" Then
        strInput = cell.Value

        With regEx
            .Global = True
            .MultiLine = True
            .IgnoreCase = False
            .Pattern = strPattern
        End With

        If regEx.Test(strInput) Then
            MsgBox (regEx.Replace(strInput, strReplace))
        Else

        End If
    End If
Next End Sub

The problem is, this solution works if I wanted to change the cibled lines with an unique value, a predefined string (like strReplace="replacement words"). What I want, is to be able to target a succession of characters that match my pattern, and to replace only one character (the comma) in it. An example of it would be :

728,"HAY,HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

into :

728,"HAY-HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

Do you have a solution?

(Sorry if bad english, it's not my mother tongue).

Community
  • 1
  • 1
Alan
  • 139
  • 10
  • Your regex [does not match the provided string](https://regex101.com/r/PtRv8y/1). Also, `strReplace` should be a replacement pattern, not a regex pattern. Check [backreferences](http://www.regular-expressions.info/replacebackref.html), too. – Wiktor Stribiżew Nov 07 '16 at 08:10
  • The Regex I wrote seems to work on some lines only, thanks for pointing that out for me, I'm not exactly an expert in Regex. And thank you for the reference, I'm gonna read it out (my mother tongue is French, so I have some difficulties reading big articles like this one). – Alan Nov 07 '16 at 09:17
  • It seems you need to match substrings enclosed with double quotation marks. Is the double quote used as a field qualifier in this CSV? Are there cases where `"` does not delimit a field? – Wiktor Stribiżew Nov 07 '16 at 09:22
  • No, in fact, it seems like everytime I have a problematic comma that messes up the columns, it's a comma between two double quotes. Here is how it looks like by the way : http://puu.sh/s9mC4/d870a2d915.png – Alan Nov 07 '16 at 09:26

3 Answers3

1

You can use (?<=(Your Word)) to catch specific characters after a specific word. In your case, this code will help you find the comma:

(?<=(HAY))\,

Update:

Try this and I also updated the demo:

,(?=[^"]+")

Demo: https://regex101.com/r/0rtcFt/6

Ibrahim
  • 6,006
  • 3
  • 39
  • 50
  • And if there isn't a "HAY"? what about "HAZ" or "lhzikjrtjfnosrg"? Check *hardcoded* for every possible combination? – Dirk Reichel Nov 07 '16 at 08:42
  • Thanks for the answer! But Dirk is right, the problem is, I took "HAY,HAYE" as an example, but I also have "BREMAND,BREMAUD", or stuff like that, and there are way too many lines to hardcode all of them. – Alan Nov 07 '16 at 08:53
  • @DirkReichel I updated my code, thank you for pointing that out. – Ibrahim Nov 07 '16 at 09:10
  • @Alan The code is updated. Let me know if it works for you. – Ibrahim Nov 07 '16 at 09:10
  • Seems to work for me ! I'll just have to adapt the RegEx a little bit, because I have other issues (the same issues with different columns in fact, like columns getting messed up with decimal numbers), but I should be able to work it out ! The last problem I have is to understand how to use the replacement variable, and how to give him something else than a fixed value (to just change the comma into something else). – Alan Nov 07 '16 at 09:15
  • The first regex won't work as VBA regex does not support lookbehinds, and the second one will fail as soon as there is a space or a non-alpha char between the comma and the `"` ([demo](https://regex101.com/r/0rtcFt/3)). – Wiktor Stribiżew Nov 07 '16 at 09:21
  • @WiktorStribiżew Thank you for the feedback. I updated my code. – Ibrahim Nov 07 '16 at 09:37
  • `,(?=[a-zA-Z\s@#$%^&*!()-_]+")` still [won't work in all cases](https://regex101.com/r/0rtcFt/5). You cannot solve the problem with a lookaround approach. – Wiktor Stribiżew Nov 07 '16 at 09:38
  • @WiktorStribiżew check out the new demo. – Ibrahim Nov 07 '16 at 09:40
  • And if we had the characters that are currently ignored? It'd be something like : ,(?=[a-zA-Zéèê_\-]+"), would it work? – Alan Nov 07 '16 at 09:43
  • *You cannot solve the problem with a lookaround approach* because a lookahead is not aware if the `"` ahead is the leading or trailing field delimiter. – Wiktor Stribiżew Nov 07 '16 at 09:43
  • Actually, it would work (for me). The lines I have aren't that messed up, I don't have strange characters like # or anything. So, for my case I'd work, but yep, I guess it wouldn't work for all the cases. – Alan Nov 07 '16 at 09:47
  • @Alan I think so. Try it out, and let me know if you had any problems. – Ibrahim Nov 07 '16 at 09:49
  • @WiktorStribiżew Well, it gets things done mostly, and that's what's count. Things don't have to be perfect all the time. – Ibrahim Nov 07 '16 at 09:50
  • @Ibrahim Well, in fact it doesn't work. He doesn't consider _, so when I have something like "HAY, LE HAYE", it goes undetected. – Alan Nov 07 '16 at 09:56
  • @Alan that's strange, because it works in regex101. check it out here: https://regex101.com/r/0rtcFt/9 – Ibrahim Nov 07 '16 at 10:00
  • @Ibrahim Well, two options : either I go with the ,(?=[^"]+") expression, and it selects all the commas in the line ; or I go with the ,(?=[a-zA-Z_]+") expression, and it doesn't get any comma ! (the line I used to test : 354,Décès,"HAY,LE HAYE",Marie,NULL,"HAY,LE HAYE",François,BOURASSEAU,Jeanne,FLOCELLIÈRE (LA),85,05/06/1820,294/367; ) – Alan Nov 07 '16 at 10:08
1

The correct approach here (since you commented that double quotes only appear as field delimiters) is to match double quoted substrings with a simple "[^"]+" regex and replace commas with hyphens only inside the matches.

Here is a sample code:

Sub CallbackTest()
Dim rxStr As RegExp
Dim s As String
Dim m As Object

s = """SOME,MORE,HERE"",728,""HAY,HAYE"",Marie,François,RAUTUREAU,85,29/05/1856,68;"

Set rxStr = New RegExp
rxStr.pattern = """[^""]+"""
rxStr.Global = True

For Each m In rxStr.Execute(s)
   s = Left(s, m.FirstIndex) & Replace(m.Value, ",", "-") & Mid(s, m.FirstIndex + Len(m.Value) + 1)
Next m
Debug.Print s              ' Print demo results
' => "SOME-MORE-HERE",728,"HAY-HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68;

End Sub
Community
  • 1
  • 1
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • Thanks a lot ! So, if I want to adapt this to my file, I have to remove the variable s, and to replace it with all the cells/lines of the file? – Alan Nov 07 '16 at 10:10
  • The `s` is the cell value, the input. You may use a temp value that you can assign the original cell value, and only change the cell contents if it was changed. Or add a `If rxStr.Test(s) Then For Each ... End If`. – Wiktor Stribiżew Nov 07 '16 at 10:27
1

If I got you correct, then there is no need for Regex at all.

With a simple Split you can do it too.

Private Sub simpleReplace()
  Dim str() As String, cell As Variant, Myrange As Range, i As Long
  Set Myrange = ActiveSheet.Range("A1:A2000")
  For Each cell In Myrange
    str = Split(cell.Value, """")
    If UBound(str) Then
      For i = 1 To UBound(str) Step 2
        str(i) = Replace(str(i), ",", "-")
      Next
      cell.Value = Join(str, """")
    End If
  Next
End Sub

this will split your 728,"HAY,HAYE",Marie,François,RAUTUREAU,85,29/05/1856,68; into:

(0) 728,
(1) HAY,HAYE
(2) ,Marie,François,RAUTUREAU,85,29/05/1856,68;

Now every second part of the Split (odd numbers) will be normally enclosed in ". So all that is left, is to Replace the commas there and push it into the cell again.

And if there is no " found, then it will skip this line (because the upper bound is 0)

If you still have any Questions or if that is not what you are looking for, pls tell :)

Dirk Reichel
  • 7,989
  • 1
  • 15
  • 31
  • Thanks a lot! It seems to work correctly, I'm gonna try to insert all the data in my tables in Postgresql, I keep you in touch ! – Alan Nov 07 '16 at 11:11
  • you are welcome... just tell if there is any trouble :) – Dirk Reichel Nov 07 '16 at 12:12
  • Well.. Turns out my file was corrumpted, and I didn't even had to modify my file. The strange characters, the double quotes, ..., I don't even know how they appeared, but they weren't supposed to be there. I lost a week-end of my life. Sigh. But thanks for the help ! – Alan Nov 07 '16 at 15:48