-2

I have some data that looks like this (more than 400 columns) :

year ID fake_num1 fake_num2 text1
2019 11 36 000 10'000 text, 1
2020 12 -1 275 1 000,00 text 2

Columns fake_num1 and fake_num2 are stored as text. What I'm trying to achieve is

  1. Identify those fake numbers columns
  2. Clean the data (e.g. remove space, columns, replace comma by points) with a for loop

I need some help with step 1. I have to identify columns fake_num1 and fake_num2, while avoiding columns like text1. I was thinking of going with regexp but maybe there is another solution.

I used part of the code here: SO regexp, however I am not sure how to proceed from there.

Dim strPattern as String: strPattern = "^[0-9]$"

will find anything that starts and ends with a number, and only has numbers (if my comprehension is correct). What's the best way to manage the cases listed in the table above ?

braX
  • 11,506
  • 5
  • 20
  • 33
DanielMc
  • 139
  • 12
  • Where is the data being stored, in an app (word, Excel?) a text file?. If you are using VBA then a function which tests a string to see if it only contains 'number characters' (i.e. the characters 0 to 9 and any number formatting chars like ".,-+" characters might be a way to go – freeflow Dec 28 '21 at 10:30
  • If a cel value in a column is not a number, should this mean that all that column is fake? If so, you should enumerate situations to make the value a number. Which is the decimal separator? A dot? Should `1 000,00` become `1000.00`? – FaneDuru Dec 28 '21 at 10:32
  • @freeflow It's stored in a csv file that I import into Excel. – DanielMc Dec 28 '21 at 10:32
  • @FaneDuru the decimal separator is the comma. The thousand separator is sometimes the space, sometimes the apostrophe. The posibilities are listed in the above table. – DanielMc Dec 28 '21 at 10:34
  • @freeflow How would you write the regexp to handle what you mention ? – DanielMc Dec 28 '21 at 10:34
  • I am asking about your real separator to make it a number... Otherwise, why `1 000,00` cannot be seen as a number? Not characters to be replaced. Why did you say "replace comma by points", if comma is the decimal separator? – FaneDuru Dec 28 '21 at 10:36
  • Oh ! the real one is the point, so ´1 000,00´ nor ´1 000´ aren't seen as numbers. – DanielMc Dec 28 '21 at 10:39

1 Answers1

2

Please, try the next code, It considers "fake numbers columns" as ones where replacing the necessary characters makes from string a number:

Sub testMakeNumbers()
 Dim sh As Worksheet, lastR As Long, lastCol As Long, i As Long, rngCol As Range
 
 Set sh = ActiveSheet 'you can use here the necessary sheet
 lastR = sh.Range("A" & sh.rows.Count).End(xlUp).row
 lastCol = sh.cells(1, Columns.Count).End(xlToLeft).Column
 
 'determine the problematic columns:
 For i = 1 To lastCol
    If Not IsNumeric(sh.cells(2, i).Value) And _
            IsNumeric(Replace(Replace(Replace(sh.cells(2, i).Value, " ", ""), "'", ""), ",", ".")) Then
        If rngCol Is Nothing Then
            Set rngCol = sh.cells(2, i)
        Else
            Set rngCol = Union(rngCol, sh.cells(2, i))
        End If
    End If
 Next
 'replace the characters making the string as number:
 With Intersect(rngCol.EntireColumn, sh.Range("A2", sh.cells(lastR, lastCol)))
       .Replace ",", "."
       .Replace Chr(160), ""
       .Replace " ", ""
       .Replace "'", ""
  End With
End Sub
FaneDuru
  • 38,298
  • 4
  • 19
  • 27
  • It seems it should work, but I don't know why it doesn't replace the space. If I use the table above and apply the code, the method ``Replace(Replace(Replace(sh.Cells(2, i).Value, " ", ""), "'", ""), ",", ".")`` does not work. That means that if condition ``IsNumeric()`` is not fulfilled, since it returns ``False``. – DanielMc Dec 28 '21 at 13:58
  • @DanielMc I don't think so... Please, create a test Sub containing three lines: `Dim x As String` `x = "36 '000,00"` `Debug.Print Replace(Replace(Replace(x, " ", ""), "'", ""), ",", ".")`. What does it return in `Immediate Window`? Try making `x = "whatever you want testing"` . Your above data example does not have headers for rows and columns. Is the second row the one to be analyzed to determine the problematic columns? The above code processes the second row... – FaneDuru Dec 28 '21 at 14:04
  • I haven't tested your proposition, I found the solution however. If I replace the ``" "`` by ``Chr(160)`` it works :) Thank you so much for your help I marked your solution as the answer, can you maybe add that the ``" "`` should in some cases be replaced by ``Chr(160)`` ? I used the excel `Code()`` function to find it. – DanielMc Dec 28 '21 at 14:13
  • To answer your question concerning the second row: not for all columns. Some columns have data only on row 3400... Is there a more efficient way than to loop through rows? – DanielMc Dec 28 '21 at 14:20
  • @DanielMc I couldn't know that the space is not a space...:) The ASCI code for a space is 32. 160 is something different interpreted as space by Excel conversion... OK, I will adapt the code to do that. – FaneDuru Dec 28 '21 at 14:20
  • @DanielMc The code does not loop through rows... It needs a **elocvent** second row to determine the problematic columns. I asked a clarification question on the issue... So, the second row should contain these strange numbers (in all columns). The code **treats all so found columns and replace on these ones**. But the code calculated the `lastR` (last row) on column A:A. If the problematic columns do have more rows, it builds the range according to this `lastR`... You can use the column you consider the most appropriate, from this point of view... Does that make any sense? – FaneDuru Dec 28 '21 at 14:24
  • @DanielMc Adapted the code to replace `Chr(160`, too. Letting the space replacing, just in case... :) – FaneDuru Dec 28 '21 at 14:26
  • @DanielMc If the second row is not relevant/elocvent, you can insert such a second row and place some relevant strings which can become numbers **in the columns which should become numbers**. Or, can a logic/algorithm to determine which header should be string be formulated? The second row is the key of the problem. I can let the code to run on all cells, but it will take a lot of time only for determining the problematic columns. You should know how is better, from this point of view, I think... – FaneDuru Dec 28 '21 at 14:31
  • Thank you so much for your help :) I don't know what elocvent means. Your assumption is sound, but I noticed only later that some data isn't necessarily on the 2nd row. I'm thinking of changing the selection to something like `` Selection.End(xlDown).Select`` so that it actually gets a non empty cell. Concerning the ``Chr(160`` you are completely right to leave the space as well. I just thought it might be useful to someone else. – DanielMc Dec 28 '21 at 14:39
  • @DanielMc Elocvent should mean that on second row **on problematic columns, which should be numbers to alterate the values, even if numbers in a way to trigger the code first part**. I mean, if there is "5000.50", make it "5 000,50" or only "5000,50" and **do the same for all these columns. Then, if the problematic columns may have more rows than A:A, use such a column to determine the last row (`lastR`). Any selection does not help at all. It only consumes Excel resources, not bringing any benefit. – FaneDuru Dec 28 '21 at 14:46
  • @DanielMc If your range to be processed always starts from the first row you can also use `lastR = sh.usedRange.Columns.Count`. It will surely cover all the necessary rows... – FaneDuru Dec 28 '21 at 14:48