5

e.g. there is "2,881,423", how to remove the "," from it. I have millions data needs to be done. Is it possible to do a batch action? So any tools I can use either for PC for Mac.

"Position","Value",
"1","1",
"2","1",
"3","1",
"4","2",
"5","2",

...

"2,881,423","19",
"2,881,424","22",
"2,881,425","23",
"2,881,426","23",
"2,881,427","25",
"2,881,428","25",
"2,881,429","25",

...

Like above are some pieces from the csv.

user1987618
  • 51
  • 1
  • 1
  • 3
  • What you are going to do with this data ? Will you import it in database. OR any programming language you can may that can help you. Please let us know about that. – Lucifer Jan 17 '13 at 16:11
  • I need to analysis in another software called R http://cran.r-project.org It doesn't allow to have numbers which have commas inside. – user1987618 Jan 17 '13 at 16:15

3 Answers3

3

In Python:

import csv
with open("myfile.csv", "rb") as infile, open("output.csv", "wb") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for row in reader:
        writer.writerow(item.replace(",", "") for item in row)
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
  • I'm not familiar with PHP at all, but this will remove ALL commas from the string, while we need to remove only 1000 separators... or I just have no clue? – Peter L. Jan 21 '13 at 17:06
  • @PeterL.: No, this will only remove the thousands separators. I'm using Python's `csv` module to read the CSV file (because quoted CSV fields may contain field separators, newlines, escaped quotes etc.), so my regex only has to deal with the field contents. Then I'm using it again to write a properly formatted CSV file back to disk. – Tim Pietzcker Jan 21 '13 at 17:49
  • Oh I see, PHP rules) The same job will do my VBA code, but you need only 4 PHP lines instead of mine whole page)) – Peter L. Jan 21 '13 at 18:20
  • 1
    @PeterL. I beg to differ: Python rules! :) – Tim Pietzcker Jan 21 '13 at 18:23
  • Won't this also remove commas in other strings ("Korea, Republic of", for instance)? Peter L's answer goes further and specifies that it should only remove the comma when it separates to digits. We could do this in Python, but it would not be quite as compact as this answer. – randy May 22 '16 at 23:03
  • @randy: I may have misunderstood Peter's comment - my version does remove all commas from each CSV field (but of course not the separators from the source file). It does not check whether those commas are between digits (which seems to be OK with the original set of data). – Tim Pietzcker May 23 '16 at 07:47
2

The following code will do the job - it will loop through all the files with given mask located in the folder:

Sub RemoveCommas()

Dim RegX_Comma As Object
'
Dim FileStream As Object
Dim FileContent As String
Dim SourceFolder As String
Dim FileName As String
'
Set RegX_Comma = CreateObject("VBScript.RegExp")
RegX_Comma.Pattern = "(?<=\d),(?=\d)" 'Comma between any digits
RegX_Comma.IgnoreCase = True
RegX_Comma.Global = True

Set FileStream = CreateObject("ADODB.Stream")
SourceFolder = "D:\DOCUMENTS\" 'Must be specified with trailing "\"

FileName = Dir(InputFolder & "*.txt") 'Specify ANY mask using wildcards, e.g. "*.csv*
Do While FileName <> ""

    FileStream.Open
    FileStream.Charset = "ASCII" 'Change encoding as required
    FileStream.LoadFromFile (SourceFolder & FileName)
    FileContent = RegX_Comma.Replace(FileStream.ReadText, "")
    FileStream.Position = 0
    FileStream.WriteText FileContent
    FileStream.SetEOS
    FileStream.SaveToFile SourceFolder & FileName, 2 'Will overwrite the existing file
    FileStream.Close

FileName = Dir
Loop

End Sub

Make the required modifications to the code according to the inline comments.

Good luck!)

Maks Gordeev
  • 351
  • 3
  • 16
Peter L.
  • 7,276
  • 5
  • 34
  • 53
  • I think you mean `(?<=\d),(?=\d)`. – Tim Pietzcker Jan 21 '13 at 18:23
  • @TimPietzcker Frankly RegExes are not yet my strength, but I tested `(?!\d),(?=\d)` with sample file (link in the answer) and it did the right job. – Peter L. Jan 21 '13 at 18:37
  • 1
    `(?!\d)` asserts that the *next* character *isn't* a digit, which is always True since your regex requires the next character to be a comma. You should check that the *previous* character *is* a digit. – Tim Pietzcker Jan 21 '13 at 18:50
  • @TimPietzcker Thanks, updated the code. Could you please recommend a brief page for such cases? There are plenty of online regex manuals, but they're either brief or too complicated) – Peter L. Jan 22 '13 at 06:01
0

Since your goal is to use the data in R, you could just do the substitution after you have read the data into R:

df <- Path/To/File.csv
df$varname <- as.numeric(gsub(",", "", df$varname))

where df is your data frame and varname is the name of the variable. This will not check for whether the comma is between two digits, so you will want to make sure you only pass the variables that you want to be numeric to this and not any string columns where the comma is actually part of the data.

Here is a similar question asking how to solve the problem from within R:

How to read data when some numbers contain commas as thousand separator?

Community
  • 1
  • 1
randy
  • 1,031
  • 10
  • 20