0

I have several rows of text. The first row is a header row, and each subsequent line represents the fields of data, each value is separated with a comma. Within each line are one to three dollar values, ranging from single digit dollar values ($4.50) to triple digit ($100,000.34). They are also surrounded by quotes.

206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683

I need to eliminate the quotations and dollar sign for the money values, as well as the comma inside. The period separator for the decimal value needs to stay, so "$6,801.56" becomes 6801.56

I've used regex to eliminate the dollar sign as well as quotations--

with open("datafile.csv", "r") as file:
    data = file.readlines()

for i in data:
    i = re.sub('[$"]', '', i)

which then makes the data look like 7545245,6,801.56,3545647 so if I split by a comma, it cuts larger values in two.

['206360941,5465685679,4,073.77,567845676547,88,457.21,34589309683']

I thought about splitting by quotations, doing some more regex and rejoining with .join() but it turns out that only the currency values with a comma contain quotations, the smaller values with no comma do not.

Also, I know I can use re.findall(r'\$\d{1,3}\,\d\d\d\.\d\d', i) to draw out the number format, if I print it, it will output a list like [$100,351.35]

I am just not sure what to do with it after that.

4 Answers4

1

This seems to work:

>>> data = '206360941,5465685679,"$4,073.77",567845676547,"$88,457.21",34589309683'
>>> re.findall(r'"\$((\d+),)*(\d+)(\.\d+)"', data)
[('4,', '4', '073', '.77'), ('88,', '88', '457', '.21')]
>>> re.sub(r'"\$((\d+),)*(\d+)(\.\d+)"', r'\2\3\4', data)
'206360941,5465685679,4073.77,567845676547,88457.21,34589309683'

The idea is to grab the data before and after the decimal point, keeping the latter as well. Then, given that the first group is identical to the second one, just replace with the contents of all groups except the first one. If there are more than one comma, you'll probably need a more dynamic approach.

That's why you need this ((\d+),)* group, which captures a subgroup and the comma. You should replace this whole group with the subgroup.

ForceBru
  • 43,482
  • 10
  • 63
  • 98
0

I'd recommend using csv.reader (or csv.DictReader if you want to do other processing on each column) to read the file as this will parse each column automatically. Once you read the file, you can do your regex on each column so no need to split the line yourself. The default delimiter and quotechar for csv.reader is as you would need, I believe.

tayfun
  • 3,065
  • 1
  • 19
  • 23
  • I thought about doing it this way. This may have been a better approach. I'll take some time to play around with it, I could use the practice. – BusterTheFatCat Jan 31 '19 at 15:43
0

Did you try the module locale? As in How do I use Python to convert a string to a number if it has commas in it as thousands separators? It'll be easier than regex.

Demi-Lune
  • 1,868
  • 2
  • 15
  • 26
0

First of all you could go about deleting all commas that are inside of quotes.

Pseudo code might look like:

s = Your String
insideQuotes = false;
charIndex = 0;
while (c = nextChar() != null){
    if(c == "\""){
        insideQuotes = !insideQuotes;
    }else if(insideQuotes && c == ","){
        s.removeAt(charIndex, "");
        charIndex--;
    }
}

Now that there are no more commas inside the quotes, you only need to remove the dollar signs and the quotes themselves!

Hope it helps!