2

I'm using Python to work with data from csv files, and after reading csv into an array, my data looks like this:

data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]

I know I can use "for" loop but my data has around millions of records and the "for" loop takes too long time to run, so do you have any idea to do task listed below without using "for"?

  1. Convert value from string to integer in column 1 (ie: "10" -> 10)
  2. Add "http://" in column 3 (ie: "url10" -> "http://url10")
  3. Convert value in column 4 to boolean (ie: "False" -> False)

Thank you a lot!

Nhan Tran
  • 113
  • 2
  • 8
  • 1
    Sounds like a use case for map (https://docs.python.org/3/library/functions.html#map). – jobnz Apr 17 '18 at 20:28

2 Answers2

2

You can use map with a predefined function. map is slightly faster than a list comprehension on larger input:

def clean_data(row):
   val, date, name, url, truthy = row
   return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']


data = [
["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]
print(list(map(clean_data, data)))

Output:

[[10, '2018-03-22 14:38:18.329963', 'name 10', 'http://url10', True], [11, '2018-03-22 14:38:18.433497', 'name 11', 'http://url11', False], [12, '2018-03-22 14:38:18.532312', 'name 12', 'http://url12', False]]
Ajax1234
  • 69,937
  • 8
  • 61
  • 102
0

Pandas should be one option if you don't mind taking some time to load your data to the dataframe first.

Below is one solution using Pandas, then simply compare the time cost with map solution.

import pandas as pd
from datetime import datetime
data = [
    ["10","2018-03-22 14:38:18.329963","name 10","url10","True"],
    ["11","2018-03-22 14:38:18.433497","name 11","url11","False"],
    ["12","2018-03-22 14:38:18.532312","name 12","url12","False"]
]*10000 #multiply 10000 to simulate large data, you can test with one bigger number.

#Pandas
df = pd.DataFrame(data=data, columns=['seq', 'datetime', 'name', 'url', 'boolean'])
pandas_beg = datetime.now()
df['seq'] = df['seq'].astype(int)
df['url'] = 'http://' + df['url']
df['boolean'] = df['boolean'] == 'True'
pandas_end = datetime.now()
print('pandas: ', (pandas_end - pandas_beg))

#map
def clean_data(row):
   val, date, name, url, truthy = row
   return [int(val), date, name, 'http://{}'.format(url), truthy == 'True']
map_beg = datetime.now()
result = list(map(clean_data, data))
map_end = datetime.now()
print('map: ', (map_end - map_beg))

Output:

pandas:  0:00:00.016091
map:  0:00:00.036025
[Finished in 0.997s]
Sphinx
  • 10,519
  • 2
  • 27
  • 45