1

Using .read() to read a file, how would I split on two objects at once? I'm trying to split on commas, and "\n" simultaneously, but when I split on commas first, it turns my string into a list, in which I cannot split again.

Here is the string I'm trying to split:

'States, Total Score, Critical Reading, Mathematics, Writing, Participation (%)\nWashington,1564,524,532,508,41.2000\nNewHampshire,1554,520,524,510,64.0000\nMassachusetts,1547,512,526,509,72.1000\nOregon,1546,523,524,499,37.1000\nVermont,1546,519,512,506,64.0000\nArizona,1544,519,525,500,22.4000\nConnecticut,1536,509,514,513,71.2000\nAlaska,1524,518,515,491,32.7000\nVirginia,1521,512,512,497,56.0000\nCalifornia,1517,501,516,500,37.5000\nNewJersey,1506,495,514,497,69.0000\nMaryland,1502,501,506,495,56.7000\nNorthCarolina,1485,497,511,477,45.5000\nRhodeIsland,1477,494,495,488,60.8000\nIndiana,1476,494,505,477,52.0000\nFlorida,1473,496,498,479,44.7000\nPennsylvania,1473,492,501,480,62.3000\nNevada,1470,496,501,473,25.9000\nDelaware,1469,493,495,481,59.2000\nTexas,1462,484,505,473,41.5000\nNewYork,1461,484,499,478,59.6000\nHawaii,1458,483,505,470,47.1000\nGeorgia,1453,488,490,475,46.5000\nSouthCarolina,1447,484,495,468,40.7000\nMaine,1389,468,467,454,87.1000\nIowa,1798,603,613,582,2.7000\nMinnesota,1781,594,607,580,6.0000\nWisconsin,1778,595,604,579,3.8000\nMissouri,1768,593,595,580,3.6000\nMichigan,1766,585,605,576,3.8000\nSouthDakota,1766,592,603,571,2.0000\nIllinois,1762,585,600,577,4.6700\nKansas,1752,590,595,567,4.7000\nNebraska,1746,585,593,568,3.9000\nNorthDakota,1733,580,594,559,3.4000\nKentucky,1713,575,575,563,5.0000\nTennessee,1712,576,571,565,6.4000\nColorado,1695,568,572,555,14.1000\nArkansas,1684,566,566,552,3.5000\nOklahoma,1684,569,568,547,3.8000\nWyoming,1683,570,567,546,3.6000\nUtah,1674,568,559,547,4.5000\nMississippi,1666,566,548,552,2.2000\nLouisiana,1652,555,550,547,4.0000\nAlabama,1650,556,550,544,5.4000\nNewMexico,1636,553,549,534,7.1000\nOhio,1609,538,548,522,17.2000\nIdaho,1601,543,541,517,14.6000\nMontana,1593,538,538,517,20.0000\nWest Virginia,1522,515,507,500,13.2000\n'

jamylak
  • 128,818
  • 30
  • 231
  • 230
Brian Lee
  • 39
  • 7
  • Your input file looks a lot like a [`CSV`](http://en.wikipedia.org/wiki/Comma-separated_values) file, so it might be simplest just to use Python's [`csv`](http://docs.python.org/2/library/csv.html) module. – Aya May 26 '13 at 17:52

4 Answers4

5

You can use a list comprehension:

>>> strs = 'States, Total Score, Critical Reading, Mathematics, Writing, Participation (%)\nWashington,1564,524,532,508,41.2000\nNewHampshire,1554,520,524,510,64.0000\nMassachusetts,1547,512,526,509,72.1000\nOregon,1546,523,524,499,37.1000\nVermont,1546,519,512,506,64.0000\nArizona,1544,519,525,500,22.4000\nConnecticut,1536,509,514,513,71.2000\nAlaska,1524,518,515,491,32.7000\nVirginia,1521,512,512,497,56.0000\nCalifornia,1517,501,516,500,37.5000\nNewJersey,1506,495,514,497,69.0000\nMaryland,1502,501,506,495,56.7000\nNorthCarolina,1485,497,511,477,45.5000\nRhodeIsland,1477,494,495,488,60.8000\nIndiana,1476,494,505,477,52.0000\nFlorida,1473,496,498,479,44.7000\nPennsylvania,1473,492,501,480,62.3000\nNevada,1470,496,501,473,25.9000\nDelaware,1469,493,495,481,59.2000\nTexas,1462,484,505,473,41.5000\nNewYork,1461,484,499,478,59.6000\nHawaii,1458,483,505,470,47.1000\nGeorgia,1453,488,490,475,46.5000\nSouthCarolina,1447,484,495,468,40.7000\nMaine,1389,468,467,454,87.1000\nIowa,1798,603,613,582,2.7000\nMinnesota,1781,594,607,580,6.0000\nWisconsin,1778,595,604,579,3.8000\nMissouri,1768,593,595,580,3.6000\nMichigan,1766,585,605,576,3.8000\nSouthDakota,1766,592,603,571,2.0000\nIllinois,1762,585,600,577,4.6700\nKansas,1752,590,595,567,4.7000\nNebraska,1746,585,593,568,3.9000\nNorthDakota,1733,580,594,559,3.4000\nKentucky,1713,575,575,563,5.0000\nTennessee,1712,576,571,565,6.4000\nColorado,1695,568,572,555,14.1000\nArkansas,1684,566,566,552,3.5000\nOklahoma,1684,569,568,547,3.8000\nWyoming,1683,570,567,546,3.6000\nUtah,1674,568,559,547,4.5000\nMississippi,1666,566,548,552,2.2000\nLouisiana,1652,555,550,547,4.0000\nAlabama,1650,556,550,544,5.4000\nNewMexico,1636,553,549,534,7.1000\nOhio,1609,538,548,522,17.2000\nIdaho,1601,543,541,517,14.6000\nMontana,1593,538,538,517,20.0000\nWest Virginia,1522,515,507,500,13.2000\n'
>>> [ y for x in strs.splitlines() for y in x.split(",")]
['States', ' Total Score', ' Critical Reading', ' Mathematics', ' Writing', ' Participation (%)', 'Washington', '1564', '524', '532', '508', '41.2000', 'NewHampshire', '1554', '520', '524', '510', '64.0000', 'Massachusetts', '1547', '512', '526', '509', '72.1000', 'Oregon', '1546', '523', '524', '499', '37.1000', 'Vermont', '1546', '519', '512', '506', '64.0000', 'Arizona', '1544', '519', '525', '500', '22.4000', 'Connecticut', '1536', '509', '514', '513', '71.2000', 'Alaska', '1524', '518', '515', '491', '32.7000', 'Virginia', '1521', '512', '512', '497', '56.0000', 'California', '1517', '501', '516', '500', '37.5000', 'NewJersey', '1506', '495', '514', '497', '69.0000', 'Maryland', '1502', '501', '506', '495', '56.7000', 'NorthCarolina', '1485', '497', '511', '477', '45.5000', 'RhodeIsland', '1477', '494', '495', '488', '60.8000', 'Indiana', '1476', '494', '505', '477', '52.0000', 'Florida', '1473', '496', '498', '479', '44.7000', 'Pennsylvania', '1473', '492', '501', '480', '62.3000', 'Nevada', '1470', '496', '501', '473', '25.9000', 'Delaware', '1469', '493', '495', '481', '59.2000', 'Texas', '1462', '484', '505', '473', '41.5000', 'NewYork', '1461', '484', '499', '478', '59.6000', 'Hawaii', '1458', '483', '505', '470', '47.1000', 'Georgia', '1453', '488', '490', '475', '46.5000', 'SouthCarolina', '1447', '484', '495', '468', '40.7000', 'Maine', '1389', '468', '467', '454', '87.1000', 'Iowa', '1798', '603', '613', '582', '2.7000', 'Minnesota', '1781', '594', '607', '580', '6.0000', 'Wisconsin', '1778', '595', '604', '579', '3.8000', 'Missouri', '1768', '593', '595', '580', '3.6000', 'Michigan', '1766', '585', '605', '576', '3.8000', 'SouthDakota', '1766', '592', '603', '571', '2.0000', 'Illinois', '1762', '585', '600', '577', '4.6700', 'Kansas', '1752', '590', '595', '567', '4.7000', 'Nebraska', '1746', '585', '593', '568', '3.9000', 'NorthDakota', '1733', '580', '594', '559', '3.4000', 'Kentucky', '1713', '575', '575', '563', '5.0000', 'Tennessee', '1712', '576', '571', '565', '6.4000', 'Colorado', '1695', '568', '572', '555', '14.1000', 'Arkansas', '1684', '566', '566', '552', '3.5000', 'Oklahoma', '1684', '569', '568', '547', '3.8000', 'Wyoming', '1683', '570', '567', '546', '3.6000', 'Utah', '1674', '568', '559', '547', '4.5000', 'Mississippi', '1666', '566', '548', '552', '2.2000', 'Louisiana', '1652', '555', '550', '547', '4.0000', 'Alabama', '1650', '556', '550', '544', '5.4000', 'NewMexico', '1636', '553', '549', '534', '7.1000', 'Ohio', '1609', '538', '548', '522', '17.2000', 'Idaho', '1601', '543', '541', '517', '14.6000', 'Montana', '1593', '538', '538', '517', '20.0000', 'West Virginia', '1522', '515', '507', '500', '13.2000']

If you want a list of lists containing each line split at ,:

>>> [x.split(",") for x in strs.splitlines()]
[['States', ' Total Score', ' Critical Reading', ' Mathematics', ' Writing', ' Participation (%)'], ['Washington', '1564', '524', '532', '508', '41.2000'], ['NewHampshire', '1554', '520', '524', '510', '64.0000'], ['Massachusetts', '1547', '512', '526', '509', '72.1000'], ['Oregon', '1546', '523', '524', '499', '37.1000'], ['Vermont', '1546', '519', '512', '506', '64.0000'], ['Arizona', '1544', '519', '525', '500', '22.4000'], ['Connecticut', '1536', '509', '514', '513', '71.2000'], ['Alaska', '1524', '518', '515', '491', '32.7000'], ['Virginia', '1521', '512', '512', '497', '56.0000'], ['California', '1517', '501', '516', '500', '37.5000'], ['NewJersey', '1506', '495', '514', '497', '69.0000'], ['Maryland', '1502', '501', '506', '495', '56.7000'], ['NorthCarolina', '1485', '497', '511', '477', '45.5000'], ['RhodeIsland', '1477', '494', '495', '488', '60.8000'], ['Indiana', '1476', '494', '505', '477', '52.0000'], ['Florida', '1473', '496', '498', '479', '44.7000'], ['Pennsylvania', '1473', '492', '501', '480', '62.3000'], ['Nevada', '1470', '496', '501', '473', '25.9000'], ['Delaware', '1469', '493', '495', '481', '59.2000'], ['Texas', '1462', '484', '505', '473', '41.5000'], ['NewYork', '1461', '484', '499', '478', '59.6000'], ['Hawaii', '1458', '483', '505', '470', '47.1000'], ['Georgia', '1453', '488', '490', '475', '46.5000'], ['SouthCarolina', '1447', '484', '495', '468', '40.7000'], ['Maine', '1389', '468', '467', '454', '87.1000'], ['Iowa', '1798', '603', '613', '582', '2.7000'], ['Minnesota', '1781', '594', '607', '580', '6.0000'], ['Wisconsin', '1778', '595', '604', '579', '3.8000'], ['Missouri', '1768', '593', '595', '580', '3.6000'], ['Michigan', '1766', '585', '605', '576', '3.8000'], ['SouthDakota', '1766', '592', '603', '571', '2.0000'], ['Illinois', '1762', '585', '600', '577', '4.6700'], ['Kansas', '1752', '590', '595', '567', '4.7000'], ['Nebraska', '1746', '585', '593', '568', '3.9000'], ['NorthDakota', '1733', '580', '594', '559', '3.4000'], ['Kentucky', '1713', '575', '575', '563', '5.0000'], ['Tennessee', '1712', '576', '571', '565', '6.4000'], ['Colorado', '1695', '568', '572', '555', '14.1000'], ['Arkansas', '1684', '566', '566', '552', '3.5000'], ['Oklahoma', '1684', '569', '568', '547', '3.8000'], ['Wyoming', '1683', '570', '567', '546', '3.6000'], ['Utah', '1674', '568', '559', '547', '4.5000'], ['Mississippi', '1666', '566', '548', '552', '2.2000'], ['Louisiana', '1652', '555', '550', '547', '4.0000'], ['Alabama', '1650', '556', '550', '544', '5.4000'], ['NewMexico', '1636', '553', '549', '534', '7.1000'], ['Ohio', '1609', '538', '548', '522', '17.2000'], ['Idaho', '1601', '543', '541', '517', '14.6000'], ['Montana', '1593', '538', '538', '517', '20.0000'], ['West Virginia', '1522', '515', '507', '500', '13.2000']]

Instead of generating the whole list at once you can use itertools.chain to get elements lazily (Or even better if you iterate over one line at once, prefer @Martijn Pieters's solution in that case):

>>> from itertools import chain
>>> for elem in chain(*(x.split(",") for x in strs.splitlines())):
...     print elem
...     
States
 Total Score
 Critical Reading
 Mathematics
 Writing
 Participation (%)
Washington
...
Community
  • 1
  • 1
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • Maybe I didn't understand right, but OP doesn't seem to want a flat list... – JBernardo May 26 '13 at 17:47
  • 1
    Basically, I just wanted to return an entire string with all elements split into strings and commas; all \n being removed and commas so the list would look something like this: ['States', 'Total Score', 'Critical Reading', etc.] and then for any cases where it looked like this: [1524,518,515,491,32.7000\nVirginia] I would be able to return a list that looked like this: ['1524', '518', '515', '491', '32.7000', 'Virginia']. – Brian Lee May 26 '13 at 17:50
  • @BrianLee My answer has covered both cases. – Ashwini Chaudhary May 26 '13 at 17:52
  • @AshwiniChaudhary Oh sorry, I didn't see your answer and I wanted to respond to JBernardo to clear everything up. Thank you very much. – Brian Lee May 26 '13 at 17:53
  • @BrianLee then this one is the right answer. +1 – JBernardo May 26 '13 at 17:54
3

Don't read the whole file in one go, read per line, then split:

with open(filepath) as f:
    for line in f:
        print line.strip().split(',')

You could also first split on newlines, then loop and split on commas:

lines = [line.split(',') for line in somestring.splitlines()]

But for comma-separated files, your best bet is to use the csv module:

import csv

with open(filepath, 'rb') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print row

This gives you the rows as:

['States', ' Total Score', ' Critical Reading', ' Mathematics', ' Writing', ' Participation (%)']
['Washington', '1564', '524', '532', '508', '41.2000']
['NewHampshire', '1554', '520', '524', '510', '64.0000']

Since you have a first row with headers, you could use a DictReader as well and get dictionaries mapping headers to values:

with open(filepath, 'rb') as f:
    reader = csv.DictReader(f, delimiter=',')
    for row in reader:
        print row
        # address columns as: row['States'], row['Total Score']

which outputs rows as:

{' Writing': '508', ' Total Score': '1564', ' Critical Reading': '524', 'States': 'Washington', ' Mathematics': '532', ' Participation (%)': '41.2000'}
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
2

there's re.split for multiple characters split:

import re
re.split("\n| ","this is\na short\ntest...")
>>> ['this', 'is', 'a', 'short', 'test...']
zenpoy
  • 19,490
  • 9
  • 60
  • 87
0

you could use the split() from the re function where you are able to define a regex for the splitting

look at this: python split string based on regular expression

Community
  • 1
  • 1
pypat
  • 1,096
  • 1
  • 9
  • 19