I have a large text file containing many thousands of lines, but a short example that covers the same idea is:

vapor dust -2C pb 
x14 71 hello! 42.42
100,000 lover baby: -2

There is a mixture of integers, alphanumerics, and floats.

ATTEMPT AT SOLN. I've done this to create a single list composed of strings, but I am unable to isolate each cell based on whether it's numeric or alphanumeric:

with open('file.txt', 'r') as f:
    data = f.read().split()
#dirty = [x for x in data if x.isnumeric()]
print(data)

The line #dirty fails.

I have had luck constructing a list-of-lists containing almost all required values using the code as follows:

with open ('benzene_SDS.txt','r') as f:  
    for word in f:
        data= word.split()
        clean = [ x for x in data if x.isnumeric()]            
        res = list(set(data).difference(clean))
        print(clean)

But it doesn't return a single list; it's a list of lists, most of which are blank [].

There was a hint given that using the "try" control statement is useful in solving the problem, but I don't see how to utilize it.

Any help would be greatly appreciated! Thanks.

nikeros
Himi Chan
  • From your own example data, what do you expect the output to be? I.e. should 14 be included? And how about 100,000? – Grismar Jan 17 '22 at 07:20
  • @Grismar 14 is not included, but the 100 would be. From the assignment: "The function should identify numbers like "10823," that have a comma or other character after them # REQ2: Numbers with hyphens (or other non-numeric characters) within them like x14 or 727-8989 should be skipped. " – Himi Chan Jan 17 '22 at 07:44
  • Hi if it is any easier, we can assume that we only need the true integers and floating numbers in the example provided. I am mostly confused as to incorporation of the try statement. Sorry if my knowledge is bad! It is still new to me – Himi Chan Jan 17 '22 at 07:49
  • You should probably add an example of the expected output to your question, like I asked. For example, `'100,000'` would be considered a valid way to write `100000` for many regional settings, while for other regional settings, it might be considered `100.000`. It sounds like you only want entirely numeric values that comply with local regional settings, but values can be separated by both spaces and other separators like commas - it's unclear what would be valid separators though. How about `'123; 45-50, 60!'`? – Grismar Jan 17 '22 at 07:49
  • @Grismar The prompt states the function should be able to identify numbers like "2019," that have a comma or other character after them, and that numbers with non-numeric characters within them should be skipped. So for your example, the function would return [123.0,60.0] – Himi Chan Jan 17 '22 at 07:53

2 Answers

numbers = []
with open('file.txt','r') as f:
    for line in f:  # iterate over lines; f.read() would yield single characters
        words = line.split()
        numbers.extend([word for word in words if word.isnumeric()])

# Print all numbers
print(numbers)

# Print all unique numbers
print(set(numbers))

# Print all unique numbers, converted to floats
print([float(n) for n in set(numbers)])

If you specifically need a list then you can wrap the set with list().

liveware
  • Note that `'100,000'` is not numeric, according to `.isnumeric`; it's unknown if OP wants numbers like `14` included, but of course that would be missed as well. – Grismar Jan 17 '22 at 07:24
  • Hi this is close but I need the full number value not their individual components. Such as 71 or 42.42 as in the example – Himi Chan Jan 17 '22 at 07:46
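
To make the comment about `.isnumeric` concrete, here is a small sketch that runs `str.isnumeric` over a few tokens taken from the question's sample data. The `tokens` list and `results` name are only illustrative; the point is that the method returns `True` only when every character is a numeric character, so decimal points, minus signs, and commas all cause it to return `False`.

```python
# Tokens taken from the sample data in the question.
tokens = ['71', '42.42', '-2', '100,000', 'x14']

# str.isnumeric() is True only if every character is numeric,
# so '.', '-', ',' and letters all make it return False.
results = {tok: tok.isnumeric() for tok in tokens}

print(results)
# {'71': True, '42.42': False, '-2': False, '100,000': False, 'x14': False}
```

This is why an `isnumeric` filter only ever picks up plain unsigned integers such as `71`.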

If you're mainly asking how one would use try to check for validity, this is what you're after:

values = []
with open('benzene_SDS.txt', 'r') as f:
    for word in f.read().split():
        try:
            values.append(float(word))
        except ValueError:
            pass
print(values)

Output:

[71.0, 42.42, -2.0]

However, note that this does not parse '100,000' as either 100 or 100000.

This code would do that:

import locale

locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

values = []
with open('benzene_SDS.txt', 'r') as f:
    for word in f.read().split():
        try:
            values.append(locale.atof(word))
        except ValueError:
            pass

print(values)

Result:

[71.0, 42.42, 100000.0, -2.0]

Note that running the same code with this:

locale.setlocale(locale.LC_ALL, 'nl_NL.UTF-8')

yields a different result:

[71.0, 4242.0, 100.0, -2.0]

This is because the Netherlands uses , as a decimal separator and . as a thousands separator (which basically just gets ignored in 42.42).
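
If the named locale is not installed on a machine, `locale.setlocale` raises `locale.Error`. As a rough locale-free alternative, the sketch below assumes US-style grouping only (commas every three digits, `.` as the decimal point) and removes the commas before calling `float`. The names `parse_us_number`, `extract_numbers` and `GROUPED` are hypothetical helpers introduced for illustration, not part of any library.

```python
import re

# Assumption: US-style grouping, e.g. 1,000 or -12,345.67
GROUPED = re.compile(r'-?\d{1,3}(,\d{3})+(\.\d+)?')

def parse_us_number(word):
    """Return word as a float, accepting well-formed comma grouping."""
    if GROUPED.fullmatch(word):
        word = word.replace(',', '')
    return float(word)  # raises ValueError for anything non-numeric

def extract_numbers(text):
    """Collect every whitespace-separated token that parses as a number."""
    values = []
    for word in text.split():
        try:
            values.append(parse_us_number(word))
        except ValueError:
            pass  # not a number, skip it
    return values

sample = "vapor dust -2C pb\nx14 71 hello! 42.42\n100,000 lover baby: -2"
print(extract_numbers(sample))  # [71.0, 42.42, 100000.0, -2.0]
```

Unlike the `locale` version, this always reads `100,000` as one hundred thousand regardless of the machine's regional settings, which may or may not be what is wanted.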

Grismar
  • If this answers your question, consider ticking the checkmark to turn it green, to indicate your question no longer requires additional answers. – Grismar Jan 17 '22 at 07:56
  • However, note that this solution does not deal with punctuation, so numbers followed by characters other than `.` or `,` (or whatever is locally accepted) would still be ignored, i.e. numbers followed by a question mark or exclamation point. You would likely need a regular expression to parse numbers from that, but it would be substantially more challenging. – Grismar Jan 17 '22 at 07:58
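
For the simpler requirement stated in the question's comments (accept numbers with punctuation after them, skip tokens with non-numeric characters inside them), one possible sketch avoids regular expressions by stripping trailing punctuation before the `try`. The function name `extract_numbers` is hypothetical, and the choice to strip every character in `string.punctuation` is an assumption about what counts as "a comma or other character".

```python
import string

def extract_numbers(text):
    """Return floats for tokens that are numbers once trailing punctuation is removed."""
    numbers = []
    for word in text.split():
        # Drop punctuation on the right only, so '60!' becomes '60'
        # while '45-50,' becomes '45-50' and still fails to parse.
        candidate = word.rstrip(string.punctuation)
        try:
            numbers.append(float(candidate))
        except ValueError:
            pass  # token is not a number, skip it
    return numbers

print(extract_numbers('123; 45-50, 60!'))  # [123.0, 60.0]
```

Be aware that `float` also accepts strings such as `'nan'`, `'inf'` and `'1e5'`, so a stricter check would be needed if words like those can appear in the file.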