According to Wikipedia:

> In computability theory, Rice's theorem states that all non-trivial, semantic properties of programs are undecidable. A semantic property is one about the program's behavior (for instance, does the program terminate for all inputs), unlike a syntactic property (for instance, does the program contain an if-then-else statement). A property is non-trivial if it is neither true for every computable function, nor false for every computable function.

A **syntactic property** asks a question about the text of a computer program, like "is there a while loop?"
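For instance, a check for this syntactic property is easy to write. Here is a minimal Python sketch (the function name is my own, and it uses the standard `ast` module) that decides whether a program's source contains a while loop:

```python
import ast

def contains_while(source: str) -> bool:
    """Decide a *syntactic* property: does the source contain a while loop?

    This always terminates, because it only inspects the program's text;
    it never runs the program.
    """
    tree = ast.parse(source)
    return any(isinstance(node, ast.While) for node in ast.walk(tree))
```

Note that this check is trivially decidable: it finishes in time proportional to the size of the source, regardless of what the program would do when run.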

A **semantic property** asks a question about the behavior of the computer program. For example, does the program loop forever? (This is the halting problem, which is undecidable: in general, there is no algorithm that can tell you whether an arbitrary program halts on a given input.)

So, Rice's theorem shows that all non-trivial semantic properties are undecidable (including whether or not a program loops forever).
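The proof idea can be sketched concretely. The standard reduction builds, from any program `p` and input `x`, a wrapper program whose behavior has a given non-trivial semantic property if and only if `p` halts on `x`; so a decider for that property would solve the halting problem. A minimal Python sketch of just the wrapper construction (the names `build_wrapper`, `p`, `g` are illustrative, not from any source):

```python
def build_wrapper(p, x, g):
    """Rice-theorem reduction gadget (proof sketch only).

    Returns a program q that first runs p on x (discarding the result),
    then behaves exactly like g. If p halts on x, q computes the same
    function as g; if p diverges on x, q computes the nowhere-defined
    function. So any decider for a semantic property that separates g
    from the empty function would also decide whether p halts on x.
    """
    def q(y):
        p(x)          # diverges here iff p does not halt on x
        return g(y)
    return q

# With a program that clearly halts, the wrapper behaves like g:
halting_p = lambda n: n * n
g = lambda y: y + 1
q = build_wrapper(halting_p, 0, g)
```

Of course, no code can demonstrate the divergent case by running it; the point is only that the construction itself is mechanical.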

AI is a computer program (or a collection of computer programs). These programs, like all computer programs, can be modeled by a Turing machine (by the Church-Turing thesis).

Is safety (of Turing machines, including AI) a non-trivial semantic property? If so, is AI safety undecidable? In other words, can we determine whether an AI program (or agent) is safe?

I believe that answering this doesn't require formally defining safety.

I think an answer to your question may actually depend on the definition of "safe AI". For example, if you define a safe AI as any program that can run on a TM that does not contain a while loop (of course, this is a stupid example), then "AI safety" is a syntactic property. – nbro – 2020-04-03T18:39:55.210

nbro♦: Is there any way to prove AI safety if "safe" is not a syntactical definition? – Jared – 2020-04-04T00:48:29.313

I would need to think about it. I had never thought about it. – nbro – 2020-04-04T01:28:52.737